Outlier Detection Using Numerical Summary
1. Key Summary Statistics
A numerical summary typically includes the following statistics:
- Mean (μ): The average of all data points.
- Median (Q2): The middle value in the dataset when sorted.
- Interquartile Range (IQR): The difference between the 75th percentile (Q3) and the 25th percentile (Q1).
- Standard Deviation (σ): A measure of the spread of data points from the mean.
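As a minimal sketch of how these statistics might be computed, the snippet below uses Python's standard `statistics` module. The sample `data` is hypothetical and chosen only for illustration, and the choice of the "inclusive" quartile method is an assumption; other quartile conventions can give slightly different values for Q1 and Q3.

```python
import statistics

# Hypothetical sample data, chosen only for illustration.
data = [4, 8, 15, 16, 23, 42]

mean = statistics.mean(data)        # average of all data points
median = statistics.median(data)    # middle value of the sorted data

# quantiles(n=4) returns the three cut points Q1, Q2, Q3.
# method="inclusive" is an assumed convention; the default
# ("exclusive") would yield different quartiles for small samples.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                       # interquartile range

# Population standard deviation; use statistics.stdev for a sample.
std_dev = statistics.pstdev(data)

print(f"mean={mean}, median={median}, Q1={q1}, Q3={q3}, "
      f"IQR={iqr}, std_dev={std_dev:.2f}")
```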
2. Identifying Outliers Based on IQR
The IQR method is a commonly used technique for identifying outliers. A value is flagged as a potential outlier when it lies more than 1.5 × IQR below Q1 or more than 1.5 × IQR above Q3.
Formula:
- Lower Bound: $Q1 - 1.5 \times IQR$
- Upper Bound: $Q3 + 1.5 \times IQR$
Values outside these bounds are considered potential outliers.
Example: Let’s assume the following numerical summary of a dataset:
- Q1 (25th percentile): 10
- Q3 (75th percentile): 20
- IQR (Q3 – Q1): 10
Using the IQR method:
- Lower Bound: $10 - 1.5 \times 10 = -5$
- Upper Bound: $20 + 1.5 \times 10 = 35$
Therefore, any data point below -5 or above 35 would be considered a potential outlier.
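The calculation above can be sketched in Python as follows. The function name `iqr_bounds`, its parameter `k`, and the list `values` are hypothetical and introduced only for illustration; the quartiles 10 and 20 come from the example.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return the (lower, upper) bounds of the IQR rule.

    k is the multiplier applied to the IQR; 1.5 is the value
    used in the text.
    """
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Quartiles from the worked example: Q1 = 10, Q3 = 20.
lower, upper = iqr_bounds(10, 20)

# Hypothetical observations, used only to show the filtering step.
values = [-8, 3, 12, 18, 35, 36, 50]

# A value is a potential outlier if it lies strictly outside the bounds.
outliers = [v for v in values if v < lower or v > upper]

print(lower, upper)   # -5.0 35.0
print(outliers)       # [-8, 36, 50]
```

Note that 35 sits exactly on the upper bound and is not flagged, because the rule applies only to values outside the bounds.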
3. Identifying Outliers Based on Standard Deviation
Another method uses the standard deviation of the dataset and is most appropriate when the data are approximately normally distributed. Outliers are defined as data points that lie more than a chosen number of standard deviations from the mean, typically 2 or 3. For normally distributed data, about 95% of values fall within 2 standard deviations of the mean and about 99.7% fall within 3.
Formula:
- Lower Bound: $\mu - 2\sigma$ (for 2 standard deviations)
- Upper Bound: $\mu + 2\sigma$
Example: Suppose the mean is 50 and the standard deviation is 5. Using 2 standard deviations:
- Lower Bound: $50 - 2 \times 5 = 40$
- Upper Bound: $50 + 2 \times 5 = 60$
Any data point below 40 or above 60 would be considered a potential outlier.
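The same check can be sketched in Python. The function name `std_dev_bounds`, its parameter `k`, and the list `values` are hypothetical and introduced only for illustration; the mean of 50 and standard deviation of 5 come from the example.

```python
def std_dev_bounds(mean, std_dev, k=2):
    """Return (lower, upper) bounds at k standard deviations from the mean."""
    return mean - k * std_dev, mean + k * std_dev


# Summary values from the worked example: mean = 50, standard deviation = 5.
lower, upper = std_dev_bounds(50, 5)

# Hypothetical observations, used only to show the filtering step.
values = [38, 40, 47, 52, 60, 63]

# A value is a potential outlier if it lies strictly outside the bounds.
outliers = [v for v in values if v < lower or v > upper]

print(lower, upper)   # 40 60
print(outliers)       # [38, 63]

# The stricter 3-standard-deviation rule widens the bounds.
lower3, upper3 = std_dev_bounds(50, 5, k=3)
print(lower3, upper3)  # 35 65
```

With `k=3` none of these hypothetical values would be flagged, which shows how strongly the choice of threshold affects the result. Because extreme values inflate both the mean and the standard deviation, this rule may miss outliers that the IQR method would catch.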







