Outlier
Outliers are the extreme values in the data. If the value of a variable is too large or too small, i.e., if it falls beyond a certain acceptable range, we consider that value to be an outlier. A quick way to find outliers in the data is to use a box plot.
Outlier Treatment
The treatment of the outlier values/cases is called Outlier Treatment. Typically outlier treatment is done by capping/flooring.
- Capping is replacing all values on the higher side that exceed a certain theoretical maximum, or upper control limit (UCL), with the UCL value. The statistical formula for UCL is UCL = Q3 + 1.5 * IQR
- Flooring is replacing all values that fall below a certain theoretical minimum, or lower control limit (LCL), with the LCL value. The statistical formula for LCL is LCL = Q1 - 1.5 * IQR, where IQR = Q3 - Q1 is the interquartile range.
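The two formulas above can be sketched in a few lines of pandas. This is a minimal illustration, not the code used later in this post: the helper name iqr_limits and the toy balance values are assumptions made up for the example.

```python
import pandas as pd

def iqr_limits(values: pd.Series, k: float = 1.5):
    """Return (LCL, UCL) using LCL = Q1 - k*IQR and UCL = Q3 + k*IQR."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Toy data (hypothetical): one very low and one very high value
balance = pd.Series([-900.0, 100.0, 120.0, 130.0, 150.0, 160.0, 180.0, 200.0, 5000.0])

lcl, ucl = iqr_limits(balance)

# clip() floors values below LCL and caps values above UCL in one step
balance_treated = balance.clip(lower=lcl, upper=ucl)

print("LCL =", lcl, "| UCL =", ucl)
print(balance_treated.tolist())
```

Values inside the limits are left unchanged; only the two extreme values are pulled back to the nearest limit.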
In some instances you may want to delete the record that has an outlier value. However, deleting a record should be considered only when the other outlier treatment options are not acceptable.
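If deletion does turn out to be the only acceptable option, it might look like the sketch below. The toy frame and its Cust_ID column are hypothetical; only the Balance column name comes from this post.

```python
import pandas as pd

# Toy frame (hypothetical values and Cust_ID column)
df = pd.DataFrame({
    "Cust_ID": [1, 2, 3, 4, 5, 6, 7, 8, 9],
    "Balance": [-900.0, 100.0, 120.0, 130.0, 150.0, 160.0, 180.0, 200.0, 5000.0],
})

q1, q3 = df["Balance"].quantile([0.25, 0.75])
iqr = q3 - q1
lcl, ucl = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only rows whose Balance lies within [LCL, UCL]; the rest are dropped
within_limits = df["Balance"].between(lcl, ucl)
df_trimmed = df[within_limits].copy()

print(f"Dropped {len(df) - len(df_trimmed)} of {len(df)} records")
```

Writing the result to a new frame keeps the original records available, so the number of rows lost can be reviewed before committing to deletion.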
Note: This blog is a continuation of our Logistic Regression Blog Series
Python code | Finding Outlier using Box Plot
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

plt.figure(figsize=(9,5))
boxplot = sns.boxplot(x="Balance", data=dev, showmeans=True, width=0.5, palette="colorblind")
plt.title("Box Plot of Balance", fontsize=20)
plt.xlabel("Balance", fontsize=15)
From the box plot, we observe that there are outlier values above 500000.
We compute the Upper Control Limit using the formula: UCL = Q3 + 1.5 * IQR
Python code | Compute UCL
# Getting Upper Control Limit value for Balance
Q1, Q3 = dev["Balance"].quantile([0.25, 0.75])
UCL = Q3 + 1.5 * (Q3 - Q1)
print("UCL = ", round(UCL))
Python code | Capping of Outlier Values
# If value above 500000 then replace by 500000
# Best practice: when you do outlier treatment, you should create a new variable
dev["Bal_cap"] = dev["Balance"].map(lambda x: 500000 if x > 500000 else x)
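After capping, it is worth checking how many records were affected. The sketch below uses a small stand-in for the dev data with made-up values, and swaps the map/lambda call for pandas' vectorised clip, which produces the same capped column here.

```python
import pandas as pd

# Toy stand-in for the post's `dev` data (values are hypothetical)
dev = pd.DataFrame({"Balance": [1200.0, 48000.0, 250000.0, 510000.0, 925000.0]})

cap_value = 500000  # threshold read off the box plot in this post

# Vectorised alternative to the map/lambda version above
dev["Bal_cap"] = dev["Balance"].clip(upper=cap_value)

# Count the records that were changed and compare the maxima
n_capped = int((dev["Balance"] > cap_value).sum())
print("Records capped:", n_capped)
print("Max before:", dev["Balance"].max(), "| max after:", dev["Bal_cap"].max())
```

Because the capped values go into the new Bal_cap column, the original Balance column stays available for this kind of before-and-after comparison.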
R code for Outlier Treatment
The equivalent R code for the Python steps above is given below.
# Box Plot
boxplot(dev$Balance,
        main = "Box Plot of Balance",
        xlab = "Balance",
        col = "royalblue",
        border = "black",
        horizontal = TRUE)

# UCL - Upper Control Limit
Q = quantile(dev$Balance, c(0.25, 0.75))
Q1 = Q[1]
Q3 = Q[2]
UCL = Q3 + 1.5 * (Q3 - Q1)
cat("UCL =", round(UCL, 0))

# Capping the Balance variable
# Creating new variable Bal_cap
dev$Bal_cap = ifelse(dev$Balance > 500000, 500000, dev$Balance)