We have discussed various missing value imputation techniques in previous blogs. In this blog, we will do the missing value imputation of a continuous variable using mean, mode, and business logic approach.
Dataset
We will continue with the development sample as created in the training and testing step. The continuous variable, Holding Period, has missing values in it. Let us check the missing.
print("No. of rows =", dev.shape[0]) print("Occupation count =", dev["Holding_Period"].count()) print("No. of rows with missing values =", (dev.shape[0] - dev["Holding_Period"].count()))
No. of rows = 10000 Occupation count = 9484 No. of rows with missing values = 516
Imputation using Mean/Median Value
The simplest approach of imputing a continuous variable is to replace all missing values by Mean or Median.
print("Mean Holding Period = ", dev["Holding_Period"].mean().round(1)) print("Median Holding Period = ", dev["Holding_Period"].median().round(1))
Mean Holding Period = 15.3 Median Holding Period = 15.0
Python code to replace the missing by Mean / Median
dev["HP_Mean_Imputation"] = dev["Holding_Period"].fillna( dev["Holding_Period"].mean().round(1)) dev["HP_Median_Imputation"] = dev["Holding_Period"].fillna( dev["Holding_Period"].median().round(1))
Missing Value Imputation using Business Logic
Reminder – The data being analyzed is Personal Loans Cross-Sell data. As a marketer, I am more interested in the segment-wise response rate. I would like to see the response rate in the missing value segment and compare it with other segments. Accordingly, then impute the value of missing segment with the segment value where the response rate is almost the same. Let us see this using Python code.
In the code, we are converting the continuous variable Holding Period into a categorical variable by coarse binning. In the coarse bins, we have replaced the missing values by -9999.
dev['coarse_bins'] = pd.qcut(dev['Holding_Period'], 5, labels=False) Rrate = dev.fillna(-9999).groupby('coarse_bins').apply(lambda x: pd.Series([ np.min(x['Holding_Period']), np.max(x['Holding_Period']), np.mean(x['Holding_Period']).round(1), np.size(x['Holding_Period']), np.sum(x['Target']), np.size(x['Target'][x['Target']==0]), ], index=(["min_HP","max_HP","avg_HP", "cnt","cnt_resp","cnt_non_resp"]) )).reset_index() Rrate["rrate"]=Rrate["cnt_resp"]*100/Rrate["cnt"] Rrate["rrate"] = Rrate["rrate"].round(2) Rrate
coarse_bins | min_HP | max_HP | avg_HP | cnt | cnt_resp | cnt_non_resp | rrate | |
---|---|---|---|---|---|---|---|---|
0 | -9999.0 | -9999.0 | -9999.0 | -9999.0 | 516.0 | 19.0 | 497.0 | 3.68 |
1 | 0.0 | 0.0 | 6.0 | 3.0 | 2119.0 | 183.0 | 1936.0 | 8.64 |
2 | 1.0 | 7.0 | 12.0 | 9.5 | 1788.0 | 94.0 | 1694.0 | 5.26 |
3 | 2.0 | 13.0 | 18.0 | 15.5 | 1873.0 | 83.0 | 1790.0 | 4.43 |
4 | 3.0 | 19.0 | 25.0 | 22.0 | 1999.0 | 50.0 | 1949.0 | 2.50 |
5 | 4.0 | 26.0 | 31.0 | 28.5 | 1705.0 | 31.0 | 1674.0 | 1.82 |
Key takeaways from the above table
- Response Rate of missing value segment is 3.68%
- The segment with an average Holding Period of 15.5 has a response rate of 4.43%
- Imputing the missing value by 15 would imply that we will merge the missings in a segment having a relatively higher response rate (4.43%) as against the missing value response rate of 3.68%. This can lead to model overfitting.
- The response rate of the segment having average HP as 15.5 is 4.43% and the response rate of the segment having average HP as 22 is 2.50%. The response rate of the missing segment is 3.68% and is between the segments having a 4.43% and 2.50% response rate.
HP | RR |
15.5 | 4.43 |
22 | 2.5 |
?? | 3.68 |
Imputation step
The value to be used for imputing the missing can be calculated by interpolation.
(?-22) / (3.68 -2.5) = (22 -15.5) / (2.5 -4.43)
Based on interpolation we get the value 18 to be used for replacing the missing.
dev["HP_Imputed"] = dev["Holding_Period"].fillna(18)
Great! You have learned various techniques of missing value imputation. we now move on to the next important topic – Visualization and Pattern Detection
<<< previous blog | next blog >>>
Logistic Regression blog series home
Recent Comments