Missing Value Imputation

In statistics, imputation is the process of substituting the missing values in the data with some appropriate values.

Why impute the missing value?

Because most statistical packages, by default, discard any record (case) that has missing data in a column used by the model. Let us understand this with a practical dataset.

In MyBank Personal Loans Cross-Sell data, the Occupation field has some missing values. If we build a logistic regression model using the occupation field, the records having missing occupation will get discarded.

(Note: This blog is a continuation of our Logistic Regression Blog Series)

count_row = dev.shape[0]

count_occupation = dev["Occupation"].count()
print("No. of rows =",  count_row)
print("Occupation count =",  count_occupation)
print("No. of rows with missing occupation =", (count_row - count_occupation))

No. of rows = 10000
Occupation count = 7706
No. of rows with missing occupation = 2294

import statsmodels.formula.api as sm
import statsmodels.api as sma


mylogit = sm.glm(
    formula = "Target ~ Occupation", data = dev,
    family = sma.families.Binomial()
).fit()

# mylogit.summary()
mylogit.nobs

7706

The model has considered only 7706 records out of the 10000 observations. The rows with missing occupation have been discarded.

How to impute missing values?

So far, we have understood the importance of missing value imputation. It’s time we learn to impute the missing values. The various ways of imputing missing values are:

Do nothing
Mean, Median, or Mode value imputation
K Nearest Neighbours (Non-Parametric Technique)
Regression Techniques
Surrogates or Logical Imputation of Missing Value
Multivariate Imputation by Chained Equation (MICE)

I prefer to impute the values based on an approach that is logical and intuitively understandable.

Do nothing

The “Do Nothing” approach of imputation can be used only for categorical variables. In this approach, we treat missing values as a separate category and replace them with a text string such as “MISSING” or “NOT AVAILABLE”.

This approach cannot be used for numerical variables.

# Python code for replacing missing values by text string MISSING

dev["Occ_Imputed"]= dev["Occupation"].fillna("MISSING")

# R code for imputation

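# If the Occupation column is a factor and its missing values are stored as empty strings ("")

class(dev$Occupation) # to check the class of Occupation column
dev$Occ_Imputed = dev$Occupation
# Rename only the empty-string level; renaming levels(...)[1] would relabel
# a real occupation if the missing values were stored as NA instead
levels(dev$Occ_Imputed)[levels(dev$Occ_Imputed) == ""] = "MISSING"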

# Alternate code for imputation (when the missing values are stored as NA)


dev$Occ_Imputed = as.character(dev$Occupation)
dev$Occ_Imputed = ifelse(is.na(dev$Occ_Imputed),
       "MISSING", dev$Occ_Imputed)

Mode value imputation

Mode imputation can be used only for categorical variables, and preferably when the missingness in the overall data is less than 2 to 3%.

In MyBank Personal Loans Cross-Sell data, the occupation field has missing values in 2294 observations out of 10000, i.e. 22.94%. As such, we cannot simply replace the missing values with the most frequent (i.e. mode) category in the data.
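For a column where the missingness is small enough, mode imputation could look roughly like the sketch below. The `toy` DataFrame and the `Occ_Mode` column are made up for illustration; they are not part of the MyBank data, and the toy column has far more missingness than the 2 to 3% guideline.

```python
import pandas as pd

# Toy data (hypothetical), not the MyBank dataset
toy = pd.DataFrame({"Occupation": ["SAL", "SAL", "PROF", None, "SAL", "SENP"]})

# Share of missing values; mode imputation is reasonable only when this is small
missing_share = toy["Occupation"].isna().mean()

# mode() ignores missing values and returns a Series (ties give several rows),
# so take the first entry
most_frequent = toy["Occupation"].mode()[0]

toy["Occ_Mode"] = toy["Occupation"].fillna(most_frequent)
print("Share missing:", missing_share)
print("Most frequent category:", most_frequent)
```

Checking the share of missing values first makes the 2 to 3% guideline an explicit step rather than an afterthought.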

Mean or Median imputation

This method of imputation can be used only for continuous variables and preferably when the missingness in overall data is less than 2 – 3%. If the proportion of missingness in data is large, then using the mean/median will change the distribution of the overall data.
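The effect on the distribution can be seen with a small experiment. The sketch below uses synthetic data (the `balance` series and the 30% missingness rate are assumptions chosen only for illustration) and compares the standard deviation before and after filling.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values (hypothetical); about 30% are blanked out at random
rng = np.random.default_rng(0)
balance = pd.Series(rng.lognormal(mean=10, sigma=1, size=1000))
with_gaps = balance.copy()
with_gaps[rng.random(1000) < 0.30] = np.nan

mean_filled = with_gaps.fillna(with_gaps.mean())
median_filled = with_gaps.fillna(with_gaps.median())

# Every imputed value sits at a single point, so the spread shrinks
print("std of observed values:", with_gaps.std())
print("std after mean fill:   ", mean_filled.std())
print("std after median fill: ", median_filled.std())
```

Both filled series have a smaller standard deviation than the observed values, because a large block of records now shares one identical value.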

K Nearest Neighbours (KNN)

K Nearest Neighbours is a non-parametric supervised machine learning technique. In this approach, we build a KNN model on a sample dataset with no missing values for learning the patterns in the data. The KNN model is then used to predict the value of the missing cases. It can be used for the imputation of both categorical and numerical variables.
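The idea can be sketched by hand with NumPy and pandas. This is only a minimal illustration on made-up data: the `toy` DataFrame, the column names, and `k = 3` are assumptions, and a library implementation would normally be used in practice.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical columns and values)
toy = pd.DataFrame({
    "Age":        [25, 27, 45, 47, 26, 46],
    "Balance":    [10, 12, 90, 95, 11, 92],
    "Occupation": ["SAL", "SAL", "PROF", "PROF", None, None],
})
features = ["Age", "Balance"]
k = 3  # number of neighbours (an arbitrary choice for this sketch)

# Scale the features so that no single column dominates the distance
X = (toy[features] - toy[features].mean()) / toy[features].std()

# Rows with a known Occupation act as the "training" sample
known = toy["Occupation"].notna()
X_known = X[known].to_numpy()
y_known = toy.loc[known, "Occupation"].to_numpy()

toy["Occ_KNN"] = toy["Occupation"]
for idx in toy.index[~known]:
    # Euclidean distance from the incomplete row to every complete row
    dist = np.linalg.norm(X_known - X.loc[idx].to_numpy(), axis=1)
    nearest = y_known[np.argsort(dist)[:k]]
    # Majority vote for a categorical column; use the mean for a numeric one
    toy.loc[idx, "Occ_KNN"] = pd.Series(nearest).mode()[0]

print(toy)
```

The two incomplete rows receive the occupation held by the majority of their three nearest complete rows.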

Regression

In regression imputation, we treat the column having missing values as the dependent variable and certain other columns related to it as independent variables. We build a regression model on a sample dataset having no missing values. The regression model is then used to predict the missing values.

Regression imputation is generally not advisable when the imputed column will itself be used as a predictor in a Linear or Logistic Regression Model. The imputed values are a function of the other predictors, so including the imputed column alongside them can lead to the problem of multi-collinearity.
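A minimal sketch of these steps, using ordinary least squares from NumPy, is shown below. The `toy` DataFrame, its column names, and the choice of predictors are hypothetical and serve only to illustrate the flow.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): Income is missing for two customers
toy = pd.DataFrame({
    "Age":     [25, 30, 35, 40, 45, 50],
    "Balance": [10, 22, 28, 41, 52, 58],
    "Income":  [30.0, 41.0, 49.0, 60.0, np.nan, np.nan],
})
predictors = ["Age", "Balance"]
known = toy["Income"].notna().to_numpy()

# Design matrix with an intercept column
A = np.column_stack([np.ones(len(toy)), toy[predictors].to_numpy(dtype=float)])

# Ordinary least squares fitted on the complete rows only
y_known = toy.loc[known, "Income"].to_numpy()
coef, *_ = np.linalg.lstsq(A[known], y_known, rcond=None)

# Predict Income for the rows where it is missing
toy["Income_Imputed"] = toy["Income"]
toy.loc[~known, "Income_Imputed"] = A[~known] @ coef

print(toy)
```

The model is fitted only on rows where Income is observed, and the observed values are left untouched.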

Surrogate or Logical Imputation

The surrogate or logical approach of imputation can be used in certain scenarios, e.g.:

The missing values in Gender can be imputed from Title.
We can impute Monthly Income from Annual Income.
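Both examples can be sketched in pandas as follows. The `toy` DataFrame, the column names, and the title-to-gender mapping are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical columns and values)
toy = pd.DataFrame({
    "Title":          ["Mr", "Mrs", "Ms", "Dr", "Mr"],
    "Gender":         ["M", None, None, None, None],
    "Annual_Income":  [120000.0, 96000.0, 60000.0, 240000.0, 84000.0],
    "Monthly_Income": [10000.0, np.nan, 5000.0, np.nan, np.nan],
})

# Only titles that imply a gender are mapped; "Dr" does not, so it stays missing
title_to_gender = {"Mr": "M", "Mrs": "F", "Ms": "F", "Miss": "F"}
toy["Gender"] = toy["Gender"].fillna(toy["Title"].map(title_to_gender))

# Monthly income follows directly from annual income
toy["Monthly_Income"] = toy["Monthly_Income"].fillna(toy["Annual_Income"] / 12)

print(toy)
```

Titles that carry no gender information are deliberately left unmapped, so those rows remain missing rather than being guessed.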

Multivariate Imputation by Chained Equation (MICE)

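MICE is one of the approaches most commonly used by Data Scientists to impute missing values, and implementations are available in both R (the mice package) and Python (for example, in statsmodels). In MICE, each column with missing values is imputed in turn using an imputation model built on the other columns, and this cycle is repeated several times.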
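To give a feel for the chained-equation idea, here is a deliberately simplified sketch in NumPy. It is not the algorithm of any particular package: it uses plain linear regression for every column, runs a single deterministic chain, and omits the random draws and multiple imputed datasets that full MICE relies on. The data and all names are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic, correlated columns (hypothetical data)
x0 = rng.normal(size=n)
x1 = 2 * x0 + rng.normal(scale=0.1, size=n)
x2 = x0 - x1 + rng.normal(scale=0.1, size=n)
X = np.column_stack([x0, x1, x2])

# Blank out about 20% of the cells in the first two columns
missing = np.zeros_like(X, dtype=bool)
missing[:, :2] = rng.random((n, 2)) < 0.2
X_obs = np.where(missing, np.nan, X)

# Step 1: start from a simple fill (column means of the observed values)
filled = np.where(missing, np.nanmean(X_obs, axis=0), X_obs)

# Step 2: cycle through the columns, re-predicting each column's gaps
# from the current values of all the other columns
for _ in range(10):
    for j in range(X.shape[1]):
        miss_j = missing[:, j]
        if not miss_j.any():
            continue
        others = np.delete(filled, j, axis=1)
        A = np.column_stack([np.ones(n), others])
        # Fit on rows where column j is observed, then predict its gaps
        coef, *_ = np.linalg.lstsq(A[~miss_j], filled[~miss_j, j], rcond=None)
        filled[miss_j, j] = A[miss_j] @ coef

print("Remaining missing cells:", int(np.isnan(filled).sum()))
```

Each pass refines the earlier guesses, because the predictors used for one column include the freshly imputed values of the others.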

In the upcoming blog, we will see missing value imputation using the KNN technique.
