This post continues our Linear Regression blog series. In this part, I will explain how a simple variable transformation can noticeably improve model performance.

 

Importance of Variable Transformation and R Code

 

The code so far…

 

# Import the file. (File download link)
> inc_exp <- read.csv("Inc_Exp_Data.csv", header = TRUE)

# Data Visualization – Scatter Plot
# run install.packages command only if package is not installed
> install.packages("psych") 
> library(psych)
> pairs.panels(
inc_exp[1:4], 
method = "pearson", # correlation method
hist.col = "#00AFBB",
density = TRUE, # show density plots
lm = TRUE # plot the linear fit
)

 

 

From the pair plot, we observe that the distribution of the Emi_or_Rent_Amt variable is highly skewed.

# Note: I am using only 3 explanatory variables. Other variables have been intentionally left for the students / blog readers to practice.

 

# Multiple Linear Model WITHOUT VARIABLE TRANSFORMATION
> m_linear_mod <- lm(
Mthly_HH_Expense ~ Mthly_HH_Income 
+ No_of_Fly_Members + Emi_or_Rent_Amt, 
data = inc_exp
)
> summary(m_linear_mod)$adj.r.squared
[1] 0.6830682

 

Variable Transformation

 

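From the pair plot, we observe that the Emi_or_Rent_Amt variable is highly skewed, so we will apply a log transformation to it. The log transformation compresses the scale of the variable and makes its distribution more symmetric, somewhat closer to normal. We add 1 before taking the log so that any zero amounts stay defined, since log(0) is -Inf in R.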

 

> inc_exp$Ln_Emi_or_Rent_Amt <- log(inc_exp$Emi_or_Rent_Amt + 1)
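
To see what this transformation does numerically, the sketch below compares a skewness measure before and after taking logs. It runs on simulated right-skewed values, not on Inc_Exp_Data.csv, and `sample_skewness` is a hypothetical helper written here for illustration (base R does not ship a skewness function). It also shows that base R's `log1p(x)` computes the same quantity as `log(x + 1)`.

```r
# Simulated right-skewed amounts; an assumption for illustration only,
# these are NOT the values from Inc_Exp_Data.csv
set.seed(42)
amt <- rlnorm(100, meanlog = 8, sdlog = 1)

# Hypothetical helper: moment-based sample skewness
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

ln_amt <- log1p(amt)  # equivalent to log(amt + 1)

skew_before <- sample_skewness(amt)
skew_after <- sample_skewness(ln_amt)
print(c(before = skew_before, after = skew_after))

# log(0) is -Inf, while log(0 + 1) is 0, hence the "+ 1"
print(c(log(0), log1p(0)))
```

A skewness close to 0 indicates a roughly symmetric distribution, so the value after the transformation is the one to look at.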

 

# Variable Distribution before and after transformation
# Before transformation

> hist(x = inc_exp$Emi_or_Rent_Amt, 
main = "Histogram of Emi or Rent Amt",
xlab = "Emi or Rent Amt", ylab = "Frequency",
col = "orange")

 

Figure: Histogram of Emi or Rent Amt (before transformation)

 

# After transformation

> hist(x=inc_exp$Ln_Emi_or_Rent_Amt, 
main = "Log Transformed Emi or Rent Amt",
xlab = "Log (Emi or Rent Amt)", ylab = "Frequency",
col = "orange")

 

Figure: Histogram of Log (Emi or Rent Amt) (after transformation)

 

Multiple Linear Model WITH VARIABLE TRANSFORMATION

 

> m_linear_mod <- lm(
Mthly_HH_Expense ~ Mthly_HH_Income 
+ No_of_Fly_Members + Ln_Emi_or_Rent_Amt, 
data = inc_exp
)

> summary(m_linear_mod)$adj.r.squared
[1] 0.7435739
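
Because the second fit reuses the name m_linear_mod, the first model is overwritten. One way to compare the two fits side by side is to keep them under separate names, as in this sketch. The data frame is simulated with the blog's column names so that the block runs on its own; the coefficients and distributions are made up, so its numbers will not match the 0.68 and 0.74 reported in this post. The names `sim`, `mod_raw`, and `mod_log` are hypothetical.

```r
# Simulated stand-in for inc_exp (made-up values, same column names)
set.seed(1)
n <- 200
sim <- data.frame(
  Mthly_HH_Income = rlnorm(n, meanlog = 10, sdlog = 0.5),
  No_of_Fly_Members = sample(1:7, n, replace = TRUE),
  Emi_or_Rent_Amt = rlnorm(n, meanlog = 8, sdlog = 1)
)
sim$Ln_Emi_or_Rent_Amt <- log(sim$Emi_or_Rent_Amt + 1)

# Assumption: expense depends on the LOG of the EMI/rent amount
sim$Mthly_HH_Expense <- 2000 + 0.3 * sim$Mthly_HH_Income +
  1500 * sim$No_of_Fly_Members + 4000 * sim$Ln_Emi_or_Rent_Amt +
  rnorm(n, sd = 3000)

# Fit both models under separate names so neither is overwritten
mod_raw <- lm(Mthly_HH_Expense ~ Mthly_HH_Income + No_of_Fly_Members +
                Emi_or_Rent_Amt, data = sim)
mod_log <- lm(Mthly_HH_Expense ~ Mthly_HH_Income + No_of_Fly_Members +
                Ln_Emi_or_Rent_Amt, data = sim)

# Collect the adjusted R-squared of both fits in one named vector
adj_r2 <- c(
  raw = summary(mod_raw)$adj.r.squared,
  log = summary(mod_log)$adj.r.squared
)
print(round(adj_r2, 4))
```

With the real data, the same pattern applies: replace `sim` with `inc_exp` and both adjusted R-squared values stay available for comparison.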

 

Key Take-aways

After log-transforming the Emi_or_Rent_Amt variable, the adjusted R-squared of the model improves from 0.68 to 0.74.

Linear regression does not assume that the explanatory (independent) variables are normally distributed. However, making a highly skewed independent variable more symmetric with a transformation can improve model performance, typically because it reduces the influence of extreme values and can bring the relationship with the dependent variable closer to linear.
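
The claim that the explanatory variable need not be normally distributed can be checked on simulated data. In the sketch below, the predictor is strongly right-skewed, yet the relationship is linear by construction, so the fit recovers the slope without any transformation. All values and names (`x`, `y`, `fit`) are made up for illustration; the Shapiro-Wilk test from base R's stats package is used only as a rough normality check.

```r
# Simulated data; an assumption for illustration only
set.seed(7)
n <- 500
x <- rlnorm(n, meanlog = 0, sdlog = 1)  # highly skewed predictor
y <- 5 + 2 * x + rnorm(n, sd = 1)       # linear in x, normal errors

fit <- lm(y ~ x)
slope <- unname(coef(fit)["x"])
print(slope)  # should land close to the true slope of 2

# Shapiro-Wilk normality test on the predictor and on the residuals;
# a very small p-value indicates a clear departure from normality
p_x <- shapiro.test(x)$p.value
p_resid <- shapiro.test(residuals(fit))$p.value
print(c(predictor = p_x, residuals = p_resid))
```

A transformation pays off when, as with Emi_or_Rent_Amt, the dependent variable tracks the transformed predictor more closely than the raw one, not merely because the predictor is skewed.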

Next Blog

In the next blog, we will learn how to compute predicted values using R code, compute residuals, RMSE, and more.
