This post continues our Linear Regression blog series. In this part, I explain how a simple variable transformation can improve model performance.
Importance of Variable Transformation and R Code
The code so far…
# Import the file. (File download link)
> inc_exp <- read.csv("Inc_Exp_Data.csv", header = TRUE)

# Data Visualization – Scatter Plot
# run install.packages command only if package is not installed
> install.packages("psych")
> library(psych)
> pairs.panels(
    inc_exp[1:4],
    method = "pearson",    # correlation method
    hist.col = "#00AFBB",
    density = TRUE,        # show density plots
    lm = TRUE              # plot the linear fit
  )
From the pair plot, we observe that the distribution of the Emi_or_Rent_Amt variable is highly skewed.
# Note: I am using only 3 explanatory variables. The other variables have been intentionally left out for students and blog readers to practice with.
# Multiple Linear Model WITHOUT VARIABLE TRANSFORMATION
> m_linear_mod <- lm(
    Mthly_HH_Expense ~ Mthly_HH_Income + No_of_Fly_Members + Emi_or_Rent_Amt,
    data = inc_exp
  )
> summary(m_linear_mod)$adj.r.squared
[1] 0.6830682
Variable Transformation
From the pair plot we observe that the Emi_or_Rent_Amt variable is highly skewed, so we will apply a log transformation to it. The log compresses the large values, which makes the distribution more symmetric and closer to normal. We add 1 before taking the log so that any zero amounts remain defined, since log(0) is -Inf.
> inc_exp$Ln_Emi_or_Rent_Amt = log(inc_exp$Emi_or_Rent_Amt + 1)
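To see why the + 1 matters, here is a minimal sketch on a made-up vector (the values are assumptions, not rows from Inc_Exp_Data.csv). It also shows that base R's log1p() gives the same result as log(x + 1), so either form could be used for the transformation.

```r
# Hypothetical EMI/rent amounts, including a zero (not taken from Inc_Exp_Data.csv)
emi <- c(0, 1000, 3000, 20000)

# Plain log() sends the zero amount to -Inf
raw_log <- log(emi)

# The transformation used in this post: zero maps to 0
ln_emi <- log(emi + 1)

# Base R equivalent of log(x + 1)
ln_emi_alt <- log1p(emi)

# Check that both forms agree
all.equal(ln_emi, ln_emi_alt)
```

A -Inf value in a predictor would make lm() fail, which is why the shifted log is the safer choice when zeros are possible.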
# Variable Distribution before and after transformation
# Before transformation
> hist(x=inc_exp$Emi_or_Rent_Amt, main = "Histogram of Emi or Rent Amt", xlab = "Emi or Rent Amt", ylab = "Frequency", col = "orange")
# After transformation
> hist(x=inc_exp$Ln_Emi_or_Rent_Amt, main = "Log Transformed Emi or Rent Amt", xlab = "Log (Emi or Rent Amt)", ylab = "Frequency", col = "orange")
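The histograms show the change in shape visually. If you also want a number, a moment-based skewness can be computed before and after the transformation. The sketch below uses simulated right-skewed data as a stand-in for Emi_or_Rent_Amt, and the skewness() helper is a hypothetical function written for this illustration, not part of base R.

```r
# Moment-based skewness: third central moment divided by the
# second central moment raised to the power 1.5
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(42)
# Simulated right-skewed amounts (an assumption standing in for Emi_or_Rent_Amt)
emi_sim <- rlnorm(500, meanlog = 8, sdlog = 1)

skew_before <- skewness(emi_sim)           # strongly positive for this data
skew_after  <- skewness(log(emi_sim + 1))  # close to zero after the log

c(before = skew_before, after = skew_after)
```

On the real data you would call skewness(inc_exp$Emi_or_Rent_Amt) and skewness(inc_exp$Ln_Emi_or_Rent_Amt) instead of using the simulated vector.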
Multiple Linear Model WITH VARIABLE TRANSFORMATION
> m_linear_mod <- lm(
    Mthly_HH_Expense ~ Mthly_HH_Income + No_of_Fly_Members + Ln_Emi_or_Rent_Amt,
    data = inc_exp
  )
> summary(m_linear_mod)$adj.r.squared
[1] 0.7435739
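The value reported by summary(...)$adj.r.squared is the adjusted R-squared, which penalises the plain R-squared for the number of predictors. As a rough sketch of how it is derived, the snippet below recomputes it by hand. It uses R's built-in mtcars dataset and three arbitrarily chosen predictors as an assumption, so that it runs without Inc_Exp_Data.csv.

```r
# Fit a model with three predictors on a built-in dataset (illustration only)
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
s <- summary(fit)

n <- nrow(mtcars)  # number of observations
p <- 3             # number of predictors, excluding the intercept

# Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)

# Compare with the value that summary() reports
all.equal(adj_manual, s$adj.r.squared)
```

Because both models in this post use three predictors on the same rows, the penalty is identical for each, and the comparison between them is fair.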
Key Take-aways
After the log transformation of the Emi_or_Rent_Amt variable, the adjusted R-squared of the model improves from 0.68 to 0.74.
In linear regression, there is no assumption that the explanatory (independent) variables are normally distributed. However, making a highly skewed independent variable more symmetric with a transformation can improve model performance.
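The effect can be reproduced on simulated data. In the sketch below, all names and numbers are assumptions chosen for illustration: the response is constructed to be linear in the log of a skewed predictor, so the transformed model is expected to fit better. On real data the gain is not guaranteed and should be checked, as we did above.

```r
set.seed(123)
n <- 300

# Simulated right-skewed predictor (stand-in for an EMI/rent amount)
emi_sim <- rlnorm(n, meanlog = 8, sdlog = 1)

# Response built to depend linearly on log(emi_sim + 1), plus noise
expense_sim <- 5000 + 2000 * log(emi_sim + 1) + rnorm(n, sd = 1500)

sim <- data.frame(
  expense = expense_sim,
  emi     = emi_sim,
  ln_emi  = log(emi_sim + 1)
)

# Same response, raw versus log-transformed predictor
adj_raw <- summary(lm(expense ~ emi, data = sim))$adj.r.squared
adj_log <- summary(lm(expense ~ ln_emi, data = sim))$adj.r.squared

c(raw = adj_raw, log = adj_log)
```

The improvement here comes from the relationship becoming more linear after the transformation, not from the predictor looking more normal in itself.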
Next Blog
In the next blog, we will learn how to predict the estimated value using R code, compute residuals, RMSE, and more.