Information Value and Weight of Evidence (WoE) are two of the most widely used concepts in Logistic Regression, for variable selection and variable transformation respectively. Information Value helps quantify the predictive power of a variable in separating the Good Customers from the Bad Customers, whereas WoE is used to transform categorical variables into continuous ones.
Pre-reads: Information Value and Variable Transformation
Understanding WoE Calculations
WoE is calculated by taking the natural logarithm (log to base e) of the ratio of %Good to %Bad (in this example, % Responders to % Non-Responders).
Weight of Evidence Formula: WoE = ln(% of Responders / % of Non-Responders)
The table below shows the Weight of Evidence calculations for the Occupation field. I will walk you through step-by-step calculations to compute WoE.
Occ_Imputed | cnt_resp | cnt_non_resp | pct_resp | pct_non_resp | WOE |
---|---|---|---|---|---|
MISSING | 91 | 2203 | 0.197826 | 0.230922 | -0.154694 |
PROF | 121 | 2613 | 0.263043 | 0.273899 | -0.040441 |
SAL | 86 | 2901 | 0.186957 | 0.304088 | -0.486441 |
SELF-EMP | 156 | 1487 | 0.339130 | 0.155870 | 0.777362 |
SENP | 6 | 336 | 0.013043 | 0.035220 | -0.993329 |
Step 1: Get the frequency count of the dependent variable class by the independent variable. This step will give the first three columns of the above table.
- Occ_Imputed: Independent Variable.
- cnt_resp: Count of Responders, i.e. Target = 1
- cnt_non_resp: Count of Non-Responders, i.e. Target = 0
# Crosstab code in Python
import pandas as pd
pd.crosstab(dev["Occ_Imputed"], dev["Target"])

# Crosstab code in R
table(dev$Occ_Imputed, dev$Target)

# Note - The Development Sample in R and Python is not exactly the same.
# As such, you can expect some difference in the R and Python crosstab output.
Step 2: Convert the count values into proportions. The formula is the count of responders divided by the total responders and, likewise, the count of non-responders divided by the total non-responders.
Occ_Imputed | cnt_resp | cnt_non_resp | pct_resp | pct_non_resp |
---|---|---|---|---|
MISSING | 91 | 2203 | 91/460 = 0.198 | 2203/9540 = 0.231 |
PROF | 121 | 2613 | 121/460 = 0.263 | 2613/9540 = 0.274 |
SAL | 86 | 2901 | 86/460 = 0.187 | 2901/9540 = 0.304 |
SELF-EMP | 156 | 1487 | 156/460 = 0.339 | 1487/9540 = 0.156 |
SENP | 6 | 336 | 6/460 = 0.013 | 336/9540 = 0.035 |
Total | 460 | 9540 | 1.000 | 1.000 |
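As a minimal sketch of this step (the names freq, cnt_resp and cnt_non_resp below are illustrative, not from the original script), the proportions can be computed directly from the Step 1 crosstab:

import pandas as pd
# Step 1 crosstab: columns come out as Target = 0 and Target = 1
freq = pd.crosstab(dev["Occ_Imputed"], dev["Target"])
freq.columns = ["cnt_non_resp", "cnt_resp"]
# Step 2: convert counts to proportions within each class
freq["pct_resp"] = freq["cnt_resp"] / freq["cnt_resp"].sum()
freq["pct_non_resp"] = freq["cnt_non_resp"] / freq["cnt_non_resp"].sum()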
Step 3: Calculate WoE by taking the natural log of the ratio of the Responders proportion to the Non-Responders proportion.
Occ_Imputed | cnt_resp | cnt_non_resp | pct_resp | pct_non_resp | WOE |
---|---|---|---|---|---|
MISSING | 91 | 2203 | 0.198 | 0.231 | ln(0.198/0.231) = -0.155 |
PROF | 121 | 2613 | 0.263 | 0.274 | -0.040441 |
SAL | 86 | 2901 | 0.187 | 0.304 | -0.486441 |
SELF-EMP | 156 | 1487 | 0.339 | 0.156 | 0.777362 |
SENP | 6 | 336 | 0.013 | 0.035 | -0.993329 |
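Continuing the same sketch (using the freq DataFrame built in Step 2 above), the WoE column is the natural log of the ratio of the two proportion columns:

import numpy as np
# Step 3: WoE = ln(pct_resp / pct_non_resp) for each category
freq["WOE"] = np.log(freq["pct_resp"] / freq["pct_non_resp"])
freq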
Python code to compute WoE
We have automated the above WoE calculation in the k2_iv_woe_function.py file. You can download the k2_iv_woe_function.py file from GitHub.
exec(open("k2_iv_woe_function.py").read())
woe_table = woe(df=dev, target="Target", var="Occ_Imputed", bins=10, fill_na=True)
woe_table
Application of WoE for Variable Transformation
WoE can be used to transform a Categorical Variable into a Numerical one. You do this by substituting each category with its respective WoE value. The benefit of the WoE transformation is that the transformed variable has a linear relationship with the log odds. To understand it better, execute the code below and see its Ln Odds Visualization chart.
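Here is a minimal sketch of the substitution, assuming woe_table has the Occ_Imputed and WOE columns shown in the table above (the new column name Occ_WOE is illustrative):

# Build a category -> WoE lookup and replace each category with its WoE value
woe_map = dict(zip(woe_table["Occ_Imputed"], woe_table["WOE"]))
dev["Occ_WOE"] = dev["Occ_Imputed"].map(woe_map)
dev[["Occ_Imputed", "Occ_WOE"]].head()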
Benefits of using WoE in Logistic Regression
1. Does away with One-Hot Encoding: Some machine learning packages do not accept categorical variables directly; you have to convert them into a dummy 1-0 matrix, also called one-hot encoding. If a categorical variable has many categories, this adds many columns to the dataset. We can do away with one-hot encoding by using the WoE step.
2. Only One Beta Coefficient: A categorical variable with "n" categories results in "n-1" beta coefficients in the model. However, converting a categorical variable to its WoE equivalent requires only one beta coefficient, thereby simplifying the model equation (see the sketch below).
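To make the contrast concrete, here is a small sketch (assuming the Occ_WOE column created above): one-hot encoding of Occ_Imputed adds one dummy column per category, while the WoE version remains a single numeric column with a single coefficient.

# One-hot encoding: n-1 dummy columns -> n-1 beta coefficients in the model
dummies = pd.get_dummies(dev["Occ_Imputed"], prefix="Occ", drop_first=True)
print(dummies.shape[1])

# WoE transformation: one numeric column -> one beta coefficient
print(dev[["Occ_WOE"]].shape[1])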
<<< previous blog | next blog >>>
Logistic Regression blog series home