Logistic Regression using Python & R

In this blog, we will learn to build a single variable logistic regression using Python and also interpret the model summary output.

Business Objective Overview

MyBank wishes to develop a Direct Marketing Channel to sell their loan products to existing deposit account customers. The bank executed a pilot campaign to cross-sell personal loans to its existing customers. A random base of 20000 customers was targeted with an attractive personal loan offer and processing fee waiver. The data of the customers who were targeted and their response to the marketing offer has been provided. The data is in the file (LR_DF.csv) and it can be downloaded from our resources section.

Data

The sample few records of the campaign data are shown below:

Logistic Regression Sample Data

Import Code

# Python code to import the data 
import pandas as pd 
LR_DF = pd.read_csv("LR_DF.csv") 
LR_DF.shape  

LR_DF.head()

# R code to import the data
LR_DF <- read.csv("LR_DF.csv")

dim(LR_DF)

View(LR_DF)

Metadata Understanding

Metadata is data about the data, i.e., it is the description of the data. It is very important to have metadata understanding before jumping into model development.

Sr. No.	Column Name	Description
1.	Cust_ID	Customer ID (Unique)
2.	Target	Dependent Variable (1: Responder to the campaign offer; 0: Non-responder)
3.	Age	Age of the customer
4.	Gender	Gender (M: Male, F: Female, O: Others)
5.	Balance	Average Quarterly Balance
6.	No_OF_CR_TXNS	Number of Credit Transaction in recent last 3 months
7.	AGE_BKT	Age Bucket
8.	SCR	Internal generic marketing score maintained by MyBank
9.	Holding_Period	The ability of the customer to retain (hold) money in the account. Unit: days

Descriptive Statistics and Exploratory Data Analysis

The next step after metadata understanding is performing a detailed descriptive analysis of each and every variable. We have explained Descriptive Statistics in detail in our Statistics Blog Series. Some of the important links you may wish to read are:

Build a Logistic Regression Model

Logistic Regression Model with One Continuous Independent Variable

Let us model the Target (dependent variable) with SCR (continuous independent variable).

# Syntax to build Logistic Regression Model in Python 
import statsmodels.formula.api as sm 
import statsmodels.api as sma 
# glm stands for Generalized Linear Model
mylogit = sm.glm( formula = "Target ~ Balance", 
    data = mydata_dev, 
    family = sma.families.Binomial() ).fit() 

mylogit.summary()

Logistic Regression Model Using Python

# Syntax to build Logistic Regression Model in R 

mylogit <- glm( formula = Target ~ SCR, 
    data = LR_DF, family = "binomial" ) 

summary(mylogit)

Single Variable Logistic Regression Model

Interpretation of Coefficients table

The above coefficients expressed in the logistic regression function would be:

Logistic Regression Model Equation

The beta coefficient of SCR is positive, it indicates that the probability (p) has a positive correlation with SCR.

Predicting Probabilities

Probability if the SCR of the customer is 700

Logistic Regression Probability Calculations

Logistic Regression Model with One Categorical Independent Variable

Let us model the Target (dependent variable) with Gender (categorical independent variable).

# Gender variable frequency distribution

# Gender variable has three categories – M, F, O

LR_DF["Gender"].value_counts()

F M O 5525 14279 196

mylogit = sm.glm(formula = "Target ~ Gender",
     data = mydata_dev, 
     family = sma.families.Binomial() 
     ).fit() 

mylogit.summary()

Logistic Regression Model using Python

# Gender variable frequency distribution

# Gender variable has three categories – M, F, O

table(LR_DF$Gender)

F M O 5525 14279 196

mylogit <- glm( formula = Target ~ Gender,   
     data = LR_DF, family = "binomial" ) 

summary(mylogit)

Logistic Regression Model Using R

Interpretation of Coefficients table

The above coefficients expressed in the logistic regression function would be:

Logistic Regression Model Equation

The beta coefficients of Gender variable are given only for two categories (M, O) out of the three categories (F, M, O). This is because one of the categories is considered as a baseline and its effect is captured in the intercept.
The choice of the baseline for the categorical variable is done based on alphabetical ascending order. In the above example, category F is considered as a baseline and its beta coefficient is 0.
The p-value of one of the categories of Gender “O” is not significant. How to handle insignificant variables is covered in our subsequent blogs.
Within a categorical variable, the category having the highest response rate should also have the highest beta coefficient. From the table shown below, we can see the beta coefficient is in order of the response rate.

Gender	Target = 1	Total Obs.	Response Prob.	Beta Coefficient
F	180	5525	0.033	0
M	700	14279	0.049	0.4258
O	8	196	0.041	0.2340

Predicting Probabilities

Probability if the customer is Male

Logistic Regression Probability Calculations

Practice Exercise

Build a model using SCR and Gender variable together
Estimate the probability assuming the customer SCR = 700 and Gender = M

<<< previous blog | next blog >>>
Logistic Regression blog series home