Building A Logistic Regression in Python, Step by Step


Logistic Regression is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable. In logistic regression, the dependent variable is a binary variable that contains data coded as 1 (yes, success, etc.) or 0 (no, failure, etc.). In other words, the logistic regression model predicts P(Y=1) as a function of X.
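
To make "P(Y=1) as a function of X" concrete, here is a minimal sketch of how a fitted logistic model turns inputs into probabilities. The coefficients, the intercept, and the helper names `sigmoid` and `predict_proba` are made up for illustration; they are not fitted to the dataset used later in this post.

```python
import numpy as np

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, weights, intercept):
    """Return P(Y=1 | X) for each row of X under a logistic model."""
    return sigmoid(X @ weights + intercept)

# Hypothetical coefficients, chosen only for illustration.
weights = np.array([0.8, -0.5])
intercept = -0.2

# Three example observations with two features each.
X = np.array([[0.0, 0.0],
              [1.0, 2.0],
              [3.0, 1.0]])

probs = predict_proba(X, weights, intercept)

# Turn probabilities into 1/0 predictions with the usual 0.5 cutoff.
labels = (probs >= 0.5).astype(int)
print(probs, labels)
```

The linear part `X @ weights + intercept` is the log odds; the sigmoid converts it to a probability, and a cutoff (0.5 here, by assumption) converts the probability to a class.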

Logistic Regression Assumptions

  • Binary logistic regression requires the dependent variable to be binary.
  • For a binary regression, the factor level 1 of the dependent variable should represent the desired outcome.
  • Only the meaningful variables should be included.
  • The independent variables should be independent of each other. That is, the model should have little or no multicollinearity.
  • The independent variables are linearly related to the log odds.
  • Logistic regression requires quite large sample sizes.
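
One rough way to check the multicollinearity assumption above is to look for pairs of independent variables with a very high pairwise correlation. The sketch below runs on synthetic data; the helper name `highly_correlated_pairs` and the 0.8 threshold are assumptions for illustration, not a standard rule.

```python
import numpy as np
import pandas as pd

# Synthetic features: x2 is almost a copy of x1, x3 is unrelated.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.05 * rng.normal(size=n)
x3 = rng.normal(size=n)
features = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

def highly_correlated_pairs(df, threshold=0.8):
    """Return (col_a, col_b, |corr|) for pairs whose absolute correlation exceeds threshold."""
    corr = df.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            value = float(corr.iloc[i, j])
            if value > threshold:
                pairs.append((cols[i], cols[j], value))
    return pairs

flagged = highly_correlated_pairs(features)
print(flagged)
```

Pairwise correlation only catches two-variable relationships; it is a quick screen rather than a complete multicollinearity diagnostic.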
Keeping the above assumptions in mind, let’s look at our dataset.

Data

The dataset comes from the UCI Machine Learning repository, and it is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe (1/0) to a term deposit (variable y). The dataset can be downloaded from here.

# Data handling
import numpy as np
import pandas as pd

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Plot defaults
plt.rc("font", size=14)
sns.set(style="whitegrid", color_codes=True)


The dataset provides the bank customers’ information. It includes 41,188 records and 21 fields.


Input variables
  1. age (numeric)
  2. job : type of job (categorical: “admin”, “blue-collar”, “entrepreneur”, “housemaid”, “management”, “retired”, “self-employed”, “services”, “student”, “technician”, “unemployed”, “unknown”)
  3. marital : marital status (categorical: “divorced”, “married”, “single”, “unknown”)
  4. education (categorical: “basic.4y”, “basic.6y”, “basic.9y”, “high.school”, “illiterate”, “professional.course”, “university.degree”, “unknown”)
  5. default: has credit in default? (categorical: “no”, “yes”, “unknown”)
  6. housing: has housing loan? (categorical: “no”, “yes”, “unknown”)
  7. loan: has personal loan? (categorical: “no”, “yes”, “unknown”)
  8. contact: contact communication type (categorical: “cellular”, “telephone”)
  9. month: last contact month of year (categorical: “jan”, “feb”, “mar”, …, “nov”, “dec”)
  10. day_of_week: last contact day of the week (categorical: “mon”, “tue”, “wed”, “thu”, “fri”)
  11. duration: last contact duration, in seconds (numeric). Important note: this attribute strongly affects the output target (e.g., if duration=0 then y=“no”). The duration is not known before a call is made, and once the call ends, y is known. This input should therefore be included only for benchmarking and discarded if the goal is a realistic predictive model.
  12. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
  13. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
  14. previous: number of contacts performed before this campaign and for this client (numeric)
  15. poutcome: outcome of the previous marketing campaign (categorical: “failure”, “nonexistent”, “success”)
  16. emp.var.rate: employment variation rate (numeric)
  17. cons.price.idx: consumer price index (numeric)
  18. cons.conf.idx: consumer confidence index (numeric)
  19. euribor3m: euribor 3 month rate (numeric)
  20. nr.employed: number of employees (numeric)
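
Two of the notes above translate directly into preparation steps: duration should be dropped for a realistic model, and 999 in pdays is a sentinel meaning "not previously contacted" rather than a real number of days. Here is a minimal sketch of both. The rows are made up for illustration (they are not taken from the real dataset), and the `prepare` function and the `previously_contacted` column are hypothetical names.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; column names follow the list above.
toy = pd.DataFrame({
    "age": [34, 51, 29],
    "duration": [210, 0, 95],
    "pdays": [999, 6, 999],
    "y": [0, 1, 0],
})

def prepare(df):
    """Drop the leaky duration column and make the pdays sentinel explicit."""
    # duration is only known after the call, so it leaks the outcome.
    out = df.drop(columns=["duration"])
    # Keep the "was this client contacted before?" signal as its own 1/0 flag.
    out["previously_contacted"] = (out["pdays"] != 999).astype(int)
    # Replace the sentinel so 999 is not treated as a real number of days.
    out["pdays"] = out["pdays"].replace(999, np.nan)
    return out

prepared = prepare(toy)
print(prepared)
```

The resulting missing values in pdays would still need handling (for example, imputing them or relying on the flag alone) before fitting scikit-learn's LogisticRegression, which does not accept NaN inputs.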
