Bank_marketing_causal_regression_v2

Linear Regression for Causal Inference

Bank Marketing Case Study

This notebook illustrates Chapter 4 of Matheus Facure’s Causal Inference in Python using the Bank Marketing dataset.

Causal question:
Does contacting customers by cellular instead of telephone increase the probability of subscribing to a term deposit?

  • Treatment T: contact = cellular (1) vs telephone (0)
  • Outcome Y: subscription = yes (1) vs no (0)

We use the Bank Marketing dataset (Portuguese bank direct marketing campaigns).

Official source (UCI Machine Learning Repository)

  • Dataset page: https://archive.ics.uci.edu/dataset/222/bank+marketing
    (Direct downloadable files are linked on that page.)

Kaggle mirror (CSV download; requires Kaggle login)

  • https://www.kaggle.com/datasets/janiobachmann/bank-marketing-dataset

Which file should you use?

  • Kaggle typically provides bank-full.csv / bank.csv
  • The UCI dataset provides multiple formats; this notebook loads from UCI via ucimlrepo

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import os

np.random.seed(42)
FIG_DIR = "figures_ch4_bank_marketing"
os.makedirs(FIG_DIR, exist_ok=True)

!pip install ucimlrepo

Load public Bank Marketing data (UCI)


from ucimlrepo import fetch_ucirepo

bank = fetch_ucirepo(id=222)
df = pd.concat([bank.data.features, bank.data.targets], axis=1)

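# Binary outcome: 1 if the customer subscribed to a term deposit.
df["Y"] = (df["y"].astype(str).str.lower() == "yes").astype(int)
# astype(str) turns a missing contact channel into the string "nan",
# so the isin filter below also drops rows with an unknown contact type.
df["contact"] = df["contact"].astype(str).str.lower()
df = df[df["contact"].isin(["cellular", "telephone"])].copy()
# Binary treatment: 1 = cellular, 0 = telephone.
df["T"] = (df["contact"] == "cellular").astype(int)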

df.head()

       age          job  marital  education  default  balance  housing  loan    contact  day_of_week  month  duration  campaign  pdays  previous  poutcome   y  Y  T
12657   27   management   single  secondary       no       35       no    no   cellular            4    jul       255         1     -1         0       NaN  no  0  1
12658   54  blue-collar  married    primary       no      466       no    no   cellular            4    jul       297         1     -1         0       NaN  no  0  1
12659   43  blue-collar  married  secondary       no      105       no   yes   cellular            4    jul       668         2     -1         0       NaN  no  0  1
12660   31   technician   single  secondary       no       19       no    no  telephone            4    jul        65         2     -1         0       NaN  no  0  0
12661   27   technician   single  secondary       no      126      yes   yes   cellular            4    jul       436         4     -1         0       NaN  no  0  1

Naive regression (difference in means)


m_naive = smf.ols("Y ~ T", data=df).fit()
m_naive.summary().tables[1]

               coef   std err        t  P>|t|    [0.025    0.975]
Intercept    0.1342     0.007   20.384  0.000     0.121     0.147
T            0.0150     0.007    2.171  0.030     0.001     0.029
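
The heading calls this regression a difference in means. With a single binary regressor, the OLS slope equals the gap between the two group means and the intercept equals the mean of the T = 0 group. The sketch below checks this on simulated data; the sample size and the 13% / 15% subscription probabilities are hypothetical and are not taken from the bank dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1_000

# Toy data (hypothetical): random binary treatment, binary outcome.
toy = pd.DataFrame({"T": rng.integers(0, 2, size=n)})
toy["Y"] = rng.binomial(1, 0.13 + 0.02 * toy["T"])

# OLS with a single binary regressor.
m = smf.ols("Y ~ T", data=toy).fit()

# Difference in group means, computed directly.
group_means = toy.groupby("T")["Y"].mean()
diff_in_means = group_means[1] - group_means[0]

print("OLS slope on T:     ", m.params["T"])
print("Difference in means:", diff_in_means)
```

The two printed numbers agree up to floating-point error, which is why the T coefficient in the table above can be read as the raw gap in subscription rates between cellular and telephone contacts.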

Adjusted regression with month fixed effects


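# Numeric controls. The filter silently drops names that are not columns:
# with the ucimlrepo column names shown in df.head(), "day" is dropped
# because the column is called day_of_week.
num_controls = ["age","balance","campaign","pdays","previous","day"]
num_controls = [c for c in num_controls if c in df.columns]

# Categorical controls; C(month) provides the month fixed effects.
cat_controls = ["job","marital","education","housing","loan","month","poutcome"]
cat_controls = [c for c in cat_controls if c in df.columns]

formula = "Y ~ T"
for c in num_controls:
    formula += f" + {c}"
for c in cat_controls:
    formula += f" + C({c})"

# The formula interface drops rows with a missing value in any model variable,
# so this model is fit on fewer rows than the naive one (compare m_adj.nobs).
m_adj = smf.ols(formula, data=df).fit()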
m_adj.summary().tables[1]

                                coef   std err        t  P>|t|    [0.025    0.975]
Intercept                     0.0799     0.040    2.003  0.045     0.002     0.158
C(job)[T.blue-collar]        -0.0197     0.015   -1.313  0.189    -0.049     0.010
C(job)[T.entrepreneur]       -0.0317     0.027   -1.169  0.243    -0.085     0.022
C(job)[T.housemaid]          -0.0374     0.032   -1.178  0.239    -0.100     0.025
C(job)[T.management]          0.0140     0.016    0.865  0.387    -0.018     0.046
C(job)[T.retired]             0.0218     0.024    0.928  0.354    -0.024     0.068
C(job)[T.self-employed]       0.0060     0.025    0.239  0.811    -0.043     0.055
C(job)[T.services]            0.0029     0.017    0.167  0.867    -0.031     0.037
C(job)[T.student]             0.0855     0.027    3.213  0.001     0.033     0.138
C(job)[T.technician]         -0.0068     0.015   -0.458  0.647    -0.036     0.022
C(job)[T.unemployed]          0.0746     0.027    2.773  0.006     0.022     0.127
C(marital)[T.married]         0.0178     0.013    1.355  0.175    -0.008     0.043
C(marital)[T.single]          0.0261     0.015    1.744  0.081    -0.003     0.055
C(education)[T.secondary]     0.0108     0.014    0.798  0.425    -0.016     0.037
C(education)[T.tertiary]      0.0287     0.017    1.727  0.084    -0.004     0.061
C(housing)[T.yes]            -0.1011     0.010  -10.009  0.000    -0.121    -0.081
C(loan)[T.yes]               -0.0382     0.012   -3.237  0.001    -0.061    -0.015
C(month)[T.aug]               0.1050     0.020    5.240  0.000     0.066     0.144
C(month)[T.dec]               0.1243     0.036    3.495  0.000     0.055     0.194
C(month)[T.feb]              -0.0059     0.016   -0.359  0.720    -0.038     0.026
C(month)[T.jan]              -0.0727     0.020   -3.674  0.000    -0.112    -0.034
C(month)[T.jul]               0.1828     0.027    6.877  0.000     0.131     0.235
C(month)[T.jun]               0.1547     0.024    6.543  0.000     0.108     0.201
C(month)[T.mar]               0.2091     0.030    6.869  0.000     0.149     0.269
C(month)[T.may]              -0.0257     0.013   -1.969  0.049    -0.051    -0.000
C(month)[T.nov]              -0.0323     0.016   -2.044  0.041    -0.063    -0.001
C(month)[T.oct]               0.1450     0.023    6.182  0.000     0.099     0.191
C(month)[T.sep]               0.2023     0.024    8.306  0.000     0.155     0.250
C(poutcome)[T.other]          0.0334     0.010    3.318  0.001     0.014     0.053
C(poutcome)[T.success]        0.4063     0.012   34.528  0.000     0.383     0.429
T                             0.0414     0.016    2.639  0.008     0.011     0.072
age                           0.0008     0.000    1.569  0.117    -0.000     0.002
balance                    3.072e-06  1.32e-06    2.320  0.020  4.76e-07  5.67e-06
campaign                     -0.0144     0.003   -5.478  0.000    -0.020    -0.009
pdays                         0.0001  4.29e-05    3.052  0.002  4.69e-05     0.000
previous                      0.0018     0.001    2.059  0.040  8.62e-05     0.004
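
The adjusted model includes C(poutcome), and the df.head() output shows poutcome missing (NaN) for some rows. The statsmodels formula interface drops any row with a missing value in a model variable by default, so an adjusted model can be fit on a smaller sample than a naive one. The sketch below illustrates this on toy data; the column values and the 300 blanked rows are hypothetical, chosen only to make the effect visible.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 500

# Toy data (hypothetical): random treatment, outcome, and a categorical control.
toy = pd.DataFrame({
    "T": rng.integers(0, 2, size=n),
    "Y": rng.integers(0, 2, size=n),
    "poutcome": pd.Series(
        rng.choice(["failure", "other", "success"], size=n), dtype=object
    ),
})

# Blank out the control for the first 300 rows (.loc slicing is label-inclusive).
toy.loc[:299, "poutcome"] = np.nan

m_small = smf.ols("Y ~ T", data=toy).fit()                # uses every row
m_big = smf.ols("Y ~ T + C(poutcome)", data=toy).fit()    # drops rows with NaN

print("rows used without the control:", int(m_small.nobs))
print("rows used with the control:   ", int(m_big.nobs))
```

Comparing `nobs` across models is a quick way to check whether a change in the T coefficient reflects the added controls, a change in the sample, or both.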

Frisch–Waugh–Lovell (FWL) theorem


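# Partial out only age, balance, and month from both T and Y. By FWL, the
# T_res coefficient below equals the T coefficient in
# "Y ~ T + age + balance + C(month)", not the one in m_adj (which has more controls).
f_T = "T ~ age + balance + C(month)"
f_Y = "Y ~ age + balance + C(month)"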

mT = smf.ols(f_T, data=df).fit()
mY = smf.ols(f_Y, data=df).fit()

df["T_res"] = mT.resid
df["Y_res"] = mY.resid

m_fwl = smf.ols("Y_res ~ T_res", data=df).fit()
m_fwl.summary().tables[1]

                coef  std err         t  P>|t|  [0.025  0.975]
Intercept  7.752e-15    0.002  4.09e-12  1.000  -0.004   0.004
T_res         0.0348    0.007     5.130  0.000   0.022   0.048
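
The FWL theorem says the residual-on-residual coefficient equals the treatment coefficient from the single regression that includes the same controls. The notebook does not fit that matching regression, so the sketch below verifies the equality on simulated data. The data-generating process (effect size 0.05, age as a confounder, three months) is hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 2_000

# Toy data (hypothetical): age affects both treatment and outcome.
toy = pd.DataFrame({
    "age": rng.normal(40, 10, size=n),
    "month": rng.choice(["may", "jun", "jul"], size=n),
})
p_treat = 1 / (1 + np.exp(-(toy["age"] - 40) / 10))
toy["T"] = rng.binomial(1, p_treat)
toy["Y"] = 0.05 * toy["T"] + 0.01 * toy["age"] + rng.normal(0, 1, size=n)

# One-step estimate: T together with the controls.
full = smf.ols("Y ~ T + age + C(month)", data=toy).fit()

# FWL: residualize T and Y on the same controls, then regress residual on residual.
toy["T_res"] = smf.ols("T ~ age + C(month)", data=toy).fit().resid
toy["Y_res"] = smf.ols("Y ~ age + C(month)", data=toy).fit().resid
fwl = smf.ols("Y_res ~ T_res", data=toy).fit()

print("full regression coefficient on T:", full.params["T"])
print("FWL coefficient on T_res:        ", fwl.params["T_res"])
```

The point estimates match up to floating-point error. The reported standard errors can differ slightly because the final residual regression does not account for the degrees of freedom used in the first two steps.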

Heterogeneous effects (interaction with month)


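# T*C(month) expands to T + C(month) + T:C(month). The T coefficient is the
# effect in the baseline month (apr); each T:C(month) term is the difference
# from that baseline.
m_inter = smf.ols("Y ~ T*C(month) + age + balance", data=df).fit()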
m_inter.summary().tables[1]

                         coef   std err        t  P>|t|    [0.025    0.975]
Intercept              0.1503     0.026    5.840  0.000     0.100     0.201
C(month)[T.aug]       -0.0609     0.032   -1.880  0.060    -0.124     0.003
C(month)[T.dec]        0.1768     0.061    2.907  0.004     0.058     0.296
C(month)[T.feb]       -0.0500     0.032   -1.579  0.114    -0.112     0.012
C(month)[T.jan]       -0.0935     0.038   -2.433  0.015    -0.169    -0.018
C(month)[T.jul]       -0.1160     0.027   -4.333  0.000    -0.168    -0.064
C(month)[T.jun]        0.0530     0.045    1.179  0.239    -0.035     0.141
C(month)[T.mar]        0.2151     0.053    4.096  0.000     0.112     0.318
C(month)[T.may]       -0.1171     0.029   -4.057  0.000    -0.174    -0.061
C(month)[T.nov]       -0.0911     0.030   -3.063  0.002    -0.149    -0.033
C(month)[T.oct]        0.2386     0.038    6.242  0.000     0.164     0.314
C(month)[T.sep]        0.1794     0.048    3.713  0.000     0.085     0.274
T                      0.0127     0.025    0.508  0.612    -0.036     0.062
T:C(month)[T.aug]     -0.0275     0.033   -0.826  0.409    -0.093     0.038
T:C(month)[T.dec]      0.1171     0.066    1.764  0.078    -0.013     0.247
T:C(month)[T.feb]      0.0232     0.033    0.703  0.482    -0.042     0.088
T:C(month)[T.jan]      0.0015     0.040    0.036  0.971    -0.077     0.080
T:C(month)[T.jul]      0.0183     0.028    0.656  0.512    -0.036     0.073
T:C(month)[T.jun]      0.1884     0.047    3.995  0.000     0.096     0.281
T:C(month)[T.mar]      0.1182     0.055    2.130  0.033     0.009     0.227
T:C(month)[T.may]      0.0429     0.030    1.431  0.152    -0.016     0.102
T:C(month)[T.nov]     -0.0110     0.031   -0.355  0.723    -0.072     0.050
T:C(month)[T.oct]      0.0058     0.041    0.141  0.888    -0.075     0.087
T:C(month)[T.sep]      0.1375     0.051    2.684  0.007     0.037     0.238
age                    0.0007     0.000    3.828  0.000     0.000     0.001
balance             4.007e-06  6.04e-07    6.631  0.000  2.82e-06  5.19e-06
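
In the interaction model, the estimated effect of cellular contact in a given month is the baseline T coefficient plus that month's T:C(month) term. The sketch below does this arithmetic for a few months, using coefficients copied from the table above. The function `effect_by_month` is a hypothetical helper written for this illustration, not part of statsmodels; on a fitted model the same logic could be applied to `m_inter.params`.

```python
import pandas as pd

# Coefficients copied from the interaction table (subset of months).
params = pd.Series({
    "T": 0.0127,
    "T:C(month)[T.aug]": -0.0275,
    "T:C(month)[T.jun]": 0.1884,
    "T:C(month)[T.mar]": 0.1182,
    "T:C(month)[T.sep]": 0.1375,
})

def effect_by_month(params, treatment="T", factor="month", baseline="apr"):
    """Hypothetical helper: baseline effect plus each month's interaction term."""
    prefix = f"{treatment}:C({factor})[T."
    effects = {baseline: params[treatment]}
    for name, value in params.items():
        if name.startswith(prefix):
            level = name[len(prefix):-1]   # strip the prefix and the closing "]"
            effects[level] = params[treatment] + value
    return pd.Series(effects)

effects = effect_by_month(params)
print(effects.round(4))
```

For June this gives 0.0127 + 0.1884 = 0.2011, about 20 percentage points. These sums are point estimates only; a standard error for each sum requires the covariance between the two coefficients, which the summary table does not show.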

Key takeaways

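  • Regression is an adjusted comparison: the naive coefficient on T (0.0150) is the raw difference in subscription rates, while the adjusted coefficient (0.0414) compares customers with the same values of the controls.
  • Month fixed effects absorb month-to-month differences in subscription rates, so T is compared within a month. They address seasonal confounding only; confounders left out of the model can still bias the estimate.
  • FWL shows how controls work: regressing residualized Y on residualized T gives the same T coefficient as the regression that includes those controls directly.
  • Interactions show when cellular contact works better: the T:C(month) terms are largest for June, September, and March, and the remaining months are not statistically distinguishable from the April baseline at the 5% level.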

Written on December 22, 2025