Linear Regression for Causal Inference
Bank Marketing Case Study
This notebook illustrates Chapter 4 of Matheus Facure’s Causal Inference in Python
using the Bank Marketing dataset.
Causal question:
Does contacting customers by cellular instead of telephone increase the probability
of subscribing to a term deposit?
- Treatment T: contact = cellular (1) vs telephone (0)
- Outcome Y: subscription = yes (1) vs no (0)
Dataset (links + download options)
We use the Bank Marketing dataset (Portuguese bank direct marketing campaigns).
Official source (UCI Machine Learning Repository)
- Dataset page: https://archive.ics.uci.edu/dataset/222/bank+marketing
(Direct downloadable files are linked on that page.)
Kaggle mirror (CSV download; requires Kaggle login)
- https://www.kaggle.com/datasets/janiobachmann/bank-marketing-dataset
Which file should you use?
- Kaggle typically provides bank-full.csv / bank.csv
- The UCI dataset provides multiple formats; this notebook loads from UCI via ucimlrepo
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import os
np.random.seed(42)
FIG_DIR = "figures_ch4_bank_marketing"
os.makedirs(FIG_DIR, exist_ok=True)
Install the ucimlrepo package if it is missing (pip install ucimlrepo); this run installed ucimlrepo 0.0.7.
Load public Bank Marketing data (UCI)
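from ucimlrepo import fetch_ucirepo
# Fetch the Bank Marketing dataset (UCI id 222) and join features with the target
bank = fetch_ucirepo(id=222)
df = pd.concat([bank.data.features, bank.data.targets], axis=1)
# Binary outcome: 1 if the customer subscribed to a term deposit
df["Y"] = (df["y"].astype(str).str.lower() == "yes").astype(int)
# Keep only rows with a known contact channel (missing values become "nan" and are filtered out)
df["contact"] = df["contact"].astype(str).str.lower()
df = df[df["contact"].isin(["cellular", "telephone"])].copy()
# Binary treatment: 1 for cellular, 0 for telephone
df["T"] = (df["contact"] == "cellular").astype(int)
df.head()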
| | age | job | marital | education | default | balance | housing | loan | contact | day_of_week | month | duration | campaign | pdays | previous | poutcome | y | Y | T |
| 12657 | 27 | management | single | secondary | no | 35 | no | no | cellular | 4 | jul | 255 | 1 | -1 | 0 | NaN | no | 0 | 1 |
| 12658 | 54 | blue-collar | married | primary | no | 466 | no | no | cellular | 4 | jul | 297 | 1 | -1 | 0 | NaN | no | 0 | 1 |
| 12659 | 43 | blue-collar | married | secondary | no | 105 | no | yes | cellular | 4 | jul | 668 | 2 | -1 | 0 | NaN | no | 0 | 1 |
| 12660 | 31 | technician | single | secondary | no | 19 | no | no | telephone | 4 | jul | 65 | 2 | -1 | 0 | NaN | no | 0 | 0 |
| 12661 | 27 | technician | single | secondary | no | 126 | yes | yes | cellular | 4 | jul | 436 | 4 | -1 | 0 | NaN | no | 0 | 1 |
Naive regression (difference in means)
m_naive = smf.ols("Y ~ T", data=df).fit()
m_naive.summary().tables[1]
| | coef | std err | t | P>|t| | [0.025 | 0.975] |
| Intercept | 0.1342 | 0.007 | 20.384 | 0.000 | 0.121 | 0.147 |
| T | 0.0150 | 0.007 | 2.171 | 0.030 | 0.001 | 0.029 |
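With a single binary regressor, the OLS slope equals the difference in group means and the intercept equals the mean of the untreated group. The following is a minimal sketch on simulated data that checks this with plain NumPy least squares; the sample size, the subscription rates and the variable names are illustrative assumptions, not the bank data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated binary treatment and binary outcome (illustrative values only)
n = 5_000
t = rng.integers(0, 2, size=n).astype(float)
y = rng.binomial(1, 0.13 + 0.02 * t).astype(float)

# OLS of y on a constant and t
X = np.column_stack([np.ones(n), t])
intercept, slope = np.linalg.lstsq(X, y, rcond=None)[0]

# Raw difference in means between treated and untreated units
diff_in_means = y[t == 1].mean() - y[t == 0].mean()

print(f"OLS slope:           {slope:.6f}")
print(f"Difference in means: {diff_in_means:.6f}")
```

Read this way, the intercept of 0.1342 in the table above is the subscription rate among telephone contacts, and 0.0150 is the raw gap for cellular contacts.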
Adjusted regression with month fixed effects
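# Numeric controls; "day" is not a column in this version of the data
# (the day column is named day_of_week), so the filter below drops it
num_controls = ["age", "balance", "campaign", "pdays", "previous", "day"]
num_controls = [c for c in num_controls if c in df.columns]
# Categorical controls, including month fixed effects via C(month)
cat_controls = ["job", "marital", "education", "housing", "loan", "month", "poutcome"]
cat_controls = [c for c in cat_controls if c in df.columns]
# Build the formula: Y ~ T + numeric controls + C(categorical controls)
formula = "Y ~ T"
for c in num_controls:
    formula += f" + {c}"
for c in cat_controls:
    formula += f" + C({c})"
# Rows with a missing value in any listed variable are dropped by the formula interface
m_adj = smf.ols(formula, data=df).fit()
m_adj.summary().tables[1]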
| | coef | std err | t | P>|t| | [0.025 | 0.975] |
| Intercept | 0.0799 | 0.040 | 2.003 | 0.045 | 0.002 | 0.158 |
| C(job)[T.blue-collar] | -0.0197 | 0.015 | -1.313 | 0.189 | -0.049 | 0.010 |
| C(job)[T.entrepreneur] | -0.0317 | 0.027 | -1.169 | 0.243 | -0.085 | 0.022 |
| C(job)[T.housemaid] | -0.0374 | 0.032 | -1.178 | 0.239 | -0.100 | 0.025 |
| C(job)[T.management] | 0.0140 | 0.016 | 0.865 | 0.387 | -0.018 | 0.046 |
| C(job)[T.retired] | 0.0218 | 0.024 | 0.928 | 0.354 | -0.024 | 0.068 |
| C(job)[T.self-employed] | 0.0060 | 0.025 | 0.239 | 0.811 | -0.043 | 0.055 |
| C(job)[T.services] | 0.0029 | 0.017 | 0.167 | 0.867 | -0.031 | 0.037 |
| C(job)[T.student] | 0.0855 | 0.027 | 3.213 | 0.001 | 0.033 | 0.138 |
| C(job)[T.technician] | -0.0068 | 0.015 | -0.458 | 0.647 | -0.036 | 0.022 |
| C(job)[T.unemployed] | 0.0746 | 0.027 | 2.773 | 0.006 | 0.022 | 0.127 |
| C(marital)[T.married] | 0.0178 | 0.013 | 1.355 | 0.175 | -0.008 | 0.043 |
| C(marital)[T.single] | 0.0261 | 0.015 | 1.744 | 0.081 | -0.003 | 0.055 |
| C(education)[T.secondary] | 0.0108 | 0.014 | 0.798 | 0.425 | -0.016 | 0.037 |
| C(education)[T.tertiary] | 0.0287 | 0.017 | 1.727 | 0.084 | -0.004 | 0.061 |
| C(housing)[T.yes] | -0.1011 | 0.010 | -10.009 | 0.000 | -0.121 | -0.081 |
| C(loan)[T.yes] | -0.0382 | 0.012 | -3.237 | 0.001 | -0.061 | -0.015 |
| C(month)[T.aug] | 0.1050 | 0.020 | 5.240 | 0.000 | 0.066 | 0.144 |
| C(month)[T.dec] | 0.1243 | 0.036 | 3.495 | 0.000 | 0.055 | 0.194 |
| C(month)[T.feb] | -0.0059 | 0.016 | -0.359 | 0.720 | -0.038 | 0.026 |
| C(month)[T.jan] | -0.0727 | 0.020 | -3.674 | 0.000 | -0.112 | -0.034 |
| C(month)[T.jul] | 0.1828 | 0.027 | 6.877 | 0.000 | 0.131 | 0.235 |
| C(month)[T.jun] | 0.1547 | 0.024 | 6.543 | 0.000 | 0.108 | 0.201 |
| C(month)[T.mar] | 0.2091 | 0.030 | 6.869 | 0.000 | 0.149 | 0.269 |
| C(month)[T.may] | -0.0257 | 0.013 | -1.969 | 0.049 | -0.051 | -0.000 |
| C(month)[T.nov] | -0.0323 | 0.016 | -2.044 | 0.041 | -0.063 | -0.001 |
| C(month)[T.oct] | 0.1450 | 0.023 | 6.182 | 0.000 | 0.099 | 0.191 |
| C(month)[T.sep] | 0.2023 | 0.024 | 8.306 | 0.000 | 0.155 | 0.250 |
| C(poutcome)[T.other] | 0.0334 | 0.010 | 3.318 | 0.001 | 0.014 | 0.053 |
| C(poutcome)[T.success] | 0.4063 | 0.012 | 34.528 | 0.000 | 0.383 | 0.429 |
| T | 0.0414 | 0.016 | 2.639 | 0.008 | 0.011 | 0.072 |
| age | 0.0008 | 0.000 | 1.569 | 0.117 | -0.000 | 0.002 |
| balance | 3.072e-06 | 1.32e-06 | 2.320 | 0.020 | 4.76e-07 | 5.67e-06 |
| campaign | -0.0144 | 0.003 | -5.478 | 0.000 | -0.020 | -0.009 |
| pdays | 0.0001 | 4.29e-05 | 3.052 | 0.002 | 4.69e-05 | 0.000 |
| previous | 0.0018 | 0.001 | 2.059 | 0.040 | 8.62e-05 | 0.004 |
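The statsmodels formula interface drops rows with a missing value in any model variable by default, and the data preview shows poutcome as NaN for the first rows. The adjusted model may therefore be fit on fewer rows than the naive one, which would also help explain the larger standard error on T. Below is a small sketch of how to count complete cases before fitting; the toy frame and the helper name complete_case_count are hypothetical and only stand in for the bank data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the bank data (values are made up for illustration)
toy = pd.DataFrame({
    "Y": [0, 1, 0, 0, 1, 0],
    "T": [1, 1, 0, 1, 0, 1],
    "age": [27, 54, 43, 31, 27, 60],
    "poutcome": [np.nan, "success", np.nan, "failure", np.nan, "other"],
})

def complete_case_count(frame, columns):
    """Number of rows with no missing value in any of the given columns."""
    return len(frame.dropna(subset=columns))

n_naive = complete_case_count(toy, ["Y", "T"])
n_adjusted = complete_case_count(toy, ["Y", "T", "age", "poutcome"])

print(f"rows usable for Y ~ T:                  {n_naive}")
print(f"rows usable for Y ~ T + age + poutcome: {n_adjusted}")
```

On the fitted models, comparing m_naive.nobs with m_adj.nobs gives the same information. If the two differ a lot, the two coefficients on T describe different subsets of customers and should be compared with care.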
Frisch–Waugh–Lovell (FWL) theorem
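# Step 1: residualise treatment and outcome on the same set of controls
f_T = "T ~ age + balance + C(month)"
f_Y = "Y ~ age + balance + C(month)"
mT = smf.ols(f_T, data=df).fit()
mY = smf.ols(f_Y, data=df).fit()
df["T_res"] = mT.resid
df["Y_res"] = mY.resid
# Step 2: regress outcome residuals on treatment residuals; the slope matches the
# coefficient on T in "Y ~ T + age + balance + C(month)"
m_fwl = smf.ols("Y_res ~ T_res", data=df).fit()
m_fwl.summary().tables[1]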
| | coef | std err | t | P>|t| | [0.025 | 0.975] |
| Intercept | 7.752e-15 | 0.002 | 4.09e-12 | 1.000 | -0.004 | 0.004 |
| T_res | 0.0348 | 0.007 | 5.130 | 0.000 | 0.022 | 0.048 |
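The FWL theorem says that the coefficient on T in a multiple regression equals the slope from regressing residualised Y on residualised T, where both are residualised on the same controls. The sketch below checks that equality on simulated data with plain NumPy; the data-generating process, its coefficients and the helper ols are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2_000

# Simulated confounder x, treatment t that depends on x, and outcome y
# (all coefficients are made up for illustration)
x = rng.normal(size=n)
t = (x + rng.normal(size=n) > 0).astype(float)
y = 0.5 * t + 1.0 * x + rng.normal(size=n)

def ols(X, target):
    """Least-squares coefficients of target on the columns of X."""
    return np.linalg.lstsq(X, target, rcond=None)[0]

controls = np.column_stack([np.ones(n), x])

# Full regression: y on [1, x, t]; the last coefficient belongs to t
beta_full = ols(np.column_stack([controls, t]), y)[-1]

# FWL: residualise t and y on the controls, then regress residual on residual
t_res = t - controls @ ols(controls, t)
y_res = y - controls @ ols(controls, y)
beta_fwl = ols(t_res.reshape(-1, 1), y_res)[0]

print(f"full regression: {beta_full:.6f}")
print(f"FWL residuals:   {beta_fwl:.6f}")
```

Because the notebook residualises only on age, balance and month, the 0.0348 above corresponds to a regression with those three controls, not to the fully adjusted model (0.0414). The standard error from the residual regression can also differ slightly from the full regression, since the second stage does not account for the degrees of freedom used in the first.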
Heterogeneous effects (interaction with month)
m_inter = smf.ols("Y ~ T*C(month) + age + balance", data=df).fit()
m_inter.summary().tables[1]
| | coef | std err | t | P>|t| | [0.025 | 0.975] |
| Intercept | 0.1503 | 0.026 | 5.840 | 0.000 | 0.100 | 0.201 |
| C(month)[T.aug] | -0.0609 | 0.032 | -1.880 | 0.060 | -0.124 | 0.003 |
| C(month)[T.dec] | 0.1768 | 0.061 | 2.907 | 0.004 | 0.058 | 0.296 |
| C(month)[T.feb] | -0.0500 | 0.032 | -1.579 | 0.114 | -0.112 | 0.012 |
| C(month)[T.jan] | -0.0935 | 0.038 | -2.433 | 0.015 | -0.169 | -0.018 |
| C(month)[T.jul] | -0.1160 | 0.027 | -4.333 | 0.000 | -0.168 | -0.064 |
| C(month)[T.jun] | 0.0530 | 0.045 | 1.179 | 0.239 | -0.035 | 0.141 |
| C(month)[T.mar] | 0.2151 | 0.053 | 4.096 | 0.000 | 0.112 | 0.318 |
| C(month)[T.may] | -0.1171 | 0.029 | -4.057 | 0.000 | -0.174 | -0.061 |
| C(month)[T.nov] | -0.0911 | 0.030 | -3.063 | 0.002 | -0.149 | -0.033 |
| C(month)[T.oct] | 0.2386 | 0.038 | 6.242 | 0.000 | 0.164 | 0.314 |
| C(month)[T.sep] | 0.1794 | 0.048 | 3.713 | 0.000 | 0.085 | 0.274 |
| T | 0.0127 | 0.025 | 0.508 | 0.612 | -0.036 | 0.062 |
| T:C(month)[T.aug] | -0.0275 | 0.033 | -0.826 | 0.409 | -0.093 | 0.038 |
| T:C(month)[T.dec] | 0.1171 | 0.066 | 1.764 | 0.078 | -0.013 | 0.247 |
| T:C(month)[T.feb] | 0.0232 | 0.033 | 0.703 | 0.482 | -0.042 | 0.088 |
| T:C(month)[T.jan] | 0.0015 | 0.040 | 0.036 | 0.971 | -0.077 | 0.080 |
| T:C(month)[T.jul] | 0.0183 | 0.028 | 0.656 | 0.512 | -0.036 | 0.073 |
| T:C(month)[T.jun] | 0.1884 | 0.047 | 3.995 | 0.000 | 0.096 | 0.281 |
| T:C(month)[T.mar] | 0.1182 | 0.055 | 2.130 | 0.033 | 0.009 | 0.227 |
| T:C(month)[T.may] | 0.0429 | 0.030 | 1.431 | 0.152 | -0.016 | 0.102 |
| T:C(month)[T.nov] | -0.0110 | 0.031 | -0.355 | 0.723 | -0.072 | 0.050 |
| T:C(month)[T.oct] | 0.0058 | 0.041 | 0.141 | 0.888 | -0.075 | 0.087 |
| T:C(month)[T.sep] | 0.1375 | 0.051 | 2.684 | 0.007 | 0.037 | 0.238 |
| age | 0.0007 | 0.000 | 3.828 | 0.000 | 0.000 | 0.001 |
| balance | 4.007e-06 | 6.04e-07 | 6.631 | 0.000 | 2.82e-06 | 5.19e-06 |
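In an interaction model the coefficient on T is the estimated effect in the omitted baseline month, and the effect in any other month is that coefficient plus the month's interaction term. The sketch below does this arithmetic with the point estimates copied from the table above. It assumes April, the one month absent from the table, is the baseline.

```python
# Point estimates copied from the interaction table above
beta_t = 0.0127  # coefficient on T, i.e. the effect in the baseline month (assumed April)
interactions = {
    "aug": -0.0275, "dec": 0.1171, "feb": 0.0232, "jan": 0.0015,
    "jul": 0.0183, "jun": 0.1884, "mar": 0.1182, "may": 0.0429,
    "nov": -0.0110, "oct": 0.0058, "sep": 0.1375,
}

# Month-specific estimate = main effect + that month's interaction term
effects = {"apr": beta_t}
effects.update({m: round(beta_t + b, 4) for m, b in interactions.items()})

# Print months from largest to smallest estimated effect
for month, effect in sorted(effects.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{month}: {effect:+.4f}")
```

These are point estimates only. A standard error for each sum also needs the covariance between the two coefficients, which the summary table does not report.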
Key takeaways
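- Regression is an adjusted comparison: with only T on the right-hand side it reproduces the difference in means (0.0150); with controls it compares cellular and telephone contacts among customers with similar covariates (0.0414).
- Month fixed effects absorb seasonal differences in subscription rates, which reduces bias when both the contact channel and subscriptions vary by month. They do not address confounders left out of the model.
- FWL shows what controlling does: the coefficient on T equals the slope from regressing residualised Y on residualised T, so only the variation in T that the controls do not explain is used.
- Interactions show when the gap between cellular and telephone differs: the estimated gap is larger in June, March and September than in the omitted baseline month.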