Bank_marketing_causal_regression
Chapter 4 — Linear Regression for Causal Inference
Bank Marketing Case Study
This notebook illustrates Chapter 4 of Matheus Facure’s Causal Inference in Python using the Bank Marketing dataset.
Causal question:
Does contacting customers by cellular instead of telephone increase the probability
of subscribing to a term deposit?
- Treatment T: contact = cellular (1) vs telephone (0)
- Outcome Y: subscription = yes (1) vs no (0)
Dataset (links + download options)
We use the Bank Marketing dataset (Portuguese bank direct marketing campaigns).
Official source (UCI Machine Learning Repository)
- Dataset page: https://archive.ics.uci.edu/dataset/222/bank+marketing
(Direct downloadable files are linked on that page.)
Kaggle mirror (CSV download; requires Kaggle login)
- https://www.kaggle.com/datasets/janiobachmann/bank-marketing-dataset
Which file should you use?
- Kaggle typically provides bank-full.csv / bank.csv
- The UCI dataset provides multiple formats; this notebook loads from UCI via ucimlrepo
📘 Causal Linear Regression — Blog Interpretation Layer
What problem are we solving?
We are trying to estimate the causal effect of a treatment variable on an outcome variable using linear regression under causal assumptions.
Predictive vs Causal Regression
Predictive regression answers:
If X changes, how does Y move in historical data?
Causal regression answers:
If we intervene and change Treatment, how does Y change?
Required assumptions:
- No unobserved confounding (Conditional Ignorability)
- Correct functional form (or good approximation)
- No post-treatment leakage
- Sufficient overlap between treated and control populations
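Concretely, the adjusted regressions below target a model of the form (notation mine, not the book's):

$$Y_i = \beta_0 + \tau\, T_i + \beta^\top X_i + e_i$$

where $\tau$ is read causally as the average effect of cellular contact only when the assumptions above hold.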
Business Interpretation
In marketing / fintech:
- Treatment coefficient ≈ Incremental lift
- Positive → treatment helps
- Negative → treatment hurts
- Near zero → no incremental value
⚠️ Key Causal Assumptions Being Used Here
1️⃣ Conditional Ignorability
After controlling for covariates X: Treatment ⟂ Potential Outcomes
If violated → biased effect estimate
2️⃣ Overlap (Positivity)
Every customer has a nonzero probability of being contacted by either channel (cellular or telephone), given covariates X
If violated:
- Extrapolation risk
- Unstable coefficients
3️⃣ No Post-Treatment Controls
Do NOT include variables influenced by the treatment. Doing so creates collider bias or blocks part of the treatment effect. (In this dataset, call duration is determined by the contact itself, so it should not be used as a control.)
🧪 Diagnostics — Causal Meaning
Residual Diagnostics:
- Random residuals → Model specification reasonable
- Patterned residuals → Possible nonlinearity or missing confounder
Coefficient Stability:
- Large swings across specs → weak identification or collinearity
Overlap Checks: if the treated and control covariate distributions differ heavily, the model extrapolates and the causal estimate becomes fragile.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import os
np.random.seed(42)
FIG_DIR = "figures_ch4_bank_marketing"
os.makedirs(FIG_DIR, exist_ok=True)
!pip install ucimlrepo
Load public Bank Marketing data (UCI)
from ucimlrepo import fetch_ucirepo

# Fetch the Bank Marketing dataset (UCI id=222) and combine features + target.
bank = fetch_ucirepo(id=222)
df = pd.concat([bank.data.features, bank.data.targets], axis=1)

# Outcome: 1 if the customer subscribed to a term deposit.
df["Y"] = (df["y"].astype(str).str.lower() == "yes").astype(int)

# Treatment: cellular (1) vs telephone (0); rows with unknown contact type are dropped.
df["contact"] = df["contact"].astype(str).str.lower()
df = df[df["contact"].isin(["cellular", "telephone"])].copy()
df["T"] = (df["contact"] == "cellular").astype(int)
df.head()
| | age | job | marital | education | default | balance | housing | loan | contact | day_of_week | month | duration | campaign | pdays | previous | poutcome | y | Y | T |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12657 | 27 | management | single | secondary | no | 35 | no | no | cellular | 4 | jul | 255 | 1 | -1 | 0 | NaN | no | 0 | 1 |
| 12658 | 54 | blue-collar | married | primary | no | 466 | no | no | cellular | 4 | jul | 297 | 1 | -1 | 0 | NaN | no | 0 | 1 |
| 12659 | 43 | blue-collar | married | secondary | no | 105 | no | yes | cellular | 4 | jul | 668 | 2 | -1 | 0 | NaN | no | 0 | 1 |
| 12660 | 31 | technician | single | secondary | no | 19 | no | no | telephone | 4 | jul | 65 | 2 | -1 | 0 | NaN | no | 0 | 0 |
| 12661 | 27 | technician | single | secondary | no | 126 | yes | yes | cellular | 4 | jul | 436 | 4 | -1 | 0 | NaN | no | 0 | 1 |
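Before running regressions, a quick overlap check is useful (this sketch is an addition, not part of the original notebook): if some months are contacted almost exclusively by one channel, the month-adjusted comparison in those months rests on extrapolation.

# Treatment share by month: months dominated by one contact type have weak overlap.
month_overlap = pd.crosstab(df["month"], df["T"], normalize="index")
month_overlap.columns = ["telephone_share", "cellular_share"]
print(month_overlap.round(2))

# Numeric covariates: quick comparison of means by treatment arm.
print(df.groupby("T")[["age", "balance", "campaign"]].mean().round(2))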
Naive regression (difference in means)
📊 How to Interpret the Treatment Coefficient
Treatment Coefficient ≈ Average Treatment Effect (ATE), if the assumptions above hold.
Example: a coefficient of 0.12 on T would mean that, holding confounders fixed, cellular contact raises the subscription probability by about 12 percentage points on average.
Business Translation: expected incremental lift per treated customer ≈ the coefficient value.
m_naive = smf.ols("Y ~ T", data=df).fit()
m_naive.summary().tables[1]
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.1342 | 0.007 | 20.384 | 0.000 | 0.121 | 0.147 |
| T | 0.0150 | 0.007 | 2.171 | 0.030 | 0.001 | 0.029 |
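Because the naive model has a single binary regressor, the slope on T is just the raw difference in subscription rates between the two channels. A quick check (added sketch):

# The OLS slope on T should equal the difference in mean subscription rates (≈ 0.015 here).
rates = df.groupby("T")["Y"].mean()
print(rates.round(4))
print("difference:", round(rates.loc[1] - rates.loc[0], 4))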
Adjusted regression with customer covariates and month fixed effects
num_controls = ["age","balance","campaign","pdays","previous","day"]
num_controls = [c for c in num_controls if c in df.columns]
cat_controls = ["job","marital","education","housing","loan","month","poutcome"]
cat_controls = [c for c in cat_controls if c in df.columns]
formula = "Y ~ T"
for c in num_controls:
formula += f" + {c}"
for c in cat_controls:
formula += f" + C({c})"
m_adj = smf.ols(formula, data=df).fit()
m_adj.summary().tables[1]
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.0799 | 0.040 | 2.003 | 0.045 | 0.002 | 0.158 |
| C(job)[T.blue-collar] | -0.0197 | 0.015 | -1.313 | 0.189 | -0.049 | 0.010 |
| C(job)[T.entrepreneur] | -0.0317 | 0.027 | -1.169 | 0.243 | -0.085 | 0.022 |
| C(job)[T.housemaid] | -0.0374 | 0.032 | -1.178 | 0.239 | -0.100 | 0.025 |
| C(job)[T.management] | 0.0140 | 0.016 | 0.865 | 0.387 | -0.018 | 0.046 |
| C(job)[T.retired] | 0.0218 | 0.024 | 0.928 | 0.354 | -0.024 | 0.068 |
| C(job)[T.self-employed] | 0.0060 | 0.025 | 0.239 | 0.811 | -0.043 | 0.055 |
| C(job)[T.services] | 0.0029 | 0.017 | 0.167 | 0.867 | -0.031 | 0.037 |
| C(job)[T.student] | 0.0855 | 0.027 | 3.213 | 0.001 | 0.033 | 0.138 |
| C(job)[T.technician] | -0.0068 | 0.015 | -0.458 | 0.647 | -0.036 | 0.022 |
| C(job)[T.unemployed] | 0.0746 | 0.027 | 2.773 | 0.006 | 0.022 | 0.127 |
| C(marital)[T.married] | 0.0178 | 0.013 | 1.355 | 0.175 | -0.008 | 0.043 |
| C(marital)[T.single] | 0.0261 | 0.015 | 1.744 | 0.081 | -0.003 | 0.055 |
| C(education)[T.secondary] | 0.0108 | 0.014 | 0.798 | 0.425 | -0.016 | 0.037 |
| C(education)[T.tertiary] | 0.0287 | 0.017 | 1.727 | 0.084 | -0.004 | 0.061 |
| C(housing)[T.yes] | -0.1011 | 0.010 | -10.009 | 0.000 | -0.121 | -0.081 |
| C(loan)[T.yes] | -0.0382 | 0.012 | -3.237 | 0.001 | -0.061 | -0.015 |
| C(month)[T.aug] | 0.1050 | 0.020 | 5.240 | 0.000 | 0.066 | 0.144 |
| C(month)[T.dec] | 0.1243 | 0.036 | 3.495 | 0.000 | 0.055 | 0.194 |
| C(month)[T.feb] | -0.0059 | 0.016 | -0.359 | 0.720 | -0.038 | 0.026 |
| C(month)[T.jan] | -0.0727 | 0.020 | -3.674 | 0.000 | -0.112 | -0.034 |
| C(month)[T.jul] | 0.1828 | 0.027 | 6.877 | 0.000 | 0.131 | 0.235 |
| C(month)[T.jun] | 0.1547 | 0.024 | 6.543 | 0.000 | 0.108 | 0.201 |
| C(month)[T.mar] | 0.2091 | 0.030 | 6.869 | 0.000 | 0.149 | 0.269 |
| C(month)[T.may] | -0.0257 | 0.013 | -1.969 | 0.049 | -0.051 | -0.000 |
| C(month)[T.nov] | -0.0323 | 0.016 | -2.044 | 0.041 | -0.063 | -0.001 |
| C(month)[T.oct] | 0.1450 | 0.023 | 6.182 | 0.000 | 0.099 | 0.191 |
| C(month)[T.sep] | 0.2023 | 0.024 | 8.306 | 0.000 | 0.155 | 0.250 |
| C(poutcome)[T.other] | 0.0334 | 0.010 | 3.318 | 0.001 | 0.014 | 0.053 |
| C(poutcome)[T.success] | 0.4063 | 0.012 | 34.528 | 0.000 | 0.383 | 0.429 |
| T | 0.0414 | 0.016 | 2.639 | 0.008 | 0.011 | 0.072 |
| age | 0.0008 | 0.000 | 1.569 | 0.117 | -0.000 | 0.002 |
| balance | 3.072e-06 | 1.32e-06 | 2.320 | 0.020 | 4.76e-07 | 5.67e-06 |
| campaign | -0.0144 | 0.003 | -5.478 | 0.000 | -0.020 | -0.009 |
| pdays | 0.0001 | 4.29e-05 | 3.052 | 0.002 | 4.69e-05 | 0.000 |
| previous | 0.0018 | 0.001 | 2.059 | 0.040 | 8.62e-05 | 0.004 |
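A simple coefficient-stability check (added sketch, tying back to the diagnostics above): compare the estimated lift of cellular contact across the naive and adjusted specifications.

# Estimated effect of cellular contact under the two specifications.
stability = pd.DataFrame(
    {
        "naive": [m_naive.params["T"], *m_naive.conf_int().loc["T"]],
        "adjusted": [m_adj.params["T"], *m_adj.conf_int().loc["T"]],
    },
    index=["coef", "ci_low", "ci_high"],
).T
print(stability.round(3))

Here the adjusted estimate (~0.041) is larger than the naive one (~0.015), which suggests the naive comparison was biased downward by seasonality and customer mix rather than upward.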
Frisch–Waugh–Lovell (FWL) theorem
The FWL theorem says that in the regression Y ~ T + X, the coefficient on T can be recovered in three steps:
1. Regress T on X and keep the residuals (the part of treatment the controls cannot predict).
2. Regress Y on X and keep the residuals.
3. Regress the Y-residuals on the T-residuals; the slope equals the T coefficient from the full regression.
This makes explicit what "controlling for X" does: the effect is identified from treatment variation that is unexplained by the controls. Here X = age, balance, and month dummies.
f_T = "T ~ age + balance + C(month)"
f_Y = "Y ~ age + balance + C(month)"
mT = smf.ols(f_T, data=df).fit()
mY = smf.ols(f_Y, data=df).fit()
df["T_res"] = mT.resid
df["Y_res"] = mY.resid
m_fwl = smf.ols("Y_res ~ T_res", data=df).fit()
m_fwl.summary().tables[1]
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 7.752e-15 | 0.002 | 4.09e-12 | 1.000 | -0.004 | 0.004 |
| T_res | 0.0348 | 0.007 | 5.130 | 0.000 | 0.022 | 0.048 |
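As a consistency check (added sketch), the slope on T_res should match the T coefficient from the full regression that uses the same controls directly:

# By FWL, these two point estimates are identical (≈ 0.0348 above).
m_full = smf.ols("Y ~ T + age + balance + C(month)", data=df).fit()
print(round(m_full.params["T"], 4), round(m_fwl.params["T_res"], 4))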
Heterogeneous effects (interaction with month)
# Interact treatment with month so the effect of cellular contact can vary by month.
m_inter = smf.ols("Y ~ T*C(month) + age + balance", data=df).fit()
m_inter.summary().tables[1]
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.1503 | 0.026 | 5.840 | 0.000 | 0.100 | 0.201 |
| C(month)[T.aug] | -0.0609 | 0.032 | -1.880 | 0.060 | -0.124 | 0.003 |
| C(month)[T.dec] | 0.1768 | 0.061 | 2.907 | 0.004 | 0.058 | 0.296 |
| C(month)[T.feb] | -0.0500 | 0.032 | -1.579 | 0.114 | -0.112 | 0.012 |
| C(month)[T.jan] | -0.0935 | 0.038 | -2.433 | 0.015 | -0.169 | -0.018 |
| C(month)[T.jul] | -0.1160 | 0.027 | -4.333 | 0.000 | -0.168 | -0.064 |
| C(month)[T.jun] | 0.0530 | 0.045 | 1.179 | 0.239 | -0.035 | 0.141 |
| C(month)[T.mar] | 0.2151 | 0.053 | 4.096 | 0.000 | 0.112 | 0.318 |
| C(month)[T.may] | -0.1171 | 0.029 | -4.057 | 0.000 | -0.174 | -0.061 |
| C(month)[T.nov] | -0.0911 | 0.030 | -3.063 | 0.002 | -0.149 | -0.033 |
| C(month)[T.oct] | 0.2386 | 0.038 | 6.242 | 0.000 | 0.164 | 0.314 |
| C(month)[T.sep] | 0.1794 | 0.048 | 3.713 | 0.000 | 0.085 | 0.274 |
| T | 0.0127 | 0.025 | 0.508 | 0.612 | -0.036 | 0.062 |
| T:C(month)[T.aug] | -0.0275 | 0.033 | -0.826 | 0.409 | -0.093 | 0.038 |
| T:C(month)[T.dec] | 0.1171 | 0.066 | 1.764 | 0.078 | -0.013 | 0.247 |
| T:C(month)[T.feb] | 0.0232 | 0.033 | 0.703 | 0.482 | -0.042 | 0.088 |
| T:C(month)[T.jan] | 0.0015 | 0.040 | 0.036 | 0.971 | -0.077 | 0.080 |
| T:C(month)[T.jul] | 0.0183 | 0.028 | 0.656 | 0.512 | -0.036 | 0.073 |
| T:C(month)[T.jun] | 0.1884 | 0.047 | 3.995 | 0.000 | 0.096 | 0.281 |
| T:C(month)[T.mar] | 0.1182 | 0.055 | 2.130 | 0.033 | 0.009 | 0.227 |
| T:C(month)[T.may] | 0.0429 | 0.030 | 1.431 | 0.152 | -0.016 | 0.102 |
| T:C(month)[T.nov] | -0.0110 | 0.031 | -0.355 | 0.723 | -0.072 | 0.050 |
| T:C(month)[T.oct] | 0.0058 | 0.041 | 0.141 | 0.888 | -0.075 | 0.087 |
| T:C(month)[T.sep] | 0.1375 | 0.051 | 2.684 | 0.007 | 0.037 | 0.238 |
| age | 0.0007 | 0.000 | 3.828 | 0.000 | 0.000 | 0.001 |
| balance | 4.007e-06 | 6.04e-07 | 6.631 | 0.000 | 2.82e-06 | 5.19e-06 |
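To read the interactions, the month-specific lift of cellular contact is the base T coefficient plus the relevant interaction term (the omitted reference month is apr in this run). A small sketch (added, not in the original notebook):

# Month-specific effect of cellular contact: base coefficient + interaction term.
base = m_inter.params["T"]
month_effects = {
    m: base + m_inter.params.get(f"T:C(month)[T.{m}]", 0.0)
    for m in sorted(df["month"].unique())
}
print(pd.Series(month_effects).round(3).sort_values())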
Key takeaways
- Regression = adjusted comparison
- Month fixed effects remove seasonality bias
- FWL explains why controls work
- Interactions show when marketing works better
🧠 When Linear Regression Works Well for Causal Inference
✅ Large sample size
✅ Good overlap
✅ Strong confounder coverage
✅ Approximately linear effect
🚫 When It Struggles
❌ Strong nonlinear HTE
❌ Hidden confounders
❌ Extreme treatment imbalance
❌ Post-treatment variable leakage
🔄 Bridge to Meta-Learners and Forests
If linear model struggles:
→ S-Learner / T-Learner
→ X-Learner
→ Causal Forests
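For orientation, a minimal T-learner sketch using only the tools already imported here (illustrative addition; it adjusts only for age and balance and is not a substitute for the meta-learner chapters):

# T-learner: fit one outcome model per treatment arm, then contrast predictions.
f_out = "Y ~ age + balance"
m1 = smf.ols(f_out, data=df[df["T"] == 1]).fit()
m0 = smf.ols(f_out, data=df[df["T"] == 0]).fit()
cate = m1.predict(df) - m0.predict(df)  # per-customer effect estimates
print(cate.describe().round(3))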