Inverse Propensity Weighting & Doubly Robust Estimation

Propensity Scores in Practice

A fully commented, marketing-focused notebook

Goal of this notebook

This notebook is designed to teach Chapter 5 of Matheus Facure – Causal Inference in Python, not just run code.

You will see:

  • Why naive analysis is biased
  • How propensity scores fix selection bias
  • When IPW breaks
  • Why Doubly Robust (DR) is safer
  • How Double Machine Learning fits conceptually
  • How to explain results to business / MRM stakeholders

We use a public marketing dataset (the Hillstrom email campaign) so the ideas map directly to real-world marketing analytics.

📊 Hillstrom (MineThatData) Email Marketing Dataset

Direct download (CSV): https://raw.githubusercontent.com/W-Tran/uplift-modelling/master/data/hillstrom/Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv

You can:

  • open this link directly in a browser, or
  • use it in Python with pd.read_csv(url)

🔗 Original source (background & documentation)

MineThatData – Email Analytics Challenge (Kevin Hillstrom): https://www.minethatdata.com/data-mining-challenge/

Big Picture (context)

We want to answer a causal question:

Does sending a marketing email cause higher customer spend / conversion?

This is not the same as:

“Do customers who receive emails spend more?”

Why?

  • Marketing emails are not sent at random
  • Better / more engaged customers are more likely to be targeted
  • That creates selection bias

Chapter 5 is about how to remove that bias.

Methods we will compare

We will implement and compare four approaches, plus a conceptual look at Double ML:

  1. Naive comparison – shows the bias
  2. IPW (Inverse Propensity Weighting) – design-based correction
  3. Outcome regression – model-based correction
  4. Doubly Robust (AIPW) – combines both
  5. Conceptual Double ML – how DR extends to ML safely

For each method, we answer:

  • What assumption does it rely on?
  • When does it work?
  • When does it fail?

Dataset: Hillstrom Email Marketing (Public)

This dataset comes from a well-known marketing challenge.

Interpretation of columns

  • Each row = one customer
  • segment = which email (if any) the customer received
  • Customer attributes = recency, history, channel, etc.
  • Outcomes = spend and conversion

Treatment definition

  • T = 1 → customer received any email
  • T = 0 → customer received no email

This mirrors real-world campaign targeting.


# === Standard imports ===
# We keep models simple and interpretable on purpose.
# Chapter 5 is about *causality*, not model tuning.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import roc_auc_score

np.random.seed(42)


# === Load the Hillstrom dataset (public) ===
# If this fails due to network restrictions, download the CSV manually
# and set DATA_PATH accordingly.

import pathlib, urllib.request

DATA_DIR = pathlib.Path("data")
DATA_DIR.mkdir(exist_ok=True)
DATA_PATH = DATA_DIR / "hillstrom.csv"

URL = "https://raw.githubusercontent.com/W-Tran/uplift-modelling/master/data/hillstrom/Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv"

if not DATA_PATH.exists():
    urllib.request.urlretrieve(URL, DATA_PATH)

df = pd.read_csv(DATA_PATH)
df.head()

recency history_segment history mens womens zip_code newbie channel segment visit conversion spend
0 10 2) $100 - $200 142.44 1 0 Surburban 0 Phone Womens E-Mail 0 0 0.0
1 6 3) $200 - $350 329.08 1 1 Rural 1 Web No E-Mail 0 0 0.0
2 7 2) $100 - $200 180.65 0 1 Surburban 1 Web Womens E-Mail 0 0 0.0
3 9 5) $500 - $750 675.83 1 0 Rural 1 Web Mens E-Mail 0 0 0.0
4 2 1) $0 - $100 45.34 1 0 Urban 0 Web Womens E-Mail 0 0 0.0

# === Define treatment, outcome, and covariates ===

df = df.copy()

# Treatment: received any email vs none
df["T"] = (df["segment"].str.lower() != "no e-mail").astype(int)

# Outcomes
df["Y_spend"] = df["spend"].astype(float)
df["Y_conv"] = df["conversion"].astype(int)

# Covariates (chosen to reflect realistic marketing features)
covariates = [
    "recency", "history", "history_segment",
    "mens", "womens", "zip_code", "newbie", "channel"
]

df[covariates + ["T", "Y_spend", "Y_conv"]].head()

recency history history_segment mens womens zip_code newbie channel T Y_spend Y_conv
0 10 142.44 2) $100 - $200 1 0 Surburban 0 Phone 1 0.0 0
1 6 329.08 3) $200 - $350 1 1 Rural 1 Web 0 0.0 0
2 7 180.65 2) $100 - $200 0 1 Surburban 1 Web 1 0.0 0
3 9 675.83 5) $500 - $750 1 0 Rural 1 Web 1 0.0 0
4 2 45.34 1) $0 - $100 1 0 Urban 0 Web 1 0.0 0

Step 1 – Why naive analysis fails

We start with the wrong approach on purpose.

If we simply compare:

  • average spend of emailed customers
  • vs non-emailed customers

we are implicitly assuming random assignment.

That assumption is false in observational marketing data.

👉 This step is crucial for storytelling:
it shows stakeholders why causal methods are needed.


# === Naive comparison (intentionally biased) ===

naive_spend = df.groupby("T")["Y_spend"].mean()
naive_conv  = df.groupby("T")["Y_conv"].mean()

print("Naive spend difference:", naive_spend[1] - naive_spend[0])
print("Naive conversion difference:", naive_conv[1] - naive_conv[0])

# Interpretation:
# This looks like a treatment effect,
# but it mixes true causal impact with selection bias.

Naive spend difference: 0.5967960667278982
Naive conversion difference: 0.004954571155268468

Step 2 – Propensity Scores (the key idea of Chapter 5)

The propensity score is:

\[ e(X) = P(T = 1 \mid X) \]

Plain English:

“Given what we know about a customer, how likely were they to be emailed?”

Why this matters:

  • Customers with the same propensity score are comparable
  • Conditioning on the propensity score balances all observed covariates

This is the foundation of IPW and DR.


# === Propensity score model ===

X = df[covariates]
T = df["T"].values

num_cols = [c for c in covariates if pd.api.types.is_numeric_dtype(df[c])]
cat_cols = [c for c in covariates if c not in num_cols]

preprocess = ColumnTransformer(
    [
        ("num", "passthrough", num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ]
)

ps_model = Pipeline(
    [
        ("prep", preprocess),
        ("logit", LogisticRegression(max_iter=2000)),
    ]
)

X_train, X_test, T_train, T_test = train_test_split(
    X, T, test_size=0.3, stratify=T, random_state=42
)

ps_model.fit(X_train, T_train)
e_hat = ps_model.predict_proba(X_test)[:, 1]

print("Propensity AUC:", roc_auc_score(T_test, e_hat))

Propensity AUC: 0.49176685713090773

# === Overlap / positivity check ===

plt.hist(e_hat[T_test == 0], bins=30, alpha=0.6, label="Control", density=True)
plt.hist(e_hat[T_test == 1], bins=30, alpha=0.6, label="Treated", density=True)
plt.legend()
plt.title("Propensity Score Overlap")
plt.xlabel("Estimated e(X)")
plt.show()

# Interpretation:
# Good overlap → IPW is feasible
# Poor overlap → expect unstable weights

Figure 1. Propensity score overlap.

Step 3 – IPW (Inverse Propensity Weighting)

Core idea

Instead of changing the model, we change the importance of observations.

  • Customers who were unlikely to be emailed but were emailed → large weight
  • Customers who were very likely to be emailed and were emailed → small weight
  • Non-emailed customers are weighted symmetrically, by 1 / (1 - e(X))

This creates a pseudo-randomized dataset.
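
Concretely, the estimator implemented below is the normalized (Hájek) form of IPW: each group's weighted outcomes are divided by the sum of that group's weights.

\[ \widehat{ATE}_{IPW} = \frac{\sum_{i:\,T_i=1} Y_i/\hat e(X_i)}{\sum_{i:\,T_i=1} 1/\hat e(X_i)} - \frac{\sum_{i:\,T_i=0} Y_i/(1-\hat e(X_i))}{\sum_{i:\,T_i=0} 1/(1-\hat e(X_i))} \]

Because of this normalization, the stabilizing constants in the stabilized version cancel out, which is why the two point estimates printed below coincide.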

Key assumption

  • The propensity model is correctly specified (and there is overlap / positivity, as checked above)

# === IPW estimator ===

def ipw_ate(y, t, e, stabilized=False):
    eps = 1e-6
    e = np.clip(e, eps, 1 - eps)

    if stabilized:
        p = t.mean()
        w = np.where(t == 1, p / e, (1 - p) / (1 - e))
    else:
        w = np.where(t == 1, 1 / e, 1 / (1 - e))

    ate = (
        np.sum(w[t == 1] * y[t == 1]) / np.sum(w[t == 1])
        - np.sum(w[t == 0] * y[t == 0]) / np.sum(w[t == 0])
    )
    return ate, w

Y = df.loc[X_test.index, "Y_spend"].values

ate_ipw, w_raw = ipw_ate(Y, T_test, e_hat, stabilized=False)
ate_ipw_stab, w_stab = ipw_ate(Y, T_test, e_hat, stabilized=True)

print("IPW ATE:", ate_ipw)
print("Stabilized IPW ATE:", ate_ipw_stab)

IPW ATE: 0.6869934772562071
Stabilized IPW ATE: 0.6869934772562072
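
As a quick illustrative check (not part of the chapter's code), we can verify that weighting makes the groups comparable on the covariates. The sketch below compares the weighted means of one numeric covariate, "history", chosen arbitrarily:

# === Illustrative balance check on one covariate ===
# After IPW, treated and control groups should look alike on X.
# "history" is an arbitrary numeric covariate chosen for inspection.

hist = X_test["history"].values

for label, w in [("unweighted", np.ones_like(w_stab)), ("stabilized IPW", w_stab)]:
    mean_t = np.sum(w[T_test == 1] * hist[T_test == 1]) / np.sum(w[T_test == 1])
    mean_c = np.sum(w[T_test == 0] * hist[T_test == 0]) / np.sum(w[T_test == 0])
    print(f"{label}: treated mean = {mean_t:.2f}, control mean = {mean_c:.2f}")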

Step 4 – Why IPW can be dangerous

IPW can fail badly when:

  • Some customers have propensity scores near 0 or 1
  • Weights explode
  • A few customers dominate the estimate

We will:

  • Inspect weight distributions
  • Compute effective sample size (ESS)
  • Apply stabilization and clipping

This is mandatory in real work.


# === Weight diagnostics ===

def ess(w):
    return (w.sum() ** 2) / (w ** 2).sum()

print("Max raw weight:", w_raw.max())
print("ESS raw:", ess(w_raw))
print("ESS stabilized:", ess(w_stab))

# Interpretation:
# Stabilization typically increases the ESS; here it brings the ESS
# close to the number of test observations.

Max raw weight: 4.160191208753913
ESS raw: 17043.204198395597
ESS stabilized: 19193.43392200713
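
The step list above also mentions clipping. A minimal sketch, assuming a 99th-percentile cap (an arbitrary illustrative choice), caps extreme weights and re-estimates:

# === Weight clipping (minimal sketch) ===
# Cap extreme weights at a chosen percentile (99th here, an arbitrary choice).
# Clipping trades a little bias for lower variance when weights are heavy-tailed.

cap = np.percentile(w_stab, 99)
w_clip = np.minimum(w_stab, cap)

ate_clip = (
    np.sum(w_clip[T_test == 1] * Y[T_test == 1]) / np.sum(w_clip[T_test == 1])
    - np.sum(w_clip[T_test == 0] * Y[T_test == 0]) / np.sum(w_clip[T_test == 0])
)

print("ESS after clipping:", ess(w_clip))
print("Clipped IPW ATE:", ate_clip)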

Step 5 – Doubly Robust Estimation (AIPW)

This is the main takeaway of Chapter 5.

Doubly Robust combines:

  1. A propensity model (design-based)
  2. An outcome model (model-based)

Why it's powerful

  • If either model is correct → estimate is consistent
  • You get two chances to be right

This is why DR is the default choice in practice.


# === Outcome regression (separate models per treatment group) ===
# Fit the feature encoding once on the full training covariates so that
# both outcome models and the test set share exactly the same columns.
# (Re-fitting the encoder on each treatment subset risks mismatched
# one-hot columns between the two models.)

preprocess.fit(X_train)

X_train_enc = preprocess.transform(X_train)
X_test_enc = preprocess.transform(X_test)

# Outcome model for the treated group (T = 1)
m1 = LinearRegression().fit(
    X_train_enc[T_train == 1],
    df.loc[X_train.index[T_train == 1], "Y_spend"],
)

# Outcome model for the control group (T = 0)
m0 = LinearRegression().fit(
    X_train_enc[T_train == 0],
    df.loc[X_train.index[T_train == 0], "Y_spend"],
)

# Counterfactual predictions for every test customer
m1_hat = m1.predict(X_test_enc)
m0_hat = m0.predict(X_test_enc)

ate_or = (m1_hat - m0_hat).mean()
print("Outcome regression ATE:", ate_or)

Outcome regression ATE: 0.5598623633614704
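
For reference, the AIPW estimator implemented in the next cell augments the outcome-model predictions with an inverse-propensity-weighted correction of their residuals:

\[ \widehat{ATE}_{AIPW} = \frac{1}{n}\sum_{i=1}^{n}\left[ \hat m_1(X_i) - \hat m_0(X_i) + \frac{T_i\,(Y_i - \hat m_1(X_i))}{\hat e(X_i)} - \frac{(1-T_i)\,(Y_i - \hat m_0(X_i))}{1-\hat e(X_i)} \right] \]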

# === Doubly Robust / AIPW ===

def aipw_ate(y, t, e, m0, m1):
    eps = 1e-6
    e = np.clip(e, eps, 1 - eps)
    return np.mean(
        (m1 - m0)
        + t * (y - m1) / e
        - (1 - t) * (y - m0) / (1 - e)
    )

ate_dr = aipw_ate(Y, T_test, e_hat, m0_hat, m1_hat)
print("Doubly Robust ATE:", ate_dr)

# Interpretation:
# If either the propensity model OR outcome model is correct,
# this estimate is consistent.

Doubly Robust ATE: 0.6886373812888652

Step 6 – Where Double Machine Learning fits

Double ML builds on DR ideas but adds:

  • Cross-fitting
  • Orthogonalization
  • Safe use of ML models

Conceptually:

  • DR = robustness to misspecification
  • Double ML = robustness + protection against ML overfitting

We explain this conceptually; a rough illustrative sketch follows below (a full implementation is similar but heavier).
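
As a rough illustration only (not from the chapter), the sketch below cross-fits two nuisance models and then regresses outcome residuals on treatment residuals, the "partialling-out" form of Double ML. The linear/logistic nuisance models and the 5-fold split are arbitrary choices.

# === Conceptual Double ML sketch (illustrative only) ===
# Cross-fitting: each observation's nuisance predictions come from models
# trained on the other folds. Orthogonalization: regress outcome residuals
# on treatment residuals. Nuisance models and fold count are arbitrary here.

from sklearn.model_selection import KFold

X_all = preprocess.fit_transform(df[covariates])   # reuse the same feature encoding
T_all = df["T"].values
Y_all = df["Y_spend"].values.astype(float)

y_res = np.zeros(len(df))
t_res = np.zeros(len(df))

kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fit_idx, pred_idx in kf.split(X_all):
    m_y = LinearRegression().fit(X_all[fit_idx], Y_all[fit_idx])
    m_t = LogisticRegression(max_iter=2000).fit(X_all[fit_idx], T_all[fit_idx])

    y_res[pred_idx] = Y_all[pred_idx] - m_y.predict(X_all[pred_idx])
    t_res[pred_idx] = T_all[pred_idx] - m_t.predict_proba(X_all[pred_idx])[:, 1]

# Effect of T on spend in a partially linear model
theta_dml = np.sum(t_res * y_res) / np.sum(t_res ** 2)
print("Cross-fitted partialling-out estimate:", theta_dml)

This sketch targets the constant effect in a partially linear model; the AIPW score from Step 5 is the other standard Double ML target when effects may vary across customers.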

Final Summary (what to remember)

  • Regression trusts the outcome model
  • IPW trusts the assignment mechanism
  • DR trusts either one
  • Double ML lets you use ML safely

In observational marketing data:

Doubly Robust is almost always the right baseline choice.

Written on December 25, 2025