Double Machine Learning Finds Segments. Bayesian Decides Which Ones to Trust

From uplift ranking to uncertainty-aware decisions on the Criteo dataset

This notebook is the follow-up to my Bayesian campaign decisioning post built on the Hillstrom dataset.

The Hillstrom post showed that Bayesian hierarchical models can be useful when:

  • segment-level treatment effects are noisy
  • uncertainty should influence targeting decisions
  • raw uplift can be misleading in sparse segments

A natural next question is:

How does a Bayesian uplift workflow compare to a modern causal ML approach like Double Machine Learning (DML)?

This notebook answers that question on the Criteo uplift dataset.

The goal is not to force a winner.

The goal is to understand what each approach is actually good at.


What this notebook tries to demonstrate

From the DML side

  • debiased treatment-effect estimation with flexible nuisance models
  • practical individualized uplift estimation at scale
  • strong default choice when the main objective is causal estimation

From the Bayesian side

  • explicit uncertainty quantification
  • hierarchical shrinkage for noisy personalization slices
  • posterior predictive validation
  • more natural decision support under uncertainty

Core business question

If I want to personalize treatment decisions, when is DML enough, and when does a Bayesian uplift model give me something meaningfully better?


Practical framing

I do not treat these methods as ideological alternatives.

A more mature question is:

  • When do I need robust causal estimation?
  • When do I need stable, uncertainty-aware targeting decisions?

That is the focus in this notebook.

import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import arviz as az
import pymc as pm

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# EconML is used for the DML section.
# If it is not installed in your environment, uncomment the next line:
# !pip install econml

from econml.dml import LinearDML
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

1. Load the Criteo uplift dataset

This notebook assumes you already have the Criteo uplift dataset available locally.

Download the data from http://go.criteo.net/criteo-research-uplift-v2.1.csv.gz and set the file path below.

Expected structure

The notebook expects:

  • one binary treatment column
  • one binary outcome column
  • all remaining columns to be pre-treatment covariates

Because local copies of the dataset can differ slightly in naming, I add a small helper block to standardize column names.

If your file uses different names, adjust the mapping in the next cell.

# ------------------------------------------------------------------
# Set your local path here
# ------------------------------------------------------------------
DATA_PATH = "data/criteo-uplift-v2.1.csv.gz"


df = pd.read_csv(DATA_PATH, compression="gzip")

# Normalize column names for easier handling
df.columns = [str(c).strip().lower() for c in df.columns]

print("Shape:", df.shape)
print("Columns:", df.columns.tolist()[:20], "...")
df.head()

Shape: (13979592, 16)
Columns: ['f0', 'f1', 'f2', 'f3', 'f4', 'f5', 'f6', 'f7', 'f8', 'f9', 'f10', 'f11', 'treatment', 'conversion', 'visit', 'exposure'] ...
f0 f1 f2 f3 f4 f5 f6 f7 f8 f9 f10 f11 treatment conversion visit exposure
0 12.616365 10.059654 8.976429 4.679882 10.280525 4.115453 0.294443 4.833815 3.955396 13.190056 5.300375 -0.168679 1 0 0 0
1 12.616365 10.059654 9.002689 4.679882 10.280525 4.115453 0.294443 4.833815 3.955396 13.190056 5.300375 -0.168679 1 0 0 0
2 12.616365 10.059654 8.964775 4.679882 10.280525 4.115453 0.294443 4.833815 3.955396 13.190056 5.300375 -0.168679 1 0 0 0
3 12.616365 10.059654 9.002801 4.679882 10.280525 4.115453 0.294443 4.833815 3.955396 13.190056 5.300375 -0.168679 1 0 0 0
4 12.616365 10.059654 9.037999 4.679882 10.280525 4.115453 0.294443 4.833815 3.955396 13.190056 5.300375 -0.168679 1 0 0 0
print(df["treatment"].value_counts(normalize=True))
print(df["conversion"].value_counts(normalize=True))
treatment
1    0.85
0    0.15
Name: proportion, dtype: float64
conversion
0    0.997083
1    0.002917
Name: proportion, dtype: float64

2. Standardize treatment / outcome column names

The exact Criteo file format may vary depending on source or preprocessing. This cell tries to map common naming conventions into:

  • treatment
  • conversion

If your dataset already uses these names, no change is needed. If not, edit the mapping list manually.

# Common alternative names seen across uplift datasets / user-preprocessed copies.
# Note: "visit" is an outcome column in the Criteo data, so it is deliberately
# excluded from the treatment candidates.
treatment_candidates = ["treatment", "exposure", "t", "group"]
outcome_candidates = ["conversion", "outcome", "y", "label", "response"]

def first_existing(candidates, cols):
    for c in candidates:
        if c in cols:
            return c
    return None

t_col = first_existing(treatment_candidates, df.columns)
y_col = first_existing(outcome_candidates, df.columns)

if t_col is None or y_col is None:
    raise ValueError(
        "Could not automatically identify treatment/outcome columns. "
        "Please set t_col and y_col manually."
    )

work = df.copy()
work = work.rename(columns={t_col: "treatment", y_col: "conversion"})

work["treatment"] = work["treatment"].astype(int)
work["conversion"] = work["conversion"].astype(int)

print("Treatment column used:", t_col)
print("Outcome column used:", y_col)
work[["treatment", "conversion"]].head()

Treatment column used: treatment
Outcome column used: conversion
treatment conversion
0 1 0
1 1 0
2 1 0
3 1 0
4 1 0

3. Basic data audit

Before comparing methods, I want a basic feel for the treatment and outcome balance.

This is not the final answer. It is just a quick reality check.

print("Rows:", len(work))
print("Treatment rate:", round(work["treatment"].mean(), 5))
print("Conversion rate:", round(work["conversion"].mean(), 5))

raw_summary = work.groupby("treatment")["conversion"].agg(["mean", "sum", "count"])
raw_summary["conversion_rate_pct"] = 100 * raw_summary["mean"]
raw_summary

Rows: 13979592
Treatment rate: 0.85
Conversion rate: 0.00292
mean sum count conversion_rate_pct
treatment
0 0.001938 4063 2096937 0.193759
1 0.003089 36711 11882655 0.308946

Interpretation

This quick view tells me:

  • whether treatment assignment is roughly balanced
  • whether conversion is a rare event
  • whether raw treatment-control differences are directionally sensible

But raw differences are still not enough for a serious personalization system.

I still need to know:

  • how much uplift varies by user or segment
  • how noisy those estimates are
  • whether the model supports confident action

4. Split features from treatment and outcome

Everything except treatment and outcome is treated as pre-treatment covariate input.

This notebook assumes the remaining columns are all valid pre-treatment features. If your local file contains IDs or post-treatment leakage variables, remove them here.

# Subsample for a faster run, at a slight cost in estimate quality
work = work.sample(150000, random_state=RANDOM_STATE).reset_index(drop=True)

exclude_cols = ["treatment", "conversion"]
feature_cols = [c for c in work.columns if c not in exclude_cols]

num_cols = [c for c in feature_cols if pd.api.types.is_numeric_dtype(work[c])]
cat_cols = [c for c in feature_cols if c not in num_cols]

print("Numeric feature count:", len(num_cols))
print("Categorical feature count:", len(cat_cols))

X = work[feature_cols].copy()
T = work["treatment"].astype(int).values
Y = work["conversion"].astype(float).values
Numeric feature count: 14
Categorical feature count: 0

5. Common preprocessing pipeline

To compare methods fairly, I want both workflows to use the same cleaned feature matrix.

This is a practical pipeline:

  • median imputation + scaling for numeric features
  • mode imputation + one-hot encoding for categorical features
preprocess = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler()),
        ]), num_cols),
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("ohe", OneHotEncoder(handle_unknown="ignore")),
        ]), cat_cols),
    ]
)

X_proc = preprocess.fit_transform(X)
X_proc = X_proc.toarray() if hasattr(X_proc, "toarray") else np.asarray(X_proc)

print("Processed feature matrix shape:", X_proc.shape)

Processed feature matrix shape: (150000, 14)

6. Double Machine Learning baseline

Double Machine Learning is attractive because it separates the problem into:

  1. Model the outcome using covariates
  2. Model the treatment assignment using covariates
  3. Residualize / orthogonalize
  4. Estimate treatment effects on the debiased residual structure

That makes it a strong choice when:

  • feature relationships are nonlinear
  • confounding structure is rich
  • the main goal is treatment-effect estimation rather than posterior uncertainty
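To make the residualization idea concrete, here is a minimal scikit-learn-only sketch of the partialling-out steps on synthetic data. This is illustrative only, not the notebook's pipeline; the actual analysis below uses EconML's LinearDML.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 3))
T = (X[:, 0] + rng.normal(size=n) > 0).astype(float)  # confounded treatment
tau = 0.5                                             # true constant effect
Y = tau * T + X[:, 0] + rng.normal(size=n)

# Steps 1-2: cross-fitted nuisance predictions for E[Y | X] and E[T | X]
y_hat = cross_val_predict(
    RandomForestRegressor(n_estimators=50, random_state=0), X, Y, cv=3
)
t_hat = cross_val_predict(
    RandomForestRegressor(n_estimators=50, random_state=0), X, T, cv=3
)

# Steps 3-4: regress outcome residuals on treatment residuals;
# the slope is the debiased treatment-effect estimate
res_y, res_t = Y - y_hat, T - t_hat
theta = LinearRegression().fit(res_t.reshape(-1, 1), res_y).coef_[0]
print(round(theta, 2))  # should land near the true effect of 0.5
```

Despite the confounding between X and T, the residual-on-residual regression recovers something close to the true effect, which is the whole point of the orthogonalization.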
dml = LinearDML(
    # EconML treats the outcome nuisance as a conditional expectation
    # E[Y | X, T], not a classification problem, so a regressor is used
    # even though the outcome is binary.
    model_y=RandomForestRegressor(
        n_estimators=100,
        min_samples_leaf=100,
        random_state=RANDOM_STATE,
        n_jobs=-1,
    ),
    model_t=RandomForestClassifier(
        n_estimators=100,
        min_samples_leaf=100,
        random_state=RANDOM_STATE,
        n_jobs=-1,
    ),
    discrete_treatment=True,  # treatment is binary, so model_t is fit as a propensity classifier
    cv=3,
    random_state=RANDOM_STATE,
)

dml.fit(Y, T, X=X_proc)

dml_ate = dml.ate(X_proc)
print("DML estimated ATE:", dml_ate)

DML estimated ATE: 0.017506617518227914

Interpretation

The DML ATE gives me a useful first read on whether treatment is helping on average.

That is a good starting point, but it is still only a population-level answer.

For personalization, the real question is not just:

Is the treatment helpful overall?

It is:

Where is the uplift concentrated, and how much should I trust the ranking?

That is why the rest of the notebook moves from a single DML average effect to DML deciles and then to a Bayesian validation layer.

7. Individualized DML uplift estimates

To make DML more relevant to personalization, I compute user-level conditional treatment effect estimates.

These are not posterior distributions. They are point estimates of individualized uplift.

work["dml_cate"] = dml.effect(X_proc)

work["dml_decile"] = pd.qcut(
    work["dml_cate"].rank(method="first"),
    10,
    labels=False
)

work[["dml_cate", "dml_decile"]].head()

dml_cate dml_decile
0 0.002424 9
1 -0.000142 4
2 -0.000153 4
3 -0.000073 5
4 0.000033 5
dml_decile_summary = (
    work.groupby("dml_decile")
    .agg(
        mean_dml_cate=("dml_cate", "mean"),
        conversion=("conversion", "mean"),
        treatment_rate=("treatment", "mean"),
        n=("conversion", "size"),
    )
    .reset_index()
    .sort_values("mean_dml_cate", ascending=False)
)

dml_decile_summary

dml_decile mean_dml_cate conversion treatment_rate n
9 9 0.177048 0.022667 0.894333 15000
8 8 0.001615 0.001067 0.843400 15000
7 7 0.000763 0.000867 0.847067 15000
6 6 0.000279 0.000067 0.843933 15000
5 5 -0.000015 0.000133 0.850000 15000
4 4 -0.000167 0.000067 0.842333 15000
3 3 -0.000296 0.000000 0.846267 15000
2 2 -0.000488 0.000200 0.847733 15000
1 1 -0.000702 0.000267 0.846067 15000
0 0 -0.002971 0.006000 0.835333 15000

Interpretation

At this point DML has done the part it is best at: it has turned the raw feature space into an individualized uplift score and a practical ranking.

That ranking is useful operationally because it tells me who appears most promising according to the causal ML model.

But it still leaves an uncomfortable gap:

  • the ranking is a point-estimate ranking
  • it does not tell me which deciles are genuinely robust
  • it does not tell me how much noise may be sitting inside the top or bottom buckets

That is exactly the gap the Bayesian layer is meant to address.

8. Build a Bayesian comparison dataset

For the Bayesian model, I intentionally do not try to fit a massive full-feature model over every processed covariate.

That would make the notebook heavier and distract from the actual comparison.

Instead, I use a practical compromise:

  • DML already summarized heterogeneous uplift into user ranking
  • I use DML uplift deciles as the grouping structure
  • Bayesian hierarchy then asks whether those ranked segments still look good after shrinkage and uncertainty modeling

This makes the comparison easier to interpret:

  • DML gives a ranking
  • Bayesian tests and stabilizes that ranking
bayes_df = work[["treatment", "conversion", "dml_decile"]].copy()
bayes_df["group"] = bayes_df["dml_decile"].astype(int)
bayes_df["group_idx"] = bayes_df["group"]

# Cap the Bayesian model input at 60K rows to keep sampling tractable
MAX_N = 60000
if len(bayes_df) > MAX_N:
    bayes_df = bayes_df.sample(MAX_N, random_state=RANDOM_STATE).reset_index(drop=True)

group_names = sorted(bayes_df["group"].unique().tolist())
coords = {
    "obs_id": np.arange(len(bayes_df)),
    "group": group_names,
}

print(bayes_df.shape)
bayes_df.head()

(60000, 5)
treatment conversion dml_decile group group_idx
0 1 0 5 5 5
1 1 0 3 3 3
2 1 0 0 0 0
3 1 0 8 8 8
4 1 0 0 0 0

9. Prior predictive check

Before fitting the Bayesian model, I want to know:

If my priors were true, would they generate plausible baseline conversion behavior?

If the priors imply absurd conversion behavior, the model is already in trouble before the data has had any chance to update it.
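As a quick back-of-the-envelope version of this check, the baseline prior alone (alpha ~ Normal(-4, 1), the same prior used in the model below) can be pushed through the sigmoid to see what baseline conversion rates it implies:

```python
# What does the alpha ~ Normal(-4, 1) prior imply for baseline conversion?
import numpy as np

rng = np.random.default_rng(42)
alpha_draws = rng.normal(-4.0, 1.0, size=10_000)
p_base = 1 / (1 + np.exp(-alpha_draws))  # sigmoid: log-odds -> probability

# 5th / 50th / 95th percentiles of the implied baseline conversion rate
print(np.percentile(p_base, [5, 50, 95]).round(4))
```

The median implied baseline sits below 2% conversion with most mass well under 10%, which is a reasonable starting region for a rare-event outcome like this one.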

with pm.Model(coords=coords) as prior_model:

    treatment = pm.Data("treatment", bayes_df["treatment"].values, dims="obs_id")
    group_idx = pm.Data("group_idx", bayes_df["group_idx"].values, dims="obs_id")

    alpha = pm.Normal("alpha", mu=-4.0, sigma=1.0)

    mu_t = pm.Normal("mu_t", mu=0.0, sigma=0.5)
    sigma_t = pm.HalfNormal("sigma_t", sigma=0.5)

    z_t = pm.Normal("z_t", mu=0.0, sigma=1.0, dims="group")
    beta_t_group = pm.Deterministic(
        "beta_t_group",
        mu_t + z_t * sigma_t,
        dims="group"
    )

    logit_p = alpha + beta_t_group[group_idx] * treatment
    p = pm.Deterministic("p", pm.math.sigmoid(logit_p), dims="obs_id")

    prior_pred = pm.sample_prior_predictive(draws=500, random_seed=RANDOM_STATE)

Sampling: [alpha, mu_t, sigma_t, z_t]
prior_mean_conv = prior_pred.prior["p"].mean(dim="obs_id").values.flatten()

plt.figure(figsize=(8, 4))
plt.hist(prior_mean_conv, bins=40)
plt.title("Prior predictive distribution of average conversion")
plt.xlabel("Implied average conversion rate")
plt.ylabel("Count")
plt.show()

Fig1. Prior predictive distribution of average conversion.

Interpretation

This prior check is just making sure the Bayesian layer starts from a sane place.

Given how low conversion is in this dataset, I want the prior to imply low baseline conversion and moderate room for treatment heterogeneity — not wild conversion rates that only look good on paper.

So this step is less about excitement and more about discipline:

if the prior already implies unrealistic behavior, any later posterior story becomes much harder to trust.

10. Bayesian hierarchical uplift model

Now I fit the Bayesian model.

Key design choice:

  • treatment effect varies by DML uplift decile
  • those decile-level treatment effects are partially pooled through a shared hierarchy

This lets me test whether the DML ranking survives a more uncertainty-aware, shrinkage-based framework.
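The shrinkage mechanics behind partial pooling can be illustrated with a toy precision-weighting calculation. The numbers here are made up for illustration, not taken from the fitted model:

```python
# Toy illustration of partial pooling: noisier group estimates are pulled
# harder toward the population mean (numbers are hypothetical).
import numpy as np

mu_t = 0.0                                # population mean treatment effect
sigma_t = 0.5                             # between-group spread
group_est = np.array([2.0, -1.5, 0.3])    # noisy per-group estimates
group_se = np.array([1.0, 0.2, 0.05])     # their standard errors

# Precision-weighted compromise between each group estimate and the population mean
w = (1 / group_se**2) / (1 / group_se**2 + 1 / sigma_t**2)
shrunk = w * group_est + (1 - w) * mu_t
print(shrunk.round(3))  # the first (noisiest) estimate shrinks the most
```

The first group's extreme estimate (2.0) is shrunk aggressively because its standard error is large, while the precisely measured third group barely moves. That is exactly the behavior I want from the decile-level treatment effects here.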

with pm.Model(coords=coords) as bayes_model:

    treatment = pm.Data("treatment", bayes_df["treatment"].values, dims="obs_id")
    group_idx = pm.Data("group_idx", bayes_df["group_idx"].values, dims="obs_id")
    y_obs = pm.Data("y_obs", bayes_df["conversion"].values, dims="obs_id")

    alpha = pm.Normal("alpha", mu=-4.0, sigma=1.0)

    mu_t = pm.Normal("mu_t", mu=0.0, sigma=0.5)
    sigma_t = pm.HalfNormal("sigma_t", sigma=0.5)

    z_t = pm.Normal("z_t", mu=0.0, sigma=1.0, dims="group")
    beta_t_group = pm.Deterministic(
        "beta_t_group",
        mu_t + z_t * sigma_t,
        dims="group"
    )

    logit_p = alpha + beta_t_group[group_idx] * treatment
    p = pm.Deterministic("p", pm.math.sigmoid(logit_p), dims="obs_id")

    outcome = pm.Bernoulli("outcome", p=p, observed=y_obs, dims="obs_id")

    idata = pm.sample(
        draws=1000,
        tune=1500,
        chains=4,
        target_accept=0.95,
        random_seed=RANDOM_STATE,
        return_inferencedata=True,
    )

Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha, mu_t, sigma_t, z_t]





Sampling 4 chains for 1_500 tune and 1_000 draw iterations (6_000 + 4_000 draws total) took 2400 seconds.

11. Bayesian diagnostic validation

Before interpreting anything, I want to confirm that the posterior fit is trustworthy.

diag_summary = az.summary(idata, var_names=["alpha", "mu_t", "sigma_t", "beta_t_group"], round_to=3)
diag_summary

mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
alpha -6.544 0.257 -7.028 -6.065 0.005 0.004 3334.553 2405.224 1.001
mu_t -0.480 0.375 -1.155 0.235 0.008 0.005 2110.148 2628.621 1.001
sigma_t 1.451 0.242 1.046 1.925 0.006 0.004 1816.531 2525.441 1.001
beta_t_group[0] 1.460 0.310 0.904 2.059 0.005 0.005 3797.359 2902.477 1.001
beta_t_group[1] -1.747 0.781 -3.270 -0.394 0.011 0.012 4832.212 3234.974 1.000
beta_t_group[2] -1.732 0.771 -3.119 -0.260 0.013 0.013 3986.051 2344.377 1.000
beta_t_group[3] -2.450 1.001 -4.282 -0.604 0.017 0.018 4025.831 2886.824 1.001
beta_t_group[4] -2.422 0.992 -4.357 -0.765 0.017 0.017 3571.068 3002.445 1.001
beta_t_group[5] -1.303 0.675 -2.608 -0.084 0.011 0.011 3677.438 2901.654 1.001
beta_t_group[6] -2.412 0.970 -4.351 -0.759 0.015 0.017 4477.024 2565.410 1.000
beta_t_group[7] -0.480 0.507 -1.454 0.479 0.008 0.007 4605.365 3588.117 1.001
beta_t_group[8] -0.951 0.615 -2.129 0.124 0.009 0.010 4611.971 3007.763 1.000
beta_t_group[9] 2.878 0.270 2.368 3.370 0.005 0.004 3523.564 2858.652 1.001

Interpretation

Before trusting any Bayesian segment result, I need the diagnostics to look clean.

What I want here is straightforward:

  • no divergences
  • R-hat essentially at 1
  • effective sample size comfortably large

Since those conditions hold, I can treat the posterior segment summaries as real signal rather than sampling noise created by a poorly behaving chain.

12. Posterior predictive check

The next step is not just to inspect coefficients. I also want to know whether the fitted model can reproduce aggregate conversion behavior in the observed data.

with bayes_model:
    ppc = pm.sample_posterior_predictive(idata, var_names=["outcome", "p"], random_seed=RANDOM_STATE)

idata.extend(ppc)

Sampling: [outcome]





pp_y = idata.posterior_predictive["outcome"].values
pp_rate = pp_y.mean(axis=(0, 2))
observed_rate = bayes_df["conversion"].mean()

plt.figure(figsize=(8, 4))
plt.hist(pp_rate, bins=40)
plt.axvline(observed_rate, linestyle="--")
plt.title("Posterior predictive check: average conversion rate")
plt.xlabel("Posterior predictive conversion rate")
plt.ylabel("Count")
plt.show()

print("Observed average conversion:", round(observed_rate, 6))

Fig2. Posterior predictive check: average conversion rate.
Observed average conversion: 0.003217

Interpretation

This posterior predictive check is deliberately modest.

It is not claiming the Bayesian layer reproduces every aspect of the Criteo data. It is checking something simpler and still important:

Can the fitted model reproduce the overall conversion level in the sample it was trained on?

If the observed rate sits inside the posterior predictive distribution, that is a good aggregate-level calibration sign. It does not prove the model is perfect, but it does tell me the Bayesian layer is not obviously disconnected from the basic conversion behavior in the data.

13. Convert Bayesian treatment effects into business-space uplift

The Bayesian model estimates treatment effects on the log-odds scale.

That is fine statistically, but not what a business audience wants.

For decisioning, I want:

  • predicted conversion under control
  • predicted conversion under treatment
  • expected uplift in conversion probability

Coefficient scale is model-space; uplift is business-space.
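As a small worked example of this conversion, using round values close to the fitted posterior means for the strongest decile (roughly alpha around -6.5 and a treatment effect around 2.9, per the diagnostic summary above):

```python
# Log-odds -> probability conversion for a single segment (rounded example values).
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

alpha = -6.5   # baseline log-odds
beta = 2.9     # decile-level treatment effect on the log-odds scale

p_control = sigmoid(alpha)          # conversion probability under control
p_treat = sigmoid(alpha + beta)     # conversion probability under treatment
uplift = p_treat - p_control        # uplift in probability space

print(round(p_control, 4), round(p_treat, 4), round(uplift, 4))
```

A log-odds effect of 2.9 sounds enormous, but on this rare-event baseline it translates to roughly a 2.5 percentage point uplift, which is the number a business audience can actually act on.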

posterior = idata.posterior

alpha_s = posterior["alpha"].values.reshape(-1)
beta_t_group_s = posterior["beta_t_group"].stack(sample=("chain", "draw")).values

rows = []

segment_profiles = (
    bayes_df.groupby("group")
    .agg(
        n=("conversion", "size"),
        observed_conversion=("conversion", "mean"),
        treated_rate=("treatment", "mean"),
    )
    .reset_index()
)

for g_idx, row in segment_profiles.iterrows():
    eta_control = alpha_s
    eta_treat = alpha_s + beta_t_group_s[g_idx, :]

    p_control = 1 / (1 + np.exp(-eta_control))
    p_treat = 1 / (1 + np.exp(-eta_treat))
    uplift = p_treat - p_control

    rows.append({
        "group": row["group"],
        "n": row["n"],
        "observed_conversion": row["observed_conversion"],
        "post_control_mean": p_control.mean(),
        "post_treat_mean": p_treat.mean(),
        "post_uplift_mean": uplift.mean(),
        "post_uplift_p10": np.quantile(uplift, 0.10),
        "post_uplift_p90": np.quantile(uplift, 0.90),
        "prob_uplift_positive": (uplift > 0).mean(),
    })

uplift_df = pd.DataFrame(rows).sort_values("post_uplift_mean", ascending=False)
uplift_df

group n observed_conversion post_control_mean post_treat_mean post_uplift_mean post_uplift_p10 post_uplift_p90 prob_uplift_positive
9 9.0 6040.0 0.022682 0.001484 0.025020 0.023536 0.020897 0.026238 1.00000
0 0.0 5917.0 0.007267 0.001484 0.006257 0.004773 0.003332 0.006328 1.00000
7 7.0 6019.0 0.000997 0.001484 0.000977 -0.000507 -0.001178 0.000199 0.17075
8 8.0 5994.0 0.000501 0.001484 0.000642 -0.000841 -0.001477 -0.000219 0.04575
5 5.0 6072.0 0.000329 0.001484 0.000469 -0.001015 -0.001586 -0.000446 0.01600
2 2.0 5957.0 0.000168 0.001484 0.000323 -0.001160 -0.001711 -0.000623 0.00375
1 1.0 6054.0 0.000165 0.001484 0.000323 -0.001161 -0.001723 -0.000643 0.00550
6 6.0 5910.0 0.000000 0.001484 0.000190 -0.001294 -0.001827 -0.000804 0.00150
4 4.0 6059.0 0.000000 0.001484 0.000189 -0.001295 -0.001827 -0.000800 0.00150
3 3.0 5978.0 0.000000 0.001484 0.000185 -0.001299 -0.001825 -0.000812 0.00125

Interpretation

This table is where the Bayesian results become decision-friendly.

A few things stand out immediately from these outputs:

  • Decile 9 is the clearest win. Its posterior expected uplift is about 0.0235, with a 100% posterior probability that uplift is positive. That is a very strong confirmation that the top DML bucket contains real treatment signal.
  • Decile 0 is the big surprise. DML gave it a strongly negative mean CATE, but the Bayesian layer estimates a positive expected uplift of about 0.0048 with 100% probability uplift is positive.
  • Most of the remaining deciles are not just uncertain — they are actually negative on posterior mean uplift, and their posterior probability of positive uplift is extremely low.
  • Decile 7 is the only mild gray area: it is close to zero and still has only about 17% probability of positive uplift, which is far too weak to act on confidently.

So the Bayesian layer is not broadly validating the full DML ranking. It is saying something more specific:

the very top DML decile looks real, the very bottom decile deserves a second look, and most of the middle ranking is not convincing once uncertainty is taken seriously.

One technical note matters here: the model has a single shared intercept and group-varying treatment effects only, with no group-level baseline term, so the post_control_mean is identical across deciles by construction. The main objects to interpret are post_uplift_mean and prob_uplift_positive, not the identical control means.

14. Visualize Bayesian expected uplift by segment

This plot is one of the clearest decisioning views in the notebook.

It shows:

  • point estimate of expected uplift
  • posterior uncertainty interval
  • which DML-defined deciles still look good after Bayesian shrinkage
plot_df = uplift_df.sort_values("post_uplift_mean", ascending=True).copy()

plt.figure(figsize=(10, 6))
plt.errorbar(
    x=plot_df["post_uplift_mean"],
    y=plot_df["group"].astype(str),
    xerr=[
        plot_df["post_uplift_mean"] - plot_df["post_uplift_p10"],
        plot_df["post_uplift_p90"] - plot_df["post_uplift_mean"]
    ],
    fmt="o"
)

for i, row in plot_df.reset_index(drop=True).iterrows():
    plt.text(
        row["post_uplift_mean"] + 0.00002,
        i,
        f"{row['post_uplift_mean']:.4f}",
        va="center"
    )

plt.axvline(0, linestyle="--")
plt.title("Bayesian posterior expected uplift by DML decile")
plt.xlabel("Expected uplift in conversion probability")
plt.ylabel("DML uplift decile")
plt.show()

Fig3. Bayesian posterior expected uplift by DML decile.

15. Visualize Bayesian confidence that uplift is positive

This plot answers a different question.

The previous uplift plot was about magnitude.

This plot is about confidence:

How likely is it that treatment helps at all in this decile?

plot_df = uplift_df.sort_values("prob_uplift_positive", ascending=True).copy()

plt.figure(figsize=(10, 6))
plt.barh(plot_df["group"].astype(str), plot_df["prob_uplift_positive"])
plt.axvline(0.8, linestyle="--")
plt.xlim(max(0.0, plot_df["prob_uplift_positive"].min() - 0.02), 1.0)

for i, v in enumerate(plot_df["prob_uplift_positive"]):
    plt.text(v + 0.002, i, f"{v:.3f}", va="center")

plt.title("Bayesian posterior probability uplift is positive")
plt.xlabel("P(uplift > 0)")
plt.ylabel("DML uplift decile")
plt.show()

Fig4. Bayesian posterior probability uplift is positive.

Interpretation

These two Bayesian plots make the story much clearer than the raw DML ranking alone.

The uplift plot shows that only decile 9 is far to the right with a clearly positive interval. Decile 0 is also positive, but much smaller. Everything else is clustered near zero or below it.

The probability plot sharpens that further:

  • Decile 9: posterior probability uplift is positive = 1.000
  • Decile 0: posterior probability uplift is positive = 1.000
  • Decile 7: only about 0.171
  • all remaining deciles: effectively near zero confidence

That means the Bayesian layer is doing exactly what I wanted it to do. It is forcing me to separate:

  • segments that look strong enough to act on
  • segments that only looked interesting as point estimates
  • segments that likely contain little real treatment signal

Put differently, DML gave me a broad ranking; Bayesian turned that into a much narrower set of segments I would actually be comfortable discussing in a deployment conversation.

16. Compare DML ranking and Bayesian decisioning

This is the key practical section.

DML gives:

  • a personalized uplift ranking

Bayesian gives:

  • stabilized expected uplift
  • uncertainty intervals
  • probability uplift is positive

That means the two methods are not really competing on the exact same output. They are serving different parts of the decision workflow.

comparison_df = (
    dml_decile_summary[["dml_decile", "mean_dml_cate", "n"]]
    .rename(columns={"dml_decile": "group"})
    .merge(
        uplift_df[["group", "post_uplift_mean", "prob_uplift_positive", "post_control_mean", "post_treat_mean"]],
        on="group",
        how="left",
    )
    .sort_values("mean_dml_cate", ascending=False)
)

comparison_df

group mean_dml_cate n post_uplift_mean prob_uplift_positive post_control_mean post_treat_mean
0 9 0.177048 15000 0.023536 1.00000 0.001484 0.025020
1 8 0.001615 15000 -0.000841 0.04575 0.001484 0.000642
2 7 0.000763 15000 -0.000507 0.17075 0.001484 0.000977
3 6 0.000279 15000 -0.001294 0.00150 0.001484 0.000190
4 5 -0.000015 15000 -0.001015 0.01600 0.001484 0.000469
5 4 -0.000167 15000 -0.001295 0.00150 0.001484 0.000189
6 3 -0.000296 15000 -0.001299 0.00125 0.001484 0.000185
7 2 -0.000488 15000 -0.001160 0.00375 0.001484 0.000323
8 1 -0.000702 15000 -0.001161 0.00550 0.001484 0.000323
9 0 -0.002971 15000 0.004773 1.00000 0.001484 0.006257

Interpretation

This comparison table answers the main practical question of the notebook:

Does the DML ranking hold up once I impose shrinkage and uncertainty-aware validation?

The answer here is: only partially.

What is clearly validated:

  • Decile 9 remains the strongest segment by a wide margin on both DML ranking and Bayesian uplift. This is the cleanest case where the two methods agree.

What is not validated:

  • Decile 8 ranked second by DML, but the Bayesian layer gives it negative posterior mean uplift and only about 4.6% probability of positive uplift.
  • Decile 7 also ranked highly in DML, but Bayesian confidence is still weak at roughly 17%.
  • Several lower and middle deciles look actively unattractive after Bayesian validation.

What is unexpected:

  • Decile 0 flips the story. DML treats it as the worst bucket, but the Bayesian layer sees a meaningfully positive uplift with very high confidence in the sampled data.

That is exactly the kind of result that justifies this hybrid workflow. If I had stopped at DML, I would likely have treated the full ranking as trustworthy. The Bayesian layer shows that this would have been too optimistic.

The practical takeaway is not that DML is wrong. It is that:

DML ranking by itself is not enough to decide deployment confidence.

17. Optional decision rule

The Bayesian results naturally lead to a three-bucket decision framework.

1. Target confidently

Use this when:

  • expected uplift is clearly positive
  • the posterior interval stays on the positive side (both p10 > 0 and p90 > 0)
  • P(uplift > 0) is very high

In these results, decile 9 clearly belongs here.
Both DML and Bayesian modeling agree that it is the strongest segment, and the Bayesian layer removes any real ambiguity about whether the uplift is positive.

2. Investigate before acting

Use this when:

  • the Bayesian result disagrees materially with the DML ranking
  • the segment could matter, but the disagreement is too large to ignore

In these results, decile 0 belongs here.

DML ranked decile 0 as the worst segment, but the Bayesian layer estimates positive uplift with very high confidence. That is not the kind of disagreement I would brush aside.

I would not immediately deploy against decile 0 at scale, but I also would not discard it. Instead, I would treat it as a high-priority diagnostic segment:

  • inspect raw treatment vs control conversion inside the bucket
  • check whether the segment contains substructure that DML compressed too aggressively
  • validate with a focused follow-up test if the business stakes justify it

3. Deprioritize

Use this when:

  • posterior mean uplift is near zero or negative
  • posterior probability of positive uplift is weak

That is what I would do with most of the remaining deciles in this notebook. Even if some looked interesting in DML, the Bayesian layer does not give me enough confidence to act on them aggressively.

This is where the hybrid workflow becomes useful:

  • DML helps surface candidate segments
  • Bayesian modeling helps decide which of those candidates deserve real attention
decision_df = comparison_df.copy()

decision_df["recommended_action"] = np.select(
    [
        (decision_df["prob_uplift_positive"] >= 0.90) & (decision_df["post_uplift_mean"] > 0),
        (decision_df["prob_uplift_positive"] >= 0.70) & (decision_df["post_uplift_mean"] > 0),
    ],
    [
        "Target confidently",
        "Promising but monitor / test",
    ],
    default="Weak evidence / low priority",
)

decision_df.sort_values(["prob_uplift_positive", "post_uplift_mean"], ascending=False)

| group | mean_dml_cate | n | post_uplift_mean | prob_uplift_positive | post_control_mean | post_treat_mean | recommended_action |
|---|---|---|---|---|---|---|---|
| 9 | 0.177048 | 15000 | 0.023536 | 1.00000 | 0.001484 | 0.025020 | Target confidently |
| 0 | -0.002971 | 15000 | 0.004773 | 1.00000 | 0.001484 | 0.006257 | Target confidently |
| 7 | 0.000763 | 15000 | -0.000507 | 0.17075 | 0.001484 | 0.000977 | Weak evidence / low priority |
| 8 | 0.001615 | 15000 | -0.000841 | 0.04575 | 0.001484 | 0.000642 | Weak evidence / low priority |
| 5 | -0.000015 | 15000 | -0.001015 | 0.01600 | 0.001484 | 0.000469 | Weak evidence / low priority |
| 1 | -0.000702 | 15000 | -0.001161 | 0.00550 | 0.001484 | 0.000323 | Weak evidence / low priority |
| 2 | -0.000488 | 15000 | -0.001160 | 0.00375 | 0.001484 | 0.000323 | Weak evidence / low priority |
| 6 | 0.000279 | 15000 | -0.001294 | 0.00150 | 0.001484 | 0.000190 | Weak evidence / low priority |
| 4 | -0.000167 | 15000 | -0.001295 | 0.00150 | 0.001484 | 0.000189 | Weak evidence / low priority |
| 3 | -0.000296 | 15000 | -0.001299 | 0.00125 | 0.001484 | 0.000185 | Weak evidence / low priority |

Note that the simple threshold rule labels decile 0 "Target confidently" because it only looks at the posterior. The disagreement with the DML ranking discussed above is exactly why I would override that label by hand and treat decile 0 as an investigation case instead.
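That override can also be encoded directly in the rule. A hedged sketch, assuming the `decision_df` columns from above (the 0.90/0.10 confidence cutoffs are illustrative, not tuned):

```python
# Sketch: add an explicit "Investigate" tier that fires when DML and
# the Bayesian layer disagree on the sign of the uplift while the
# posterior is confident in either direction. Toy rows mirror the
# notebook's deciles 9, 0, and 7.
import numpy as np
import pandas as pd

decision_df = pd.DataFrame({
    "group": [9, 0, 7],
    "mean_dml_cate": [0.177048, -0.002971, 0.000763],
    "post_uplift_mean": [0.023536, 0.004773, -0.000507],
    "prob_uplift_positive": [1.0, 1.0, 0.17075],
})

signs_disagree = np.sign(decision_df["mean_dml_cate"]) != np.sign(decision_df["post_uplift_mean"])
posterior_confident = (
    (decision_df["prob_uplift_positive"] >= 0.90)
    | (decision_df["prob_uplift_positive"] <= 0.10)
)

decision_df["recommended_action"] = np.select(
    [
        signs_disagree & posterior_confident,  # checked first, so it overrides targeting
        (decision_df["prob_uplift_positive"] >= 0.90) & (decision_df["post_uplift_mean"] > 0),
    ],
    ["Investigate: DML and Bayes disagree", "Target confidently"],
    default="Weak evidence / low priority",
)
print(decision_df[["group", "recommended_action"]])
```

With these toy rows, decile 9 still lands in "Target confidently", decile 0 is routed to the investigate tier, and decile 7 stays low priority, matching the action map in the narrative.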

18. Final practical comparison

What DML did well in this notebook

DML did the heavy lifting in the original feature space. It turned the raw Criteo covariates into an individualized uplift score and a first-pass ranking that is actually usable.

That matters. Without DML, I would not have a practical way to sort a large feature space into candidate treatment segments this efficiently.

What the Bayesian layer added

The Bayesian layer answered the question DML alone could not answer:

Which parts of that ranking still look credible once uncertainty and partial pooling are taken seriously?

That extra layer changed the story in a meaningful way.

  • It strongly confirmed decile 9
  • It rejected most of the rest of the ranking as weak, negative, or too uncertain
  • It surfaced a major disagreement in decile 0, where DML and Bayesian modeling point in opposite directions

That last point is especially important. If I had stopped at the DML ranking, I would likely have treated decile 0 as safely ignorable. The Bayesian layer says that would have been too casual.

How I would use the two methods together

If I were operationalizing this workflow, I would use the methods in sequence rather than forcing them into a winner-take-all comparison.

  • DML is the discovery engine: it searches the rich covariate space and produces candidate uplift structure
  • Bayesian modeling is the credibility filter: it tells me which parts of that structure survive uncertainty-aware validation

In this notebook, that leads to a simple action map:

  • Decile 9 → target confidently
  • Decile 0 → investigate because the methods disagree
  • Most other deciles → deprioritize

19. Final takeaways

This notebook ends up making a more useful point than a simple “Bayesian vs DML” contest.

The two methods are solving different parts of the same business problem.

What the results say here

  • Decile 9 is the cleanest success case: DML ranked it highest, and the Bayesian layer strongly confirmed that it has real positive uplift.
  • The rest of the DML ranking is much less compelling once uncertainty is introduced.
  • Decile 8 and decile 7 looked attractive on the DML side, but Bayesian validation does not support confident action there.
  • Decile 0 is the most interesting segment in the notebook: DML ranked it worst, while Bayesian modeling estimates positive uplift with near certainty in the sampled data.

That means the Bayesian layer is not just “adding uncertainty” in the abstract. It is materially changing the deployment conversation.

Instead of treating the DML ranking as a finished answer, I now have a more realistic segmentation:

  1. Deploy with confidence → decile 9
  2. Investigate because the disagreement is too important to ignore → decile 0
  3. Treat as weak / low-priority for now → most other deciles

My practical conclusion

I would not deploy the raw DML ranking as-is.

I would use:

  • DML to generate candidate uplift segments
  • Bayesian validation to determine which of those segments are credible, which are weak, and which deserve deeper investigation

That is the main lesson from this notebook:

point estimates are useful for discovery, but uncertainty is what makes a ranking deployable.

Written on April 17, 2026