Bayesian_campaign_decisioning_hillstrom_data

Applying Bayesian Models to Real Campaign Data

Uncertainty-aware personalization with practical decisioning

This notebook is written as a blog-style artifact for practitioners.

Its purpose is show how to:

  • apply Bayesian ideas from Chapters 1–4 in practice
  • validate the model instead of blindly trusting it
  • interpret results correctly
  • translate model output into decisioning

Business question

What is the effect of sending an email on conversion, and how should we target when some customer segments are sparse and noisy?

This notebook uses the public Hillstrom / MineThatData email campaign dataset.

#!pip install arviz
#!pip install pymc
#!pip install --upgrade arviz pymc pytensor
#!conda install -c conda-forge gxx
#!pip install pymc==5.10.0 arviz==0.16.1
#!conda activate base
#!conda install -c conda-forge m2w64-toolchain --solver=libmamba -v
import pytensor
print(pytensor.config.cxx)
"C:\Users\revan\minicondanew\Library\mingw-w64\bin\g++.EXE"
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import arviz as az
import pymc as pm

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, log_loss

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

1. Load the data

The notebook is written around the Hillstrom email campaign dataset.

If the public URL does not work in your environment, place the CSV next to the notebook as hillstrom.csv.

To keep the first blog focused, I collapse the two campaign variants into a single binary treatment:

  • treatment = 1 if the customer received any email
  • treatment = 0 if the customer received no email
DATA_URL = "http://www.minethatdata.com/Kevin_Hillstrom_MineThatData_E-MailAnalytics_DataMiningChallenge_2008.03.20.csv"
LOCAL_PATH = "hillstrom.csv"

try:
    df = pd.read_csv(DATA_URL)
    print("Loaded from URL.")
except Exception as e:
    print("URL load failed:", e)
    df = pd.read_csv(LOCAL_PATH)
    print("Loaded from local file.")

df.columns = [c.strip().lower() for c in df.columns]
print(df.shape)
df.head()

Loaded from URL.
(64000, 12)
recency history_segment history mens womens zip_code newbie channel segment visit conversion spend
0 10 2) $100 - $200 142.44 1 0 Surburban 0 Phone Womens E-Mail 0 0 0.0
1 6 3) $200 - $350 329.08 1 1 Rural 1 Web No E-Mail 0 0 0.0
2 7 2) $100 - $200 180.65 0 1 Surburban 1 Web Womens E-Mail 0 0 0.0
3 9 5) $500 - $750 675.83 1 0 Rural 1 Web Mens E-Mail 0 0 0.0
4 2 1) $0 - $100 45.34 1 0 Urban 0 Web Womens E-Mail 0 0 0.0

2. Define treatment, outcome, and pre-treatment features

I use conversion as the main outcome because:

  • Bayesian logistic regression is easy to explain
  • posterior probabilities are intuitive
  • decision thresholds are easier to justify

I restrict predictors to pre-treatment variables only. That is basic causal discipline and keeps the model interpretable.

work = df.copy()

work["treatment"] = (work["segment"] != "No E-Mail").astype(int)
work["y"] = work["conversion"].astype(int)

work["history_segment"] = work["history_segment"].astype(str).str.strip()
work["zip_code"] = work["zip_code"].astype(str).str.strip()
work["channel"] = work["channel"].astype(str).str.strip()

feature_cols = [
    "recency", "history", "mens", "womens", "newbie",
    "zip_code", "channel", "history_segment"
]

work = work[feature_cols + ["treatment", "y"]].copy()
work.head()

recency history mens womens newbie zip_code channel history_segment treatment y
0 10 142.44 1 0 0 Surburban Phone 2) $100 - $200 1 0
1 6 329.08 1 1 1 Rural Web 3) $200 - $350 0 0
2 7 180.65 0 1 1 Surburban Web 2) $100 - $200 1 0
3 9 675.83 1 0 1 Rural Web 5) $500 - $750 1 0
4 2 45.34 1 0 0 Urban Web 1) $0 - $100 1 0

3. Quick descriptive readout

This is not the final causal answer. It is just a reality check before modeling.

raw_summary = work.groupby("treatment")["y"].agg(["mean", "sum", "count"])
raw_summary["conversion_rate_pct"] = 100 * raw_summary["mean"]
raw_summary

mean sum count conversion_rate_pct
treatment
0 0.005726 122 21306 0.572609
1 0.010681 456 42694 1.068066

Interpretation

This tells me whether treated and untreated groups move in the direction I would expect.

But raw rates are not enough for a serious decision system:

  • they do not make uncertainty explicit
  • they do not stabilize small segments
  • they are weak support for action when slices are noisy

4. Classical baseline

I include a standard logistic model as a benchmark.

The point is not to say classical models are bad. The point is to show what Bayesian modeling adds.

numeric_cols = ["recency", "history", "mens", "womens", "newbie", "treatment"]
categorical_cols = ["zip_code", "channel", "history_segment"]

X = work[numeric_cols + categorical_cols]
y = work["y"]

preprocess = ColumnTransformer(
    transformers=[
        ("num", Pipeline([
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler()),
        ]), numeric_cols),
        ("cat", Pipeline([
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("ohe", OneHotEncoder(drop="first", handle_unknown="ignore")),
        ]), categorical_cols),
    ]
)

baseline = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=2000, random_state=RANDOM_STATE)),
])

baseline.fit(X, y)
baseline_proba = baseline.predict_proba(X)[:, 1]

print("Baseline log loss:", round(log_loss(y, baseline_proba), 5))
print("Baseline AUC:", round(roc_auc_score(y, baseline_proba), 5))

Baseline log loss: 0.05027
Baseline AUC: 0.64399

Interpretation

The baseline is useful, but it still leaves open the business questions I care about:

  • How uncertain is the campaign effect?
  • Which segments are promising but too noisy to trust?
  • What targeting rule would I actually deploy?

That is the gap Bayesian modeling helps fill.

5. Build a compact Bayesian modeling table

For the blog, I use a compact but realistic specification. The idea is to keep the notebook:

  • practical
  • interpretable
  • fast enough to run without turning the blog into an engineering exercise
bayes_df = work[["recency", "history", "mens", "womens", "newbie", "history_segment", "treatment", "y"]].copy()

bayes_df["group"] = bayes_df["history_segment"].astype("category")
bayes_df["group_idx"] = bayes_df["group"].cat.codes
group_names = bayes_df["group"].cat.categories.tolist()

for col in ["recency", "history"]:
    bayes_df[f"{col}_z"] = (bayes_df[col] - bayes_df[col].mean()) / bayes_df[col].std()

MAX_N = 15000
if len(bayes_df) > MAX_N:
    bayes_df = bayes_df.sample(MAX_N, random_state=RANDOM_STATE).reset_index(drop=True)

print(bayes_df.shape)
bayes_df.head()

(15000, 12)
recency history mens womens newbie history_segment treatment y group group_idx recency_z history_z
0 9 64.85 1 0 1 1) $0 - $100 1 0 1) $0 - $100 0 0.922646 -0.691898
1 1 144.12 0 1 0 2) $100 - $200 1 0 2) $100 - $200 1 -1.358121 -0.382441
2 1 734.25 0 1 1 5) $500 - $750 1 0 5) $500 - $750 4 -1.358121 1.921327
3 4 1321.26 1 0 1 7) $1,000 + 1 0 7) $1,000 + 6 -0.502833 4.212915
4 2 261.23 1 1 1 3) $200 - $350 1 0 3) $200 - $350 2 -1.073025 0.074736

6. Prior predictive thinking

Before fitting the model, I ask:

If my priors were true, what sort of conversion rates would they imply?

This is practical Bayesian modeling. Priors are assumptions, so they should generate plausible business outcomes.

with pm.Model() as prior_model:
    alpha = pm.Normal("alpha", mu=-3.0, sigma=1.0)
    beta_t = pm.Normal("beta_t", mu=0.0, sigma=0.5)
    beta_recency = pm.Normal("beta_recency", mu=0.0, sigma=0.5)
    beta_history = pm.Normal("beta_history", mu=0.0, sigma=0.5)
    beta_mens = pm.Normal("beta_mens", mu=0.0, sigma=0.5)
    beta_womens = pm.Normal("beta_womens", mu=0.0, sigma=0.5)
    beta_newbie = pm.Normal("beta_newbie", mu=0.0, sigma=0.5)

    logit_p = (
        alpha
        + beta_t * bayes_df["treatment"].values
        + beta_recency * bayes_df["recency_z"].values
        + beta_history * bayes_df["history_z"].values
        + beta_mens * bayes_df["mens"].values
        + beta_womens * bayes_df["womens"].values
        + beta_newbie * bayes_df["newbie"].values
    )

    p = pm.Deterministic("p", pm.math.sigmoid(logit_p))
    prior_pred = pm.sample_prior_predictive(samples=500, random_seed=RANDOM_STATE)

prior_p = prior_pred.prior["p"].values.reshape(-1)

plt.figure(figsize=(8, 4))
plt.hist(prior_p, bins=50)
plt.title("Prior predictive implied conversion probabilities")
plt.xlabel("Implied probability")
plt.ylabel("Count")
plt.show()

Sampling: [alpha, beta_history, beta_mens, beta_newbie, beta_recency, beta_t, beta_womens]
Fig1.Prior predictive implied conversion probabilities.

Interpretation

What I want from this check:

  • mostly low conversion probabilities
  • enough flexibility for treatment and features to matter
  • no prior-driven nonsense like huge probabilities for routine customers

This is one of the easiest ways to check if the prior-related assumptions are acceptable.

7. Bayesian logistic regression: overall campaign effect

Now I fit a pooled Bayesian logistic model.

This gives me:

  • a posterior distribution over the treatment effect
  • a natural way to ask probability-style questions
  • a cleaner bridge from model output to action
coords = {"obs_id": np.arange(len(bayes_df))}

with pm.Model(coords=coords) as pooled_model:
    treatment = pm.Data("treatment", bayes_df["treatment"].values, dims="obs_id")
    recency_z = pm.Data("recency_z", bayes_df["recency_z"].values, dims="obs_id")
    history_z = pm.Data("history_z", bayes_df["history_z"].values, dims="obs_id")
    mens = pm.Data("mens", bayes_df["mens"].values, dims="obs_id")
    womens = pm.Data("womens", bayes_df["womens"].values, dims="obs_id")
    newbie = pm.Data("newbie", bayes_df["newbie"].values, dims="obs_id")
    y_obs = pm.Data("y_obs", bayes_df["y"].values, dims="obs_id")

    alpha = pm.Normal("alpha", mu=-3.0, sigma=1.0)
    beta_t = pm.Normal("beta_t", mu=0.0, sigma=0.5)
    beta_recency = pm.Normal("beta_recency", mu=0.0, sigma=0.5)
    beta_history = pm.Normal("beta_history", mu=0.0, sigma=0.5)
    beta_mens = pm.Normal("beta_mens", mu=0.0, sigma=0.5)
    beta_womens = pm.Normal("beta_womens", mu=0.0, sigma=0.5)
    beta_newbie = pm.Normal("beta_newbie", mu=0.0, sigma=0.5)

    logit_p = (
        alpha
        + beta_t * treatment
        + beta_recency * recency_z
        + beta_history * history_z
        + beta_mens * mens
        + beta_womens * womens
        + beta_newbie * newbie
    )

    p = pm.Deterministic("p", pm.math.sigmoid(logit_p), dims="obs_id")
    outcome = pm.Bernoulli("outcome", p=p, observed=y_obs, dims="obs_id")

    idata_pooled = pm.sample(
        draws=1000,
        tune=1500, #1000
        chains=4,
        target_accept=0.95, #0.90
        max_treedepth=12, #None
        random_seed=RANDOM_STATE,
        return_inferencedata=True,
    )

Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha, beta_t, beta_recency, beta_history, beta_mens, beta_womens, beta_newbie]



Output()


Sampling 4 chains for 1_500 tune and 1_000 draw iterations (6_000 + 4_000 draws total) took 284 seconds.
az.summary(
    idata_pooled,
    var_names=["alpha", "beta_t", "beta_recency", "beta_history", "beta_mens", "beta_womens", "beta_newbie"],
    round_to=3,
)

mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
alpha -5.264 0.276 -5.796 -4.766 0.007 0.004 1459.918 2271.160 1.004
beta_t 0.444 0.177 0.111 0.765 0.003 0.003 3096.992 2597.386 1.002
beta_recency -0.162 0.086 -0.337 -0.006 0.001 0.001 3491.311 2748.742 1.001
beta_history 0.177 0.074 0.041 0.318 0.001 0.001 2695.236 2395.130 1.001
beta_mens 0.476 0.210 0.093 0.876 0.005 0.003 1799.914 2728.155 1.002
beta_womens 0.396 0.213 0.008 0.811 0.005 0.003 1970.892 2451.781 1.003
beta_newbie -0.424 0.163 -0.763 -0.136 0.003 0.003 3121.234 2414.047 1.001
beta_t_samples = idata_pooled.posterior["beta_t"].values.reshape(-1)
prob_positive = (beta_t_samples > 0).mean()
print(f"Posterior P(email effect > 0 on log-odds scale): {prob_positive:.3f}")

Posterior P(email effect > 0 on log-odds scale): 0.996

Interpretation

This is where the Bayesian model becomes genuinely useful for campaign decisioning.

A classical workflow might stop at a coefficient and maybe an interval.

Here I can ask:

What is the probability that the campaign effect is actually positive?

That is much closer to how targeting decisions are made in practice.

8. Posterior versus posterior predictive

This distinction matters.

  • Posterior = uncertainty about parameters
  • Posterior predictive = uncertainty about future outcomes

In practice, the second object is what feeds prediction and action.

with pooled_model:
    post_pred_pooled = pm.sample_posterior_predictive(
        idata_pooled,
        var_names=["outcome", "p"],
        random_seed=RANDOM_STATE,
    )

idata_pooled.extend(post_pred_pooled)

Sampling: [outcome]



Output()


pp_y = idata_pooled.posterior_predictive["outcome"].values
pp_rate = pp_y.mean(axis=(0, 2))
observed_rate = bayes_df["y"].mean()

plt.figure(figsize=(8, 4))
plt.hist(pp_rate, bins=40)
plt.axvline(observed_rate, linestyle="--")
plt.title("Posterior predictive check: overall conversion rate")
plt.xlabel("Predicted conversion rate from posterior predictive draws")
plt.ylabel("Count")
plt.show()

print("Observed conversion rate:", round(observed_rate, 5))

Fig2.Posterior predictive check: overall conversion rate.
Observed conversion rate: 0.00987

Interpretation

This is one of the most practical validation checks in the notebook.

I am asking:

Can the fitted model regenerate data with broad characteristics similar to what I actually observed?

If not, I should not trust its decision recommendations. Since observed coversion rate lies comfortably in the range shown in the histogram, this result passes the check.

9. Hierarchical modeling for personalization

A single global treatment effect is useful, but personalization systems usually need more.

They ask:

  • which groups respond more?
  • which groups are too noisy to trust?
  • where should we target more aggressively?

A naive slice-by-slice estimate can be unstable. A hierarchical model improves this through partial pooling and shrinkage.

For interpretability, I model treatment-effect heterogeneity only across history (prior-spend) segments while controlling for other customer attributes through shared fixed effects. This allows the Bayesian hierarchy to capture meaningful business segmentation without exploding the number of sparse subgroup combinations.

naive_seg = (
    bayes_df.groupby(["group", "treatment"])["y"]
    .agg(["mean", "count", "sum"])
    .reset_index()
)
#raw lift is the difference in mean conversion rates of treatment/control customers  within each segment 
pivot_rate = naive_seg.pivot(index="group", columns="treatment", values="mean")
pivot_n = naive_seg.pivot(index="group", columns="treatment", values="count")
naive_lift = (pivot_rate[1] - pivot_rate[0]).rename("raw_lift")
naive_lift = pd.concat([naive_lift, pivot_n.add_prefix("n_treat_")], axis=1)
naive_lift.sort_values("raw_lift", ascending=False)

raw_lift n_treat_0 n_treat_1
group
4) $350 - $500 0.011695 523 988
5) $500 - $750 0.009376 395 766
3) $200 - $350 0.007777 942 1913
6) $750 - $1,000 0.007264 160 296
1) $0 - $100 0.002492 1824 3638
2) $100 - $200 0.002194 1074 2185
7) $1,000 + -0.003850 87 209
coords_h = {
    "obs_id": np.arange(len(bayes_df)),
    "group": group_names,
}

with pm.Model(coords=coords_h) as hierarchical_model:
    treatment = pm.Data("treatment", bayes_df["treatment"].values, dims="obs_id")
    recency_z = pm.Data("recency_z", bayes_df["recency_z"].values, dims="obs_id")
    history_z = pm.Data("history_z", bayes_df["history_z"].values, dims="obs_id")
    mens = pm.Data("mens", bayes_df["mens"].values, dims="obs_id")
    womens = pm.Data("womens", bayes_df["womens"].values, dims="obs_id")
    newbie = pm.Data("newbie", bayes_df["newbie"].values, dims="obs_id")
    group_idx = pm.Data("group_idx", bayes_df["group_idx"].values, dims="obs_id")
    y_obs = pm.Data("y_obs", bayes_df["y"].values, dims="obs_id")

    alpha = pm.Normal("alpha", mu=-3.0, sigma=1.0) #+> baseline probability is low, roughly in single digit % sigma(-3)~0.047
    beta_recency = pm.Normal("beta_recency", mu=0.0, sigma=0.5)
    beta_history = pm.Normal("beta_history", mu=0.0, sigma=0.5)
    beta_mens = pm.Normal("beta_mens", mu=0.0, sigma=0.5)
    beta_womens = pm.Normal("beta_womens", mu=0.0, sigma=0.5)
    beta_newbie = pm.Normal("beta_newbie", mu=0.0, sigma=0.5)

   # mu_t = pm.Normal("mu_t", mu=0.0, sigma=0.5)
   # sigma_t = pm.HalfNormal("sigma_t", sigma=0.5)
   # beta_t_group = pm.Normal("beta_t_group", mu=mu_t, sigma=sigma_t, dims="group")
   #I use a non-centered hierarchical parameterization to improve Bayesian sampler stability.
    #This reformulation preserves the same statistical model while reducing divergences and improving convergence in sparse-segment settings.

    mu_t = pm.Normal("mu_t", mu=0.0, sigma=0.5)
    sigma_t = pm.HalfNormal("sigma_t", sigma=0.5)
    z_t = pm.Normal("z_t", mu=0.0, sigma=1.0, dims="group")
    beta_t_group = pm.Deterministic("beta_t_group", mu_t + z_t * sigma_t, dims="group")

    logit_p = (
        alpha
        + beta_t_group[group_idx] * treatment
        + beta_recency * recency_z
        + beta_history * history_z
        + beta_mens * mens
        + beta_womens * womens
        + beta_newbie * newbie
    )

    p = pm.Deterministic("p", pm.math.sigmoid(logit_p), dims="obs_id")
    outcome = pm.Bernoulli("outcome", p=p, observed=y_obs, dims="obs_id")

    idata_h = pm.sample(
        draws=1000,
        tune=1000,
        chains=4,
        target_accept=0.92,
        random_seed=RANDOM_STATE,
        return_inferencedata=True,
    )

Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [alpha, beta_recency, beta_history, beta_mens, beta_womens, beta_newbie, mu_t, sigma_t, z_t]



Output()


Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 299 seconds.

To stabilize sampling in the hierarchical treatment-effect model, I use a non-centered parameterization for segment-level treatment effects.

Instead of sampling each segment treatment effect directly from:

β_t_group ~ Normal(μ_t, σ_t)

I reparameterize as:

z_t ~ Normal(0, 1)

β_t_group = μ_t + z_t × σ_t

This mathematically represents the same hierarchical model, but often samples much more efficiently in Bayesian hierarchical settings.

Why this matters:

  • Direct centered parameterization can create strong posterior correlations between group effects and hierarchy parameters.
  • That often leads to divergences, poor effective sample size, and slow NUTS convergence.
  • Non-centered parameterization reduces this geometry problem and improves sampler stability.

This is especially useful when:

  • Some groups are sparse
  • Hierarchical variance is small
  • Partial pooling is strong
az.summary(idata_h, var_names=["mu_t", "sigma_t", "beta_t_group"], round_to=3)
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
mu_t 0.427 0.194 0.084 0.812 0.003 0.003 3285.879 2711.547 1.003
sigma_t 0.182 0.142 0.000 0.437 0.003 0.002 1682.274 1786.548 1.000
beta_t_group[1) $0 - $100] 0.439 0.203 0.057 0.825 0.003 0.003 4646.861 3169.858 1.001
beta_t_group[2) $100 - $200] 0.353 0.227 -0.112 0.745 0.004 0.003 3561.391 2925.678 1.002
beta_t_group[3) $200 - $350] 0.419 0.212 0.002 0.802 0.003 0.003 4833.690 3483.099 1.000
beta_t_group[4) $350 - $500] 0.560 0.238 0.123 1.003 0.004 0.003 3055.097 3139.759 1.001
beta_t_group[5) $500 - $750] 0.500 0.240 0.052 0.953 0.004 0.003 3675.060 3178.447 1.001
beta_t_group[6) $750 - $1,000] 0.400 0.266 -0.107 0.895 0.004 0.004 3962.511 3423.523 1.002
beta_t_group[7) $1,000 +] 0.397 0.277 -0.148 0.902 0.004 0.005 3771.996 3328.389 1.001

R-hat ≈ 1.000–1.003 Excellent Effective Sample Size ESS mostly 3000–4800

posterior samples are now reliable enough to interpret as

R-hat < 1.01 ESS > 400 (100 × number_of_chains) Divergences = 0

beta_group = idata_h.posterior["beta_t_group"].stack(sample=("chain", "draw")).values

hier = pd.DataFrame({
    "group": group_names,
    "post_mean_logodds_effect": beta_group.mean(axis=1),
    "post_prob_positive": (beta_group > 0).mean(axis=1),
    "post_p10": np.quantile(beta_group, 0.10, axis=1),
    "post_p90": np.quantile(beta_group, 0.90, axis=1),
})

comparison = naive_lift.reset_index().merge(hier, on="group", how="left")
comparison.sort_values("post_prob_positive", ascending=False)

group raw_lift n_treat_0 n_treat_1 post_mean_logodds_effect post_prob_positive post_p10 post_p90
3 4) $350 - $500 0.011695 523 988 0.560170 0.99675 0.265967 0.884051
4 5) $500 - $750 0.009376 395 766 0.500078 0.98700 0.204615 0.808252
0 1) $0 - $100 0.002492 1824 3638 0.439379 0.98375 0.184999 0.697553
2 3) $200 - $350 0.007777 942 1913 0.418892 0.97775 0.149250 0.681799
5 6) $750 - $1,000 0.007264 160 296 0.399655 0.93500 0.070966 0.726050
6 7) $1,000 + -0.003850 87 209 0.397161 0.92925 0.059925 0.724186
1 2) $100 - $200 0.002194 1074 2185 0.353053 0.92825 0.053629 0.628934
  • The hierarchical Bayesian model identified statistically credible heterogeneity in treatment response across customer value segments. Segment-level treatment effects were partially pooled toward the population mean, reducing noise relative to naïve segment estimates while preserving meaningful differences where supported by data.

Interpretation

This is where the approach becomes especially useful for personalization.

The raw segment effects are easy to compute but often too noisy to trust. The hierarchical model provides:

  • stable segment-level estimates
  • a direct probability that each segment benefits
  • much less temptation to chase random extremes from thin data

The posterior probability of positive treatment effect exceeds 92% across all customer history segments, indicating broad evidence that email campaigns improve conversion. However, the magnitude of expected benefit varies by segment, suggesting that while blanket suppression is not warranted, prioritization by expected incremental value remains appropriate.

10. Turning model output into decisions

A model is only useful if it can support action.

Below is a simple policy-style interpretation:

  • high posterior probability of benefit -> target confidently
  • moderate posterior probability -> promising, but monitor or test more
  • weak posterior probability -> de-prioritize for now
plot_df = comparison.sort_values("post_prob_positive", ascending=True)

plt.figure(figsize=(10, 6))
plt.barh(plot_df["group"].astype(str), plot_df["post_prob_positive"])
plt.axvline(0.8, linestyle="--")
plt.title("Posterior probability that email effect is positive, by segment")
plt.xlim(0.90, 1.0)
plt.xlabel("P(effect > 0)")
plt.ylabel("History segment")
for i, v in enumerate(plot_df["post_prob_positive"]):
    plt.text(v + 0.002, i, f"{v:.3f}", va="center")
plt.show()

Fig3.Posterior probability that email effect is positive, by segment.
decision_df = comparison[["group", "post_prob_positive", "post_mean_logodds_effect", "post_p10", "post_p90"]].copy()

decision_df["recommended_action"] = np.select(
    [
        decision_df["post_prob_positive"] >= 0.90,
        decision_df["post_prob_positive"].between(0.70, 0.90, inclusive="left"),
    ],
    [
        "Target confidently",
        "Promising but monitor / test more",
    ],
    default="Weak evidence: deprioritize for now",
)

decision_df.sort_values("post_prob_positive", ascending=False)

group post_prob_positive post_mean_logodds_effect post_p10 post_p90 recommended_action
3 4) $350 - $500 0.99675 0.560170 0.265967 0.884051 Target confidently
4 5) $500 - $750 0.98700 0.500078 0.204615 0.808252 Target confidently
0 1) $0 - $100 0.98375 0.439379 0.184999 0.697553 Target confidently
2 3) $200 - $350 0.97775 0.418892 0.149250 0.681799 Target confidently
5 6) $750 - $1,000 0.93500 0.399655 0.070966 0.726050 Target confidently
6 7) $1,000 + 0.92925 0.397161 0.059925 0.724186 Target confidently
1 2) $100 - $200 0.92825 0.353053 0.053629 0.628934 Target confidently

Interpretation A naïve segment-level uplift table would have suggested suppressing email to the highest-value customers (1000$+ group) because observed raw lift was negative in that sparse segment as shown in a previous table. However, the hierarchical Bayesian model identified that the apparent negative lift was likely noise caused by limited sample size. After partial pooling, posterior probability of positive treatment effect for that segment remained above 92%. This demonstrates how Bayesian hierarchical models prevent overreaction to noisy segment-level estimates in personalization systems. This is the practical payoff.

The model is not just producing parameters. It is supporting a targeting rule that explicitly accounts for uncertainty.

That is one of the biggest reasons Bayesian models are attractive in campaign analytics and personalization systems.

effect_df = comparison.sort_values("post_mean_logodds_effect", ascending=True).copy()

plt.figure(figsize=(10, 6))
plt.errorbar(
    x=effect_df["post_mean_logodds_effect"],
    y=effect_df["group"].astype(str),
    xerr=[
        effect_df["post_mean_logodds_effect"] - effect_df["post_p10"],
        effect_df["post_p90"] - effect_df["post_mean_logodds_effect"]
    ],
    fmt="o"
)

plt.axvline(0, linestyle="--")
plt.title("Posterior treatment effect by segment on log-odds scale")
plt.xlabel("Posterior treatment effect (log-odds scale)")
plt.ylabel("History segment")
plt.show()
Fig4.Posterior treatment effect by segment on log-odds scale.

These effects are shown on the log-odds scale from the Bayesian logistic model, so they are useful for relative segment ranking but not yet a direct measure of business lift. I use two complementary views of segment-level treatment effects. The first plot shows the posterior probability that email has a positive effect for each segment, which is useful for suppression or go/no-go decisions. The second plot shows posterior mean treatment effect with uncertainty intervals, which is more useful for prioritizing segments by expected incremental impact.

import numpy as np
import pandas as pd

# Pull posterior draws
posterior = idata_h.posterior

alpha_s = posterior["alpha"].values.reshape(-1)
beta_recency_s = posterior["beta_recency"].values.reshape(-1)
beta_history_s = posterior["beta_history"].values.reshape(-1)
beta_mens_s = posterior["beta_mens"].values.reshape(-1)
beta_womens_s = posterior["beta_womens"].values.reshape(-1)
beta_newbie_s = posterior["beta_newbie"].values.reshape(-1)

# shape: (n_groups, n_samples)
beta_t_group_s = posterior["beta_t_group"].stack(sample=("chain", "draw")).values

# Build one row per segment using average feature values inside that segment.
segment_profiles = (
    bayes_df.groupby("group")
    .agg(
        recency_z=("recency_z", "mean"),
        history_z=("history_z", "mean"),
        mens=("mens", "mean"),
        womens=("womens", "mean"),
        newbie=("newbie", "mean"),
        n=("y", "size"),
        observed_conversion=("y", "mean"),
    )
    .reset_index()
)

rows = []

for g_idx, row in segment_profiles.iterrows():
    # Linear predictor under control
    eta_control = (
        alpha_s
        + beta_recency_s * row["recency_z"]
        + beta_history_s * row["history_z"]
        + beta_mens_s * row["mens"]
        + beta_womens_s * row["womens"]
        + beta_newbie_s * row["newbie"]
    )

    # Add treatment effect for treated scenario
    eta_treat = eta_control + beta_t_group_s[g_idx, :]

    # Convert log-odds to probabilities
    p_control = 1 / (1 + np.exp(-eta_control))
    p_treat = 1 / (1 + np.exp(-eta_treat))
    uplift = p_treat - p_control

    rows.append({
        "group": row["group"],
        "n": row["n"],
        "observed_conversion": row["observed_conversion"],
        "post_control_mean": p_control.mean(),
        "post_treat_mean": p_treat.mean(),
        "post_uplift_mean": uplift.mean(),
        "post_uplift_p10": np.quantile(uplift, 0.10),
        "post_uplift_p90": np.quantile(uplift, 0.90),
        "prob_uplift_positive": (uplift > 0).mean(),
    })

uplift_df = pd.DataFrame(rows).sort_values("post_uplift_mean", ascending=False)
uplift_df
group n observed_conversion post_control_mean post_treat_mean post_uplift_mean post_uplift_p10 post_uplift_p90 prob_uplift_positive
6 7) $1,000 + 296 0.020270 0.015221 0.022415 0.007193 0.000990 0.013722 0.92925
3 4) $350 - $500 1511 0.017207 0.009198 0.016049 0.006850 0.003115 0.011291 0.99675
4 5) $500 - $750 1161 0.013781 0.008203 0.013473 0.005271 0.002057 0.008803 0.98700
5 6) $750 - $1,000 456 0.010965 0.010519 0.015694 0.005176 0.000820 0.009574 0.93500
2 3) $200 - $350 2855 0.009457 0.007654 0.011575 0.003922 0.001338 0.006472 0.97775
0 1) $0 - $100 5462 0.008239 0.005577 0.008583 0.003006 0.001281 0.004754 0.98375
1 2) $100 - $200 3259 0.007057 0.006177 0.008776 0.002600 0.000363 0.004645 0.92825
plot_df = uplift_df.sort_values("post_treat_mean", ascending=True)

y_pos = np.arange(len(plot_df))

plt.figure(figsize=(10, 6))
plt.barh(y_pos - 0.2, plot_df["post_control_mean"], height=0.4, label="Control")
plt.barh(y_pos + 0.2, plot_df["post_treat_mean"], height=0.4, label="Treatment")

plt.yticks(y_pos, plot_df["group"].astype(str))
plt.xlabel("Posterior mean conversion probability")
plt.ylabel("History segment")
plt.title("Posterior predicted conversion by segment")
plt.legend()
plt.show()
Fig5.Posterior predicted conversion by segment.

Rather than ranking segments purely on treatment effect coefficient size or raw observed lift, I convert posterior segment effects into expected conversion under treatment and control. This makes the output directly useful for decisioning: it tells me not only whether a segment is likely to benefit, but also how much incremental conversion I should expect if I target it.

While all segments show positive expected uplift, the model suggests prioritizing the highest-history and mid-high-history customers first, with the $1,000+ segment exhibiting the largest expected incremental conversion despite relatively sparse data. This ranking differs materially from naïve raw-lift and coefficient-based ranking, highlighting the importance of converting posterior treatment effects into expected business uplift before decisioning.

plot_df = uplift_df.sort_values("post_uplift_mean", ascending=True).copy()

plt.figure(figsize=(10, 6))
plt.errorbar(
x=plot_df["post_uplift_mean"],
y=plot_df["group"].astype(str),
xerr=[
plot_df["post_uplift_mean"] - plot_df["post_uplift_p10"],
plot_df["post_uplift_p90"] - plot_df["post_uplift_mean"]
],
fmt="o"
)

for i, row in plot_df.reset_index(drop=True).iterrows():
    plt.text(
    row["post_uplift_mean"] + 0.0001,
    i,
    f"{row['post_uplift_mean']:.4f}",
    va="center"
    )

plt.axvline(0, linestyle="--")
plt.title("Posterior expected uplift by segment")
plt.xlabel("Expected uplift in conversion probability")
plt.ylabel("History segment")
plt.show()
Fig6.Posterior expected uplift by segment.

While all segments show positive expected uplift, the model suggests prioritizing the highest-history and mid-high-history customers first, with the $1,000+ segment exhibiting the largest expected incremental conversion despite relatively sparse data. This ranking differs materially from naïve raw-lift and coefficient-based ranking, highlighting the importance of converting posterior treatment effects into expected business uplift before decisioning.

This plot translates posterior treatment effects into expected incremental conversion by segment, which is more directly useful for campaign prioritization than coefficient estimates on the log-odds scale. The point estimate shows expected uplift, while the interval reflects posterior uncertainty.

11. Final takeaways

This post shows how Bayesian Causal modeling can be applied in a real campaign decisioning setting-not just to estimate treatment effects, but to make better targeting decisions under uncertainty. Specifically, I:

  • estimated heterogeneous email treatment effects across customer spend segments
  • used hierarchical partial pooling to stabilize sparse segment estimates
  • validated convergence and sampling diagnostics before trusting the posterior
  • translated posterior treatment effects from log-odds into expected conversion uplift
  • ranked segments using expected incremental conversion rather than naïve observed lift
  • quantified uncertainty using posterior uplift intervals and probability uplift is positive

Practical Lesson:

Bayesian uplift models are most valuable when segment-level treatment effects are noisy, sparse, or high stakes enough that uncertainty should influence targeting decisions—not just point estimates


Written on April 12, 2026