Email Subject Line Experiment
Randomized Experiments and Stats Review
This notebook applies concepts from Facure Chapter 2 (Randomized Experiments and Stats Review) to a real(ish) email campaign dataset.
The data for this notebook can be downloaded from https://www.kaggle.com/datasets/aristotelisch/playground-mock-email-campaign?resource=download. We assume the three CSV files sit in a chapter2_data folder next to this notebook:
- sent_emails.csv — when and to whom each email was sent (includes Customer_ID, SubjectLine_ID)
- responded.csv — which customers responded (includes Customer_ID)
- userbase.csv — customer-level attributes (not strictly needed for Chapter 2, but available)
We will:
- Build a binary outcome responded (1 if the customer appears in responded.csv, else 0).
- Define a control group and two treatments:
  - Control: SubjectLine_ID = 1
  - Treatment A: SubjectLine_ID = 2 vs control
  - Treatment B: SubjectLine_ID = 3 vs control
- For each treatment vs control comparison, compute:
- Difference in mean response rates (treatment effect)
- Standard error (SE)
- 95% confidence interval
- t-statistic and p-value
- Use Facure’s sample-size rule of thumb, n ≈ 16 * sigma^2 / delta^2, for 95% significance and 80% power.
- Interpret the results in business / marketing terms for each comparison.
1. Load and Merge the Datasets
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
pd.set_option("display.precision", 4)
# Load the three CSVs
sent = pd.read_csv("./chapter2_data/sent_emails.csv")
resp = pd.read_csv("./chapter2_data/responded.csv")
users = pd.read_csv("./chapter2_data/userbase.csv")
# Mark everyone in 'responded' as responded = 1
resp['responded'] = 1
# Collapse in case some customers responded multiple times
resp_flag = resp.groupby('Customer_ID', as_index=False)['responded'].max()
# Merge sent + response flags (left join keeps all sent emails)
df = sent.merge(resp_flag, on='Customer_ID', how='left')
# Customers not in responded.csv get responded = 0
df['responded'] = df['responded'].fillna(0).astype(int)
df.head()
| | Sent_Date | Customer_ID | SubjectLine_ID | responded |
|---|---|---|---|---|
| 0 | 2016-01-28 | 1413 | 2 | 1 |
| 1 | 2016-03-02 | 83889 | 2 | 1 |
| 2 | 2016-03-09 | 457832 | 3 | 0 |
| 3 | 2016-01-20 | 127772 | 1 | 0 |
| 4 | 2016-02-03 | 192123 | 3 | 1 |
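One thing worth verifying after a left join like this is that the merge did not duplicate any sent rows (it should not, because resp_flag has exactly one row per Customer_ID) and that the outcome really is binary. A minimal check:
# The left join should preserve the number of sent rows, because resp_flag
# has exactly one row per Customer_ID.
assert len(df) == len(sent)

# After fillna + astype, the outcome should be strictly 0/1.
assert set(df['responded'].unique()) <= {0, 1}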
Quick sanity checks
print("Rows in sent_emails:", len(sent))
print("Unique customers in sent_emails:", sent['Customer_ID'].nunique())
print("Rows in responded:", len(resp))
print("Unique customers in responded:", resp['Customer_ID'].nunique())
print("\nResponse rate overall:")
print(df['responded'].mean())
print("\nDistribution of SubjectLine_ID:")
print(df['SubjectLine_ID'].value_counts())
Rows in sent_emails: 2476354
Unique customers in sent_emails: 496518
Rows in responded: 378208
Unique customers in responded: 264859
Response rate overall:
0.5998322533854207
Distribution of SubjectLine_ID:
SubjectLine_ID
1 826717
2 824837
3 824800
Name: count, dtype: int64
2. Define Control and Treatment Groups
We follow the Chapter 2 structure:
- Control group: customers who received SubjectLine_ID = 1
- Treatment A: customers who received SubjectLine_ID = 2
- Treatment B: customers who received SubjectLine_ID = 3
We will treat each comparison separately:
- SubjectLine 2 vs SubjectLine 1
- SubjectLine 3 vs SubjectLine 1
This is equivalent to running two two-arm experiments that share the same control group (very similar to no_email vs short and no_email vs long in Facure’s cross-sell example).
control = df[df['SubjectLine_ID'] == 1].copy()
treat2 = df[df['SubjectLine_ID'] == 2].copy()
treat3 = df[df['SubjectLine_ID'] == 3].copy()
print("Sample sizes:")
print("Control (1): ", len(control))
print("Treat 2: ", len(treat2))
print("Treat 3: ", len(treat3))
print("\nResponse rates by group:")
group_rates = df.groupby('SubjectLine_ID')['responded'].mean()
print(group_rates)
Sample sizes:
Control (1): 826717
Treat 2: 824837
Treat 3: 824800
Response rates by group:
SubjectLine_ID
1 0.6021
2 0.6025
3 0.5949
Name: responded, dtype: float64
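Because customers were (supposedly) randomly assigned to subject lines, any pre-treatment attribute in userbase.csv should look roughly the same across the three groups. Below is an optional balance check — a sketch that assumes userbase.csv contains numeric customer attributes keyed by Customer_ID:
# Optional: check covariate balance across subject lines.
# Randomization should make pre-treatment attributes similar in all groups.
df_users = df.merge(users, on='Customer_ID', how='left')

numeric_cols = (
    df_users.select_dtypes(include='number')
    .columns.drop(['Customer_ID', 'SubjectLine_ID', 'responded'], errors='ignore')
)

# Group means of each numeric attribute; large gaps between groups would
# suggest that assignment was not actually random.
df_users.groupby('SubjectLine_ID')[list(numeric_cols)].mean()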
3. Difference in Means, SE, CI, t-Statistic, p-Value
We now create a small helper function that, given a treatment group and a control group, computes:
- Mean response in treatment and control
- Difference in means (delta)
- Standard error of the difference
- 95% confidence interval for delta
- t-statistic under H0: delta = 0
- Two-sided p-value
This mirrors the logic in Chapter 2: a simple difference in means test between treatment and control in a randomized experiment.
def diff_in_means_analysis(treat, control, outcome_col='responded'):
    """Compute difference in means, SE, 95% CI, t-statistic and p-value."""
    mu_t = treat[outcome_col].mean()
    mu_c = control[outcome_col].mean()
    delta = mu_t - mu_c

    # Standard error for the difference of two independent means
    se = np.sqrt(
        treat[outcome_col].var(ddof=1) / len(treat) +
        control[outcome_col].var(ddof=1) / len(control)
    )

    ci_low = delta - 1.96 * se
    ci_high = delta + 1.96 * se

    # t-statistic for H0: delta = 0; with samples this large the normal
    # approximation (stats.norm) is fine for the p-value
    t_stat = delta / se if se > 0 else np.nan
    p_value = 2 * (1 - stats.norm.cdf(abs(t_stat))) if se > 0 else np.nan

    return {
        'mu_t': mu_t,
        'mu_c': mu_c,
        'delta': delta,
        'se': se,
        'ci_low': ci_low,
        'ci_high': ci_high,
        't_stat': t_stat,
        'p_value': p_value
    }
res_2_vs_1 = diff_in_means_analysis(treat2, control)
res_3_vs_1 = diff_in_means_analysis(treat3, control)
res_2_vs_1, res_3_vs_1
({'mu_t': np.float64(0.6025457150928002),
'mu_c': np.float64(0.602054874884634),
'delta': np.float64(0.0004908402081661434),
'se': np.float64(0.000761672396676801),
'ci_low': np.float64(-0.0010020376893203867),
'ci_high': np.float64(0.0019837181056526734),
't_stat': np.float64(0.6444243093325865),
'p_value': np.float64(0.5193003251701691)},
{'mu_t': np.float64(0.5948908826382153),
'mu_c': np.float64(0.602054874884634),
'delta': np.float64(-0.007163992246418727),
'se': np.float64(0.0007628828501613492),
'ci_low': np.float64(-0.008659242632734971),
'ci_high': np.float64(-0.005668741860102482),
't_stat': np.float64(-9.390684618095095),
'p_value': np.float64(0.0)})
3.1. Summary table of numerical results
summary = pd.DataFrame.from_dict({
'2_vs_1': res_2_vs_1,
'3_vs_1': res_3_vs_1
}, orient='index')
summary
| | mu_t | mu_c | delta | se | ci_low | ci_high | t_stat | p_value |
|---|---|---|---|---|---|---|---|---|
| 2_vs_1 | 0.6025 | 0.6021 | 0.0005 | 0.0008 | -0.0010 | 0.0020 | 0.6444 | 0.5193 |
| 3_vs_1 | 0.5949 | 0.6021 | -0.0072 | 0.0008 | -0.0087 | -0.0057 | -9.3907 | 0.0000 |
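Because the outcome is binary, each comparison is also a two-sample test of proportions. As an optional cross-check (assuming statsmodels is installed), proportions_ztest gives essentially the same answer; note that it pools the variance under the null hypothesis, so the z-statistic can differ slightly from the unpooled SE used above:
# Optional cross-check with a two-sample proportions z-test (pooled variance)
from statsmodels.stats.proportion import proportions_ztest

for name, treat in [('2_vs_1', treat2), ('3_vs_1', treat3)]:
    successes = np.array([treat['responded'].sum(), control['responded'].sum()])
    nobs = np.array([len(treat), len(control)])
    z, p = proportions_ztest(successes, nobs)
    print(f"{name}: z = {z:.3f}, p = {p:.4f}")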
4. Interpretation: SubjectLine 2 vs 1
r = res_2_vs_1
print("SubjectLine 2 vs 1:\n")
print(f"Control mean (1): {r['mu_c']:.4f}")
print(f"Treatment mean (2): {r['mu_t']:.4f}")
print(f"Difference (delta): {r['delta']:.4f}")
print(f"Standard error (SE): {r['se']:.6f}")
print(f"95% CI for delta: [{r['ci_low']:.4f}, {r['ci_high']:.4f}]")
print(f"t-statistic: {r['t_stat']:.3f}")
print(f"p-value: {r['p_value']:.4f}")
SubjectLine 2 vs 1:
Control mean (1): 0.6021
Treatment mean (2): 0.6025
Difference (delta): 0.0005
Standard error (SE): 0.000762
95% CI for delta: [-0.0010, 0.0020]
t-statistic: 0.644
p-value: 0.5193
How to read this (2 vs 1)
- Difference (delta) tells you how much higher the response rate is for SubjectLine 2 compared to SubjectLine 1 in absolute terms (for example, a value of 0.012 means a 1.2 percentage point uplift).
- The 95% confidence interval shows the range of plausible values for the true treatment effect, given this sample.
- If the CI includes 0, then we cannot rule out the possibility of no true effect.
- The t-statistic measures how many standard errors away from 0 the observed delta is.
- The p-value is the probability of seeing a difference at least as extreme as this one, if the true delta were actually 0 (no effect).
Practical rule of thumb (aligned with Chapter 2):
- If p_value < 0.05, we say the effect is statistically significant at the 5% level.
- If p_value >= 0.05, we say “we do not have enough evidence to reject no effect.”
Ask whether the uplift for SubjectLine 2 vs 1 is both statistically significant (p < 0.05) and practically meaningful (large enough in percentage terms to matter for the business). Here the answer is no on both counts: the uplift is only about +0.05 percentage points, the 95% CI comfortably includes 0, and p ≈ 0.52, so there is no evidence that SubjectLine 2 outperforms SubjectLine 1.
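To make this kind of read-out repeatable, here is a small helper — a sketch (the interpret name is our own) that reuses the result dictionaries computed above and turns each one into the verdict described by the rule of thumb. It is applied to both comparisons at once, anticipating the next section:
def interpret(result, label, alpha=0.05):
    """Print a plain-English verdict based on the p-value and 95% CI."""
    delta_pp = result['delta'] * 100  # effect in percentage points
    if result['p_value'] < alpha:
        verdict = "statistically significant at the 5% level"
    else:
        verdict = "not statistically significant (cannot rule out zero effect)"
    print(f"{label}: delta = {delta_pp:.2f} pp, "
          f"95% CI = [{result['ci_low']*100:.2f}, {result['ci_high']*100:.2f}] pp, "
          f"p = {result['p_value']:.4f} -> {verdict}")

interpret(res_2_vs_1, "SubjectLine 2 vs 1")
interpret(res_3_vs_1, "SubjectLine 3 vs 1")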
5. Interpretation: SubjectLine 3 vs 1
r = res_3_vs_1
print("SubjectLine 3 vs 1:\n")
print(f"Control mean (1): {r['mu_c']:.4f}")
print(f"Treatment mean (3): {r['mu_t']:.4f}")
print(f"Difference (delta): {r['delta']:.4f}")
print(f"Standard error (SE): {r['se']:.6f}")
print(f"95% CI for delta: [{r['ci_low']:.4f}, {r['ci_high']:.4f}]")
print(f"t-statistic: {r['t_stat']:.3f}")
print(f"p-value: {r['p_value']:.4f}")
SubjectLine 3 vs 1:
Control mean (1): 0.6021
Treatment mean (3): 0.5949
Difference (delta): -0.0072
Standard error (SE): 0.000763
95% CI for delta: [-0.0087, -0.0057]
t-statistic: -9.391
p-value: 0.0000
How to read this (3 vs 1)
The same logic applies here as for SubjectLine 2 vs 1:
- Check the sign and magnitude of delta to see whether SubjectLine 3 is doing better or worse than SubjectLine 1, and by how many percentage points.
- Look at the 95% CI to see the range of plausible true effects.
- Check the p-value to see whether the improvement (or drop) is statistically significant.

Here, SubjectLine 3 performs about 0.72 percentage points worse than SubjectLine 1, the 95% CI lies entirely below zero, and the drop is highly statistically significant (t ≈ -9.4).
From a marketing standpoint, you would typically pick the subject line that is:
- Statistically significantly better than control (if any), and
- Practically large enough to justify rollout (for example, +0.5–1.0 percentage point uplift might be interesting, depending on scale and economics).
6. Power and Sample Size (Facure’s Rule of Thumb)
Chapter 2 introduces a simple and very useful approximation for planning experiment size.
If we want:
- 95% significance (alpha = 5%)
- 80% power (1 - beta = 80%)
Then the minimum detectable effect delta (in absolute terms) needs to satisfy:
delta ≈ 2.8 * SE
And if we open up SE and solve for n (per group), we get the rule of thumb:
n ≈ 16 * sigma^2 / delta^2
where:
- sigma^2 is the variance of the outcome (here: a 0/1 response),
- delta is the smallest effect you care to detect (for example, a 1 percentage point uplift = 0.01).
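Where do the 2.8 and the 16 come from? Under a normal approximation, detecting an effect at the 5% significance level (two-sided) requires about 1.96 standard errors, and reaching 80% power adds another 0.84, so delta ≈ (1.96 + 0.84) * SE ≈ 2.8 * SE. With equal group sizes n, the SE of the difference is sqrt(2 * sigma^2 / n); substituting and solving for n gives n ≈ 2 * 2.8^2 * sigma^2 / delta^2 ≈ 15.7 * sigma^2 / delta^2, which rounds up to the 16 in the rule of thumb.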
# Use the control group variance as our estimate of sigma^2
sigma2 = control['responded'].var(ddof=1)
# Suppose we care about detecting a 1 percentage point difference (~0.01)
delta_target = 0.01
n_required = 16 * sigma2 / (delta_target ** 2)
print(f"Estimated sigma^2 from control: {sigma2:.6f}")
print(f"Target detectable effect (delta): {delta_target:.4f}")
print(f"Required sample size per group (approx): {n_required:.1f}")
Estimated sigma^2 from control: 0.239585
Target detectable effect (delta): 0.0100
Required sample size per group (approx): 38333.6
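As a cross-check on the rule of thumb (again assuming statsmodels is available), we can run a more exact power calculation for comparing two proportions; with a baseline response rate around 0.60 and a +0.01 uplift, it should land in the same ballpark as the number above:
# Optional: exact power calculation for a two-sample comparison of proportions
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

p_control = control['responded'].mean()
effect_size = proportion_effectsize(p_control + delta_target, p_control)

n_exact = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8,
    ratio=1.0, alternative='two-sided'
)
print(f"Required n per group (statsmodels): {n_exact:.1f}")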
6.1. Compare with actual sample sizes
n_ctrl = len(control)
n_2 = len(treat2)
n_3 = len(treat3)
print(f"Required n per group (approx): {n_required:.1f}")
print(f"Actual n (control, 1): {n_ctrl}")
print(f"Actual n (treatment, 2): {n_2}")
print(f"Actual n (treatment, 3): {n_3}")
Required n per group (approx): 38333.6
Actual n (control, 1): 826717
Actual n (treatment, 2): 824837
Actual n (treatment, 3): 824800
Interpretation of power / sample size
- If your actual group sizes (for control and each treatment) are larger than n_required, your experiment is roughly adequately powered to detect a 1 percentage point effect.
- If your actual group sizes are smaller than n_required, the experiment may be underpowered. In that case:
  - Failing to find a statistically significant effect does not mean there is no effect.
  - It may simply mean the sample size is too small to reliably detect the effect you care about.
This is exactly the warning in Chapter 2 that “absence of evidence is not evidence of absence.”
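Since the actual groups here are far larger than required, we can also invert the rule of thumb (delta_min = sqrt(16 * sigma^2 / n)) to see the smallest effect this experiment can reliably detect at its actual size:
# Minimum detectable effect (MDE) implied by the actual group sizes,
# inverting the rule of thumb: delta_min = sqrt(16 * sigma^2 / n)
for name, n in [('control (1)', n_ctrl), ('treatment 2', n_2), ('treatment 3', n_3)]:
    mde = np.sqrt(16 * sigma2 / n)
    print(f"{name}: n = {n}, MDE = {mde:.4f} ({mde*100:.2f} percentage points)")

With roughly 825,000 emails per group, the MDE works out to about 0.2 percentage points, which is why even the modest 0.72 point drop for SubjectLine 3 shows up as highly significant.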
7. Final Summary (for Marketing / Product Stakeholders)
Using the email subject line experiment, we have:
- Treated the experiment as two separate randomized comparisons:
- SubjectLine 2 vs SubjectLine 1
- SubjectLine 3 vs SubjectLine 1
- Computed the uplift in response rate for each comparison.
- Quantified uncertainty using standard errors and 95% confidence intervals.
- Performed hypothesis tests and examined p-values for statistical significance.
- Used a simple power and sample size formula to check whether the experiment was adequately sized to detect a 1 percentage point uplift.
From an experimentation and causal inference perspective, the key takeaways are:
- Randomized experiments allow causal interpretation of differences in response rates, assuming customers were randomly assigned to subject lines.
- Confidence intervals are as important as point estimates; they make the uncertainty around each estimate explicit.
- Failing to achieve statistical significance does not prove “no effect”—especially if the experiment is underpowered.
- Planning experiments with power and minimum detectable effect in mind is crucial for making reliable business decisions.
You can now adapt this notebook to any binary outcome experiment in marketing:
new creatives, pricing tests, cross-sell offers, or personalization strategies.
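As a starting point for such an adaptation, here is a minimal sketch (reusing diff_in_means_analysis and df from above; the compare_all_vs_control name is our own) that runs every treatment-vs-control comparison for an arbitrary group column:
def compare_all_vs_control(data, group_col, outcome_col, control_label):
    """Run diff_in_means_analysis for each non-control group against the control."""
    control_grp = data[data[group_col] == control_label]
    results = {}
    for label in sorted(data[group_col].unique()):
        if label == control_label:
            continue
        treat_grp = data[data[group_col] == label]
        results[f"{label}_vs_{control_label}"] = diff_in_means_analysis(
            treat_grp, control_grp, outcome_col=outcome_col
        )
    return pd.DataFrame.from_dict(results, orient='index')

# Reproduces the summary table from Section 3.1
compare_all_vs_control(df, 'SubjectLine_ID', 'responded', control_label=1)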