Email Subject Line Experiment
Randomized Experiments and Statistical Review
Abstract
This analysis builds practical intuition for randomized experiments and how to correctly interpret treatment effects in real-world settings.
Using a realistic email campaign example, we analyze how different subject lines impact customer response rates. Beyond computing differences, the focus is on understanding uncertainty, statistical significance, and whether observed effects are actually meaningful.
The goal is to answer a practical question:
When we see a lift in an A/B test, how do we know if it is real, reliable, and worth acting on?
Introduction
Randomized experiments are the gold standard for measuring causal effects. When implemented correctly, they allow us to isolate the impact of an intervention from noise and confounding factors.
In practice, however, many mistakes arise not from running experiments, but from incorrect interpretation of results — overreacting to noise, misreading statistical significance, or ignoring practical impact.
This tutorial focuses on building intuition, not just formulas.
We use a simple but realistic marketing example — an email campaign — to walk through how to:
- Compare treatment vs control groups
- Quantify uncertainty using standard errors and confidence intervals
- Evaluate statistical significance
- Interpret results in a business context
Data Source
The dataset used in this analysis is publicly available on Kaggle:
https://www.kaggle.com/datasets/aristotelisch/playground-mock-email-campaign
It simulates a real-world email marketing campaign with randomized subject lines and customer responses.
Dataset
The analysis uses three datasets:
- sent_emails.csv — when and to whom each email was sent (includes Customer_ID, SubjectLine_ID)
- responded.csv — which customers responded (includes Customer_ID)
- userbase.csv — customer-level attributes (not required for this analysis, but available)
Experiment Setup
We construct a randomized experiment with one control and two treatments:
- Control: SubjectLine_ID = 1
- Treatment A: SubjectLine_ID = 2 vs control
- Treatment B: SubjectLine_ID = 3 vs control
The outcome variable is:
- responded = 1 if the customer appears in responded.csv, else 0
What We Will Do
For each treatment vs control comparison, we will compute:
- Difference in mean response rates (treatment effect)
- Standard Error (SE)
- 95% Confidence Interval
- t-statistic and p-value
We will also:
- Apply a simple sample size rule-of-thumb
- Interpret results in business terms (not just statistical terms)
Layout of the Analysis
The notebook is structured as follows:
- Load and merge datasets
- Construct experiment groups
- Compute treatment effects
- Estimate uncertainty (SE and confidence intervals)
- Perform hypothesis testing
- Evaluate statistical vs practical significance
- Summarize key takeaways
Reference
This analysis builds on standard concepts in randomized experiments from:
- Matheus Facure — Causal Inference for the Brave and True (Chapter 2: Randomized Experiments)
However, all examples, datasets, and interpretations are tailored to practical marketing use cases.
1. Load and Merge the Datasets
import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
pd.set_option("display.precision", 4)
# Load the three CSVs
sent = pd.read_csv("./chapter2_data/sent_emails.csv")
resp = pd.read_csv("./chapter2_data/responded.csv")
users = pd.read_csv("./chapter2_data/userbase.csv")
# Mark everyone in 'responded' as responded = 1
resp['responded'] = 1
# Collapse in case some customers responded multiple times
resp_flag = resp.groupby('Customer_ID', as_index=False)['responded'].max()
# Merge sent + response flags (left join keeps all sent emails)
df = sent.merge(resp_flag, on='Customer_ID', how='left')
# Customers not in responded.csv get responded = 0
df['responded'] = df['responded'].fillna(0).astype(int)
df.head()
| | Sent_Date | Customer_ID | SubjectLine_ID | responded |
|---|---|---|---|---|
| 0 | 2016-01-28 | 1413 | 2 | 1 |
| 1 | 2016-03-02 | 83889 | 2 | 1 |
| 2 | 2016-03-09 | 457832 | 3 | 0 |
| 3 | 2016-01-20 | 127772 | 1 | 0 |
| 4 | 2016-02-03 | 192123 | 3 | 1 |
Quick sanity checks
print("Rows in sent_emails:", len(sent))
print("Unique customers in sent_emails:", sent['Customer_ID'].nunique())
print("Rows in responded:", len(resp))
print("Unique customers in responded:", resp['Customer_ID'].nunique())
print("\nResponse rate overall:")
print(df['responded'].mean())
print("\nDistribution of SubjectLine_ID:")
print(df['SubjectLine_ID'].value_counts())
Rows in sent_emails: 2476354
Unique customers in sent_emails: 496518
Rows in responded: 378208
Unique customers in responded: 264859
Response rate overall:
0.5998322533854207
Distribution of SubjectLine_ID:
SubjectLine_ID
1 826717
2 824837
3 824800
Name: count, dtype: int64
2. Define Control and Treatment Groups
We use the following structure:
- Control group: customers who received SubjectLine_ID = 1
- Treatment A: customers who received SubjectLine_ID = 2
- Treatment B: customers who received SubjectLine_ID = 3
We will treat each comparison separately:
- SubjectLine 2 vs SubjectLine 1
- SubjectLine 3 vs SubjectLine 1
This is equivalent to running two two-arm experiments that share the same control group.
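Because both treatments are tested against the same control, two hypothesis tests share one baseline, which inflates the chance of a false positive. A minimal sketch of a Bonferroni adjustment (the p-values below are illustrative placeholders, not the experiment's results):

```python
# Bonferroni correction for two comparisons sharing the same control.
# These p-values are hypothetical, for illustration only.
p_values = [0.42, 0.003]
alpha = 0.05

# Divide alpha by the number of comparisons made against the shared control
adjusted_alpha = alpha / len(p_values)  # 0.05 / 2 = 0.025
significant = [p < adjusted_alpha for p in p_values]
print(adjusted_alpha, significant)  # 0.025 [False, True]
```

With only two arms and effects as clear-cut as in this dataset, the correction rarely changes conclusions, but it is cheap insurance when more subject lines are added.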
control = df[df['SubjectLine_ID'] == 1].copy()
treat2 = df[df['SubjectLine_ID'] == 2].copy()
treat3 = df[df['SubjectLine_ID'] == 3].copy()
print("Sample sizes:")
print("Control (1): ", len(control))
print("Treat 2: ", len(treat2))
print("Treat 3: ", len(treat3))
print("\nResponse rates by group:")
group_rates = df.groupby('SubjectLine_ID')['responded'].mean()
print(group_rates)
Sample sizes:
Control (1): 826717
Treat 2: 824837
Treat 3: 824800
Response rates by group:
SubjectLine_ID
1 0.6021
2 0.6025
3 0.5949
Name: responded, dtype: float64
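Since matplotlib is already imported, a quick bar chart makes the comparison visual. A minimal sketch using the group rates printed above (hard-coded here for self-containment):

```python
import matplotlib.pyplot as plt

# Response rates from the group summary above
rates = {"1 (control)": 0.6021, "2": 0.6025, "3": 0.5949}

fig, ax = plt.subplots(figsize=(5, 3))
bars = ax.bar(rates.keys(), rates.values(),
              color=["gray", "steelblue", "steelblue"])
ax.set_xlabel("SubjectLine_ID")
ax.set_ylabel("Response rate")
ax.set_ylim(0.55, 0.62)  # zoom in, since all rates sit near 0.60
ax.set_title("Response rate by subject line")
plt.tight_layout()
```

Note the truncated y-axis: it helps see the small differences, but remember they are fractions of a percentage point.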
3. Difference in Means, SE, CI, t-Statistic, p-Value
We now create a small helper function that, given a treatment group and a control group, computes:
- Mean response in treatment and control
- Difference in means (delta)
- Standard error of the difference
- 95% confidence interval for delta
- t-statistic under H0: delta = 0
- Two-sided p-value
This mirrors the logic in standard literature: a simple difference in means test between treatment and control in a randomized experiment.
def diff_in_means_analysis(treat, control, outcome_col='responded'):
    """Compute difference in means, SE, 95% CI, t-statistic and p-value."""
    mu_t = treat[outcome_col].mean()
    mu_c = control[outcome_col].mean()
    delta = mu_t - mu_c
    # Standard error for a difference in independent means
    se = np.sqrt(
        treat[outcome_col].var(ddof=1) / len(treat) +
        control[outcome_col].var(ddof=1) / len(control)
    )
    ci_low = delta - 1.96 * se
    ci_high = delta + 1.96 * se
    # t-statistic for H0: delta = 0; with samples this large the normal
    # approximation to the t distribution is effectively exact
    t_stat = delta / se if se > 0 else np.nan
    p_value = 2 * (1 - stats.norm.cdf(abs(t_stat))) if se > 0 else np.nan
    return {
        'mu_t': mu_t,
        'mu_c': mu_c,
        'delta': delta,
        'se': se,
        'ci_low': ci_low,
        'ci_high': ci_high,
        't_stat': t_stat,
        'p_value': p_value
    }
res_2_vs_1 = diff_in_means_analysis(treat2, control)
res_3_vs_1 = diff_in_means_analysis(treat3, control)
res_2_vs_1, res_3_vs_1
({'mu_t': np.float64(0.6025457150928002),
'mu_c': np.float64(0.602054874884634),
'delta': np.float64(0.0004908402081661434),
'se': np.float64(0.000761672396676801),
'ci_low': np.float64(-0.0010020376893203867),
'ci_high': np.float64(0.0019837181056526734),
't_stat': np.float64(0.6444243093325865),
'p_value': np.float64(0.5193003251701691)},
{'mu_t': np.float64(0.5948908826382153),
'mu_c': np.float64(0.602054874884634),
'delta': np.float64(-0.007163992246418727),
'se': np.float64(0.0007628828501613492),
'ci_low': np.float64(-0.008659242632734971),
'ci_high': np.float64(-0.005668741860102482),
't_stat': np.float64(-9.390684618095095),
'p_value': np.float64(0.0)})
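As a sanity check on the helper's formulas, the manual SE and t-statistic should match Welch's unequal-variance t-test from scipy, which uses exactly this standard error. A self-contained sketch on simulated 0/1 responses (not the campaign data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
a = rng.binomial(1, 0.60, size=5000)  # simulated "treatment" responses
b = rng.binomial(1, 0.59, size=5000)  # simulated "control" responses

delta = a.mean() - b.mean()
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
t_manual = delta / se

# Welch's t-test uses the same SE, so the t-statistics should agree
t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)
print(abs(t_manual - t_scipy) < 1e-9)  # True
```

The p-values differ only in the tail distribution (t vs normal), which is negligible at these sample sizes.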
3.1. Summary table of numerical results
summary = pd.DataFrame.from_dict({
'2_vs_1': res_2_vs_1,
'3_vs_1': res_3_vs_1
}, orient='index')
summary
| | mu_t | mu_c | delta | se | ci_low | ci_high | t_stat | p_value |
|---|---|---|---|---|---|---|---|---|
| 2_vs_1 | 0.6025 | 0.6021 | 0.0005 | 0.0008 | -0.0010 | 0.0020 | 0.6444 | 0.5193 |
| 3_vs_1 | 0.5949 | 0.6021 | -0.0072 | 0.0008 | -0.0087 | -0.0057 | -9.3907 | 0.0000 |
4. Interpretation: SubjectLine 2 vs 1
r = res_2_vs_1
print("SubjectLine 2 vs 1:\n")
print(f"Control mean (1): {r['mu_c']:.4f}")
print(f"Treatment mean (2): {r['mu_t']:.4f}")
print(f"Difference (delta): {r['delta']:.4f}")
print(f"Standard error (SE): {r['se']:.6f}")
print(f"95% CI for delta: [{r['ci_low']:.4f}, {r['ci_high']:.4f}]")
print(f"t-statistic: {r['t_stat']:.3f}")
print(f"p-value: {r['p_value']:.4f}")
SubjectLine 2 vs 1:
Control mean (1): 0.6021
Treatment mean (2): 0.6025
Difference (delta): 0.0005
Standard error (SE): 0.000762
95% CI for delta: [-0.0010, 0.0020]
t-statistic: 0.644
p-value: 0.5193
How to read this (2 vs 1)
- The difference (delta) tells you how much higher the response rate is for SubjectLine 2 compared to SubjectLine 1 in absolute terms (for example, a value of 0.012 means a 1.2 percentage point uplift).
- The 95% confidence interval shows the range of plausible values for the true treatment effect, given this sample.
- If the CI includes 0, we cannot rule out the possibility of no true effect.
- The t-statistic measures how many standard errors away from 0 the observed delta is.
- The p-value is the probability of seeing a difference at least as extreme as this one if the true delta were actually 0 (no effect).

Practical rule of thumb (following standard A/B testing interpretation):

- If p_value < 0.05, we say the effect is statistically significant at the 5% level.
- If p_value >= 0.05, we say "we do not have enough evidence to reject no effect."
You should now look at your numbers above and ask:
- Is the uplift for SubjectLine 2 vs 1 both statistically significant (p < 0.05) and practically meaningful (large enough in percentage terms to matter for the business)?
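Stakeholders often think in relative lift rather than absolute percentage points. A quick conversion using the 2-vs-1 numbers from the summary above (hard-coded here for self-containment):

```python
# 2 vs 1 results from the summary table above
mu_c, delta, ci_low, ci_high = 0.6021, 0.0005, -0.0010, 0.0020

# Relative lift: effect as a fraction of the control's response rate
rel_lift = delta / mu_c
print(f"Relative lift: {rel_lift:+.2%} "
      f"(95% CI: {ci_low / mu_c:+.2%} to {ci_high / mu_c:+.2%})")
```

Even the upper end of the CI is well under a 1% relative improvement, which reinforces that this is not a practically meaningful effect.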
5. Interpretation: SubjectLine 3 vs 1
r = res_3_vs_1
print("SubjectLine 3 vs 1:\n")
print(f"Control mean (1): {r['mu_c']:.4f}")
print(f"Treatment mean (3): {r['mu_t']:.4f}")
print(f"Difference (delta): {r['delta']:.4f}")
print(f"Standard error (SE): {r['se']:.6f}")
print(f"95% CI for delta: [{r['ci_low']:.4f}, {r['ci_high']:.4f}]")
print(f"t-statistic: {r['t_stat']:.3f}")
print(f"p-value: {r['p_value']:.4f}")
SubjectLine 3 vs 1:
Control mean (1): 0.6021
Treatment mean (3): 0.5949
Difference (delta): -0.0072
Standard error (SE): 0.000763
95% CI for delta: [-0.0087, -0.0057]
t-statistic: -9.391
p-value: 0.0000
How to read this (3 vs 1)
The same logic applies here as for SubjectLine 2 vs 1:
- Check the sign and magnitude of delta to see whether SubjectLine 3 is doing better or worse than SubjectLine 1, and by how many percentage points.
- Look at the 95% CI to see the range of plausible true effects.
- Check the p-value to see whether the improvement (or drop) is statistically significant.
From a marketing standpoint, you would typically pick the subject line that is:
- Statistically significantly better than control (if any), and
- Practically large enough to justify rollout (for example, +0.5–1.0 percentage point uplift might be interesting, depending on scale and economics).
6. Power and Sample Size (Facure’s Rule of Thumb)
A commonly used approximation for planning experiments is:
If we want:
- 95% significance (alpha = 5%)
- 80% power (1 - beta = 80%)
Then the minimum detectable effect delta (in absolute terms) needs to satisfy:
delta ≈ 2.8 * SE
And if we open up SE and solve for n (per group), we get the rule of thumb:
n ≈ 16 * sigma^2 / delta^2
where:

- sigma^2 is the variance of the outcome (here: response 0/1),
- delta is the smallest absolute effect you care to detect (for example, a 1 percentage point uplift = 0.01).
# Use the control group variance as our estimate of sigma^2
sigma2 = control['responded'].var(ddof=1)
# Suppose we care about detecting a 1 percentage point difference (~0.01)
delta_target = 0.01
n_required = 16 * sigma2 / (delta_target ** 2)
print(f"Estimated sigma^2 from control: {sigma2:.6f}")
print(f"Target detectable effect (delta): {delta_target:.4f}")
print(f"Required sample size per group (approx): {n_required:.1f}")
Estimated sigma^2 from control: 0.239585
Target detectable effect (delta): 0.0100
Required sample size per group (approx): 38333.6
6.1. Compare with actual sample sizes
n_ctrl = len(control)
n_2 = len(treat2)
n_3 = len(treat3)
print(f"Required n per group (approx): {n_required:.1f}")
print(f"Actual n (control, 1): {n_ctrl}")
print(f"Actual n (treatment, 2): {n_2}")
print(f"Actual n (treatment, 3): {n_3}")
Required n per group (approx): 38333.6
Actual n (control, 1): 826717
Actual n (treatment, 2): 824837
Actual n (treatment, 3): 824800
Interpretation of power / sample size
- If your actual group sizes (for control and each treatment) are larger than n_required, your experiment is roughly properly powered to detect a 1 percentage point effect.
- If your actual group sizes are smaller than n_required, the experiment might be underpowered. In this case:
  - Failing to find a statistically significant effect does not mean there is no effect.
  - It may simply mean the sample size is too small to reliably detect the effect you care about.
This is exactly the warning in the standard literature that “absence of evidence is not evidence of absence.”
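The constant 16 in the rule of thumb is not arbitrary: it comes from 2·(z_{1−α/2} + z_{1−β})² ≈ 2·(1.96 + 0.84)² ≈ 15.7, rounded up. A sketch verifying this with the normal quantiles (sigma^2 here is an assumed value near the control variance, not recomputed from the data):

```python
from scipy import stats

sigma2 = 0.24   # assumed outcome variance for a ~60% response rate
delta = 0.01    # smallest effect worth detecting (1 percentage point)

z_alpha = stats.norm.ppf(0.975)  # ≈ 1.96 for alpha = 5%, two-sided
z_beta = stats.norm.ppf(0.80)    # ≈ 0.84 for 80% power

constant = 2 * (z_alpha + z_beta) ** 2   # ≈ 15.7, rounded up to 16
n_rule = 16 * sigma2 / delta ** 2        # rule-of-thumb sample size
n_exact = constant * sigma2 / delta ** 2 # normal-approximation version

print(round(constant, 2), round(n_rule), round(n_exact))
```

The two sample sizes differ by only a couple of percent, which is why the simpler "16" version is good enough for planning.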
7. Final Summary (for Marketing / Product Stakeholders)
Using the email subject line experiment, we evaluated two treatments against a control:
- SubjectLine 2 vs SubjectLine 1
- SubjectLine 3 vs SubjectLine 1
What the data shows
- SubjectLine 2 vs 1
- Uplift: ~ +0.05 percentage points
- 95% CI includes 0
- p-value ≈ 0.52
→ No statistically significant difference
- SubjectLine 3 vs 1
- Effect: ~ –0.72 percentage points
- 95% CI entirely below 0
- p-value ≈ 0.00
→ Statistically significant decrease in performance
Business interpretation
- There is no evidence that SubjectLine 2 improves performance over the control.
- There is strong evidence that SubjectLine 3 performs worse than the control.
Decision:
- Keep SubjectLine 1 (baseline)
- SubjectLine 2 is neutral → no reason to roll out
- SubjectLine 3 should be rejected
Was the experiment reliable?
- Required sample size (for detecting ~1pp effect): ~38K per group
- Actual sample size: ~824K per group
→ The experiment is well-powered, meaning:
- Lack of significance for SubjectLine 2 is not due to low sample size
- The null result is credible
Key takeaways (Causal + Experimentation)
- Randomization enables causal interpretation
  Differences observed here can be interpreted as true treatment effects.
- Statistical significance matters — but direction matters too
  Not all treatments fail equally:
  - Some show no effect (SubjectLine 2)
  - Some show negative impact (SubjectLine 3)
- Confidence intervals are more informative than p-values alone
  They show both the magnitude and the uncertainty of the effect.
- Underpowered vs well-powered experiments must be distinguished
  Here, large sample sizes ensure conclusions are reliable.
- Absence of lift ≠ missed opportunity
  It often means the variant truly does not improve performance.
Practical takeaway
In real marketing experiments:
- Focus on both statistical significance and business impact
- Avoid rolling out variants with:
- Insignificant uplift
- Negative impact (even if small)
- Use power calculations upfront to ensure experiments are conclusive
This framework can be directly applied to:
- Subject line testing
- Creative optimization
- Pricing experiments
- Personalization strategies
…and any other setting where decisions depend on small but meaningful differences in response rates.