10: A/B Testing and Causal Inference

Theory to Practice with Dr. Demetri Pananos


Who Am I?

Causal Inference? In This Economy?

Causal Inference for Cheap

Agenda

Break

Stats Refresher

The Central Limit Theorem

The Central Limit Theorem tells us that the sample mean can be thought of as a normal random variable. Let $Y_1, \cdots, Y_n$ be i.i.d random variables so that $E[Y] = \mu$ and $\operatorname{Var}(Y) = \sigma^2$.

Recall that $$ \bar Y = \dfrac{1}{n} \sum_{i=1}^n Y_i $$

CLT says $$ \bar Y \sim \operatorname{Normal}(\mu, \tfrac{\sigma^2}{n}) $$

This means that $E[\bar Y] = \mu$ and $\operatorname{Var}(\bar Y) = \tfrac{\sigma^2}{n}$.
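We can see the CLT at work with a short simulation. This is a sketch using a deliberately skewed distribution (exponential with mean 1 and variance 1, so $\mu = 1$ and $\sigma^2/n = 1/100$ when $n = 100$); the sample sizes and seed are arbitrary choices for illustration.

```r
# CLT in action: sample means of a skewed distribution still look normal.
set.seed(42)
n <- 100          # observations per sample
n_sims <- 10000   # number of simulated sample means

# Exponential(rate = 1) has mean 1 and variance 1 -- very skewed.
ybar <- replicate(n_sims, mean(rexp(n, rate = 1)))

mean(ybar)  # close to mu = 1
var(ybar)   # close to sigma^2 / n = 1 / 100
hist(ybar)  # roughly bell shaped, despite the skewed raw data
```

Even though a single exponential draw is far from normal, the histogram of sample means is approximately $\operatorname{Normal}(1, 1/100)$.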

Thought Exercise: Calculus Grades

I want you to imagine:

Consider:

Example of CLT in Action

Hypothesis Tests For Means

We can perform a hypothesis test for the mean by computing the following test statistic

$$ Z = \dfrac{\bar y - \mu_0}{\sqrt{\dfrac{s^2}{n}}} $$

The test statistic for two means being equal is

$$ Z = \dfrac{\bar x - \bar y}{\sqrt{\tfrac{s^2_x}{n_x} + \tfrac{s^2_y}{n_y}}} $$
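The two-sample statistic is easy to compute by hand in R. The data here are simulated (hypothetical treatment and control vectors `x` and `y`); the point is just the arithmetic of the formula above.

```r
# Two-sample z test "by hand" on hypothetical group outcomes.
set.seed(1)
x <- rnorm(500, mean = 10.5, sd = 3)  # e.g. treatment group
y <- rnorm(500, mean = 10.0, sd = 3)  # e.g. control group

# Test statistic for the difference in means.
z <- (mean(x) - mean(y)) / sqrt(var(x) / length(x) + var(y) / length(y))

# Two-sided p value from the standard normal distribution.
p_value <- 2 * pnorm(-abs(z))
```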

Confidence Intervals & P Values

Two ways to talk about a hypothesis test:

Pokemon and Causal Inference

Fossils at Mt. Moon

Rubin’s Causal Model & Potential Outcomes

Let $A_i=0, 1$ be an indicator for some treatment. Then $Y_i(A_i)$ is the potential outcome under that treatment.

Example: “My headache went away because I took an aspirin.”

We Never See Both Potential Outcomes

In reality, we never know both $Y_i(A_i=0)$ and $Y_i(A_i=1)$ – we can only ever see one.

We can relate the observed data to the potential outcomes via the switching equation

$$ Y_i = A_i Y_i(A_i=1) + (1-A_i) Y_i(A_i=0) $$
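The switching equation translates directly into code. Here is a sketch with five hypothetical units and made-up potential outcomes: the observed outcome is whichever potential outcome matches the treatment each unit actually received.

```r
# Hypothetical potential outcomes for five units (1 = headache, 0 = none).
y0 <- c(1, 0, 1, 1, 0)  # outcome if untreated
y1 <- c(0, 0, 1, 0, 1)  # outcome if treated
a  <- c(1, 0, 1, 0, 1)  # treatment actually received

# The switching equation picks out the observed outcome.
y_obs <- a * y1 + (1 - a) * y0
y_obs  # c(0, 0, 1, 1, 1)
```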

Causal Effects

If I knew both potential outcomes, I could compute the causal effect of the action

$$ \tau_i = Y_i(A_i=1) - Y_i(A_i=0) $$

Let’s assume $Y_i(A)$ can be 0 (no headache) or 1 (headache). What does each possible value of $\tau_i$ mean about the treatment effect?

Average Causal Effects

We can talk about the average potential outcome $E[Y(A)]$. This is different than $E[Y\mid A]$!

Often, we want to know how things change on average, so we could compute

$$ \tau = E[Y_i(A_i=1)] - E[Y_i(A_i=0)] $$

or

$$ \lambda = \dfrac{E[Y_i(A_i=1)]}{E[Y_i(A_i=0)]} - 1 $$

How Can We Estimate Causal Effects?

If we can only ever see one potential outcome, how can we ever estimate causal effects?

Can’t we just compare people who took aspirin with people who did not, and see what the difference was?

What if:

Discuss

Confounding

Confounding: a type of bias that occurs when the association you observe between an exposure and an outcome is distorted by a third variable.

Note that generally $E[Y(A)] \neq E[Y|A]$! This is a subtle but important distinction, so make sure you understand it.

Let’s look at an example

Confounding Example
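The code below assumes a data frame `d` holding each unit's potential outcomes and treatment. Here is one hypothetical table (out of many possible) constructed so the fractions match the comments in the analysis code: aspirin truly helps, but the people who chose to take it were different from those who didn't.

```r
# Hypothetical potential-outcomes table (1 = headache, 0 = no headache),
# chosen so the fractions match the comments in the analysis below.
d <- data.frame(
  y0 = c(1, 1, 0, 1, 0),  # potential outcome without aspirin
  y1 = c(1, 1, 0, 0, 0),  # potential outcome with aspirin
  a  = c(1, 1, 1, 0, 0)   # who actually took aspirin
)
# Observed outcome via the switching equation.
d$y <- d$a * d$y1 + (1 - d$a) * d$y0
```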

# True causal effect: aspirin relieves headache
# = 2/5 - 3/5
mean(d$y1) - mean(d$y0)

# If we just looked at the observed data: aspirin appears to increase headache
# = 2/3 - 1/2
mean(d$y[d$a==1]) - mean(d$y[d$a==0])

How Can We Do Causal Inference At All?

How can we ever do causal inference at all? Is this hopeless?

3 Assumptions for Causal Inference

I said generally $E[Y(A)] \neq E[Y|A]$, but under these 3 assumptions $E[Y(A)] = E[Y|A]$, and we can do causal inference!

If we have consistency, positivity, and exchangeability, then it is safe to assume $E[Y_i \mid A=a] = E[Y_i(A=a)]$.

Randomization Gives Us All 3!

The easiest way to achieve all 3 is via randomization.

So long as we can randomize, we can compare group means and that is a valid estimate of the causal effect!
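A quick simulation illustrates why. This is a sketch with hypothetical headache rates (true average effect of $-0.2$): because assignment is random, it is independent of the potential outcomes, and the simple difference in observed group means lands on the true causal effect.

```r
# Randomization makes difference-in-means a valid causal estimate.
set.seed(2024)
n <- 100000
y0 <- rbinom(n, 1, 0.6)       # headache rate without treatment
y1 <- rbinom(n, 1, 0.4)       # headache rate with treatment (true effect -0.2)
a  <- rbinom(n, 1, 0.5)       # randomized 50/50 assignment
y  <- a * y1 + (1 - a) * y0   # observed outcome (switching equation)

# Close to the true average causal effect of -0.2.
mean(y[a == 1]) - mean(y[a == 0])
```

Contrast this with the confounding example earlier, where the same difference in means pointed in the wrong direction because people chose their own treatment.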

Sounds Simple, Right?

Break

Practical A/B Testing

Steps To an A/B Test

  1. Convince people to run an A/B test
  2. Get Specific!
  3. Determine how long the test should run
  4. Align with engineering
  5. Run the test and monitor
  6. Analyze the test
  7. Communicate results

A “Real Example”

Convince People To Run an A/B Test

Get Specific!

Determine How Long the Test Should Run

Align With Engineering

Run The Test and Monitor


# Sample ratio mismatch (SRM) check: are users split 50/50 as designed?
chisq.test(c(34603, 34583), p = c(0.5, 0.5))

Analyze The Test


$$ \hat \lambda = \dfrac{\bar y_t}{\bar y_c} - 1$$

$$ \widehat{\operatorname{Var}}(\hat{\lambda}) = \left(\frac{\bar{y}_t}{\bar{y}_c}\right)^2 \left( \frac{s_t^2}{n_t \bar{y}_t^{2}} + \frac{s_c^2}{n_c \bar{y}_c^{2}} \right) $$
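Putting the lift estimate and its delta-method variance together in R, on simulated per-user outcomes (the revenue numbers and group sizes here are hypothetical):

```r
# Relative lift and a delta-method 95% CI, on hypothetical per-user revenue.
set.seed(7)
y_t <- rexp(5000, rate = 1 / 10.4)  # treatment group outcomes
y_c <- rexp(5000, rate = 1 / 10.0)  # control group outcomes

lambda_hat <- mean(y_t) / mean(y_c) - 1

# Delta-method variance of the lift estimate.
var_hat <- (mean(y_t) / mean(y_c))^2 *
  (var(y_t) / (length(y_t) * mean(y_t)^2) +
   var(y_c) / (length(y_c) * mean(y_c)^2))

# 95% confidence interval for the lift.
ci <- lambda_hat + c(-1, 1) * 1.96 * sqrt(var_hat)
```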

Communicate Results

“The treatment had an estimated lift of 4.31%. If we had run this test at a different time, or on a different group of users, we would expect the lift to be between -0.58% and 9.2%. Since the confidence interval contains negative lifts, my recommendation would be to ship control.”


The Peeking Problem

Always Valid Confidence Intervals

Link: https://docs.geteppo.com/statistics/confidence-intervals/statistical-nitty-gritty/#sequential

PMs Want To Go Fast

$$ MDE \approx \dfrac{2.8 \sqrt{2 \dfrac{\sigma^2}{N}}}{\mu}$$
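This approximation is easy to play with in code. A sketch, where 2.8 is roughly $z_{0.975} + z_{0.80}$ (80% power, 5% two-sided significance), and the inverse function answers the PM's real question: how many users per group do we need to detect a given lift?

```r
# Relative minimum detectable effect for N users per group.
# 2.8 is approximately z_{0.975} + z_{0.80} (80% power, alpha = 0.05).
mde <- function(sigma, mu, N) 2.8 * sqrt(2 * sigma^2 / N) / mu

# Inverting the formula: users per group needed to detect a given lift.
n_required <- function(sigma, mu, mde) 2 * (2.8 * sigma / (mu * mde))^2

mde(sigma = 30, mu = 10, N = 50000)          # detectable lift with 50k/group
n_required(sigma = 30, mu = 10, mde = 0.02)  # users/group to detect a 2% lift
```

Note the square in the inversion: halving the MDE you want to detect quadruples the required sample size.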

CUPED

$$\operatorname{Var}(\tilde Y) = \operatorname{Var}(Y)(1-\rho^2) $$

CUPED: How It Works

  1. Pick a pre-experiment covariate $X$ (e.g. total revenue in the 28 days before the test).
  2. Compute the adjusted outcome for each user:

$$\tilde Y_i = Y_i - \theta (X_i - \bar X)$$

where $\theta = \operatorname{Cov}(Y, X) / \operatorname{Var}(X)$ (i.e. a regression coefficient).

  3. Run your usual z-test on $\tilde Y$ instead of $Y$.

Because $\operatorname{Var}(\tilde Y) = \operatorname{Var}(Y)(1 - \rho^2)$, a correlation of $\rho = 0.5$ cuts variance by 25%.
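The whole procedure is a few lines of R. A sketch on hypothetical data where pre-experiment revenue `x` is correlated with in-experiment revenue `y`:

```r
# CUPED sketch on hypothetical pre/post revenue data.
set.seed(123)
n <- 10000
x <- rnorm(n, mean = 50, sd = 20)     # pre-experiment covariate
y <- 5 + 0.5 * x + rnorm(n, sd = 10)  # in-experiment outcome, correlated with x

theta <- cov(y, x) / var(x)           # regression coefficient of y on x
y_tilde <- y - theta * (x - mean(x))  # CUPED-adjusted outcome

var(y_tilde) / var(y)                 # equals 1 - cor(y, x)^2
```

Because we subtract $X_i - \bar X$ (which has mean zero), $\tilde Y$ has the same mean as $Y$, so the treatment effect estimate is unchanged; only its variance shrinks.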

CUPED: Illustration

CUPED: Takeaways

A Lot To Learn

Fin