Part IV: Advanced Topics | Chapter 4

Causal Inference

From association to causation: potential outcomes, experiments, and structural models

Historical Context

The scientific study of causation has deep philosophical roots, from Hume's skepticism about inferring causes from observed regularities to Mill's methods of experimental inquiry. The statistical framework for causal inference took shape through two parallel traditions. Jerzy Neyman introduced potential outcomes in 1923 for randomized experiments, and Donald Rubin formalized the framework in the 1970s into what is now called the Rubin Causal Model, defining causal effects as comparisons of potential outcomes under different treatments for the same unit.

The second tradition, structural causal models (SCMs), was developed by Judea Pearl building on the path analysis of Sewall Wright (1921) and the structural equation models of econometrics. Pearl's do-calculus, introduced in the 1990s, provided a complete algorithmic framework for determining when causal effects are identifiable from observational data using directed acyclic graphs (DAGs). The synthesis of these approaches, along with practical methods like propensity score matching (Rosenbaum and Rubin, 1983), instrumental variables, and regression discontinuity designs, has made causal inference a central topic in statistics, epidemiology, economics, and the social sciences.

4.1 The Potential Outcomes Framework

Consider $n$ units, each potentially exposed to treatment ($W_i = 1$) or control ($W_i = 0$). For each unit $i$, there exist two potential outcomes: $Y_i(1)$ (outcome under treatment) and $Y_i(0)$ (outcome under control). The fundamental problem of causal inference is that we observe only one of these for each unit:$Y_i^{\text{obs}} = W_i Y_i(1) + (1 - W_i) Y_i(0)$.

Definition: Causal Effects

The individual treatment effect is $\tau_i = Y_i(1) - Y_i(0)$, which is never directly observable. Population-level estimands include:

$$\text{ATE} = \mathbb{E}[Y_i(1) - Y_i(0)], \qquad \text{ATT} = \mathbb{E}[Y_i(1) - Y_i(0) \mid W_i = 1]$$

The Average Treatment Effect (ATE) averages over the entire population, while the Average Treatment Effect on the Treated (ATT) conditions on those who actually received treatment.

SUTVA: Stable Unit Treatment Value Assumption

SUTVA requires: (i) no interference -- the potential outcomes for unit$i$ depend only on $i$'s own treatment, not on others' treatments; (ii) no hidden variations of treatment -- there is only one version of each treatment level. Violations occur in settings with spillover effects (e.g., vaccination, social networks) or when treatment delivery varies across units.

The naive estimator $\hat{\tau}_{\text{naive}} = \bar{Y}_{\text{treated}} - \bar{Y}_{\text{control}}$is biased for the ATE whenever treatment assignment is confounded:

$$\mathbb{E}[\hat{\tau}_{\text{naive}}] = \text{ATE} + \underbrace{\mathbb{E}[Y(0) \mid W=1] - \mathbb{E}[Y(0) \mid W=0]}_{\text{selection bias}}$$

Eliminating selection bias is the central challenge of causal inference.

4.2 Randomized Experiments

Randomization eliminates selection bias by construction. Under complete randomization with $n_1$ units assigned to treatment and $n_0 = n - n_1$ to control:

$$\mathbb{E}[\bar{Y}_{\text{treated}} - \bar{Y}_{\text{control}}] = \text{ATE}$$

Neyman's Variance Estimator

The variance of the difference-in-means estimator under complete randomization is:

$$\text{Var}(\hat{\tau}) = \frac{S_1^2}{n_1} + \frac{S_0^2}{n_0} - \frac{S_{10}^2}{n}$$

where $S_w^2 = \frac{1}{n-1}\sum_i (Y_i(w) - \bar{Y}(w))^2$ are the potential outcome variances and $S_{10}^2$ involves the variance of individual treatment effects. Since $S_{10}^2$ is not identifiable, the conservative Neyman estimator $\hat{V} = S_1^2/n_1 + S_0^2/n_0$(using sample variances within groups) is standard.

Fisher's sharp null hypothesis $H_0: Y_i(1) = Y_i(0)$ for all$i$ (no treatment effect for any unit) enables exact randomization inference. Under this null, all potential outcomes are observed, and the permutation distribution of the test statistic over all possible random assignments provides exact p-values. Stratified and blocked randomization can improve precision by ensuring balance on known covariates; the variance reduction is$\Delta V \propto (\bar{Y}_{s1}(0) - \bar{Y}_{s2}(0))^2$ where the subscripts index strata.

4.3 Observational Studies: Propensity Scores and IPW

When randomization is infeasible, identification of causal effects requires the strong ignorability assumption: $(Y(0), Y(1)) \perp\!\!\!\perp W \mid \mathbf{X}$(conditional independence) and $0 < P(W=1 \mid \mathbf{X}) < 1$ (positivity). Under these conditions, the ATE is identified from observational data.

Propensity Score

The propensity score is $e(\mathbf{x}) = P(W = 1 \mid \mathbf{X} = \mathbf{x})$. Rosenbaum and Rubin (1983) showed that if strong ignorability holds given$\mathbf{X}$, then it also holds given $e(\mathbf{X})$ alone:

$$(Y(0), Y(1)) \perp\!\!\!\perp W \mid e(\mathbf{X})$$

This reduces the dimensionality of the conditioning set to a single scalar. Propensity score matching pairs treated units with control units having similar propensity scores, while subclassification stratifies on the propensity score.

Inverse Probability Weighting (IPW)

The Horvitz-Thompson IPW estimator for the ATE is:

$$\hat{\tau}_{\text{IPW}} = \frac{1}{n}\sum_{i=1}^n \left(\frac{W_i Y_i}{\hat{e}(\mathbf{X}_i)} - \frac{(1-W_i)Y_i}{1 - \hat{e}(\mathbf{X}_i)}\right)$$

IPW reweights observations to create a pseudo-population where treatment assignment is independent of covariates. The estimator can have large variance when propensity scores are extreme. Stabilized weights $\tilde{w}_i = W_i \bar{e}/\hat{e}_i + (1-W_i)(1-\bar{e})/(1-\hat{e}_i)$and trimming (restricting to $\hat{e} \in [\epsilon, 1-\epsilon]$) address this. The doubly robust estimator combines IPW with outcome regression and is consistent if either the propensity score or the outcome model is correctly specified.

4.4 Structural Causal Models and DAGs

A structural causal model (SCM) consists of a collection of structural equations $X_j = f_j(\text{pa}(X_j), U_j)$ where $\text{pa}(X_j)$are the direct causes (parents) of $X_j$ and $U_j$ are exogenous noise variables. The causal graph is a directed acyclic graph (DAG) encoding these parent-child relationships.

The Do-Operator

Pearl's do-operator distinguishes intervention from conditioning. The interventional distribution $P(Y \mid \text{do}(X = x))$ is obtained by replacing the structural equation for $X$ with $X = x$, while keeping all other equations intact. This yields a modified (mutilated) graph.

$$P(Y \mid \text{do}(X = x)) \neq P(Y \mid X = x) \quad \text{in general}$$

The Backdoor Criterion

A set $\mathbf{Z}$ satisfies the backdoor criterion relative to $(X, Y)$ in a DAG $G$ if: (i) no node in $\mathbf{Z}$is a descendant of $X$, and (ii) $\mathbf{Z}$ blocks every path between $X$ and $Y$ that contains an arrow into$X$. When satisfied:

$$P(Y \mid \text{do}(X = x)) = \sum_{\mathbf{z}} P(Y \mid X = x, \mathbf{Z} = \mathbf{z}) P(\mathbf{Z} = \mathbf{z})$$

This is the adjustment formula, which expresses the causal effect in terms of observational quantities. Pearl's do-calculus provides three rules that are complete for identifying causal effects from any combination of observational and interventional data.

Key graphical concepts include d-separation (a criterion for reading conditional independencies from the DAG), colliders (conditioning on which opens spurious paths), and mediators (which transmit indirect effects). The front-door criterion enables identification even when no valid backdoor adjustment set exists, provided a mediating variable satisfies certain graphical conditions.

4.5 Instrumental Variables and Regression Discontinuity

When unobserved confounders exist (strong ignorability fails), alternative identification strategies exploit special features of the data or study design.

Instrumental Variables (IV)

An instrument $Z$ for the effect of $X$ on $Y$satisfies: (i) relevance: $Z$ affects $X$; (ii) exclusion: $Z$ affects $Y$ only through $X$; (iii) independence: $Z$ is independent of confounders. The IV estimand is:

$$\tau_{\text{IV}} = \frac{\text{Cov}(Y, Z)}{\text{Cov}(X, Z)} = \frac{\mathbb{E}[Y \mid Z=1] - \mathbb{E}[Y \mid Z=0]}{\mathbb{E}[X \mid Z=1] - \mathbb{E}[X \mid Z=0]}$$

This is the Wald estimator. With heterogeneous treatment effects and a monotonicity assumption, IV identifies the Local Average Treatment Effect(LATE) -- the ATE among compliers (units whose treatment status is changed by the instrument). Two-stage least squares (2SLS) generalizes to continuous instruments and multiple covariates.

Regression Discontinuity Design (RDD)

In the sharp RDD, treatment is deterministically assigned by whether a running variable $R_i$ exceeds a cutoff $c$:$W_i = \mathbf{1}(R_i \geq c)$. The causal effect at the cutoff is:

$$\tau_{\text{RDD}} = \lim_{r \downarrow c} \mathbb{E}[Y \mid R = r] - \lim_{r \uparrow c} \mathbb{E}[Y \mid R = r]$$

In the fuzzy RDD, the probability of treatment jumps discontinuously at $c$ but not from 0 to 1, yielding an IV-like estimator scaled by the jump in treatment probability. RDD requires only local continuity of potential outcome means at the cutoff, making it one of the most credible quasi-experimental designs.

Practical implementation of RDD involves choosing a bandwidth for local polynomial regression near the cutoff. The Imbens-Kalyanaraman or Calonico-Cattaneo-Titiunik bandwidth selectors balance bias and variance optimally. Validity checks include testing for manipulation of the running variable (McCrary density test) and examining covariate balance across the cutoff.

4.6 Computational Lab

We estimate the ATE in a randomized experiment, implement propensity score matching and IPW for an observational study, and demonstrate a regression discontinuity design.

Causal Inference: ATE, Propensity Scores, IPW, and RDD

Python

script.py201 lines

import numpy as np
np.random.seed(42)

# =============================================
# 1. ATE in a Randomized Experiment
# =============================================
print("=== Randomized Experiment ===")
n = 500
true_ate = 3.0

# Potential outcomes
Y0 = np.random.normal(10, 2, n)
Y1 = Y0 + true_ate + np.random.normal(0, 1, n)  # heterogeneous effects

# Random assignment
W = np.random.binomial(1, 0.5, n)
Y_obs = W * Y1 + (1 - W) * Y0

# Difference in means
ate_hat = np.mean(Y_obs[W==1]) - np.mean(Y_obs[W==0])
n1 = np.sum(W)
n0 = n - n1
se_hat = np.sqrt(np.var(Y_obs[W==1])/n1 + np.var(Y_obs[W==0])/n0)

print("True ATE:              %.4f" % true_ate)
print("Estimated ATE:         %.4f" % ate_hat)
print("Standard error:        %.4f" % se_hat)
print("95%% CI:               (%.4f, %.4f)" % (ate_hat - 1.96*se_hat, ate_hat + 1.96*se_hat))
print("Treated: %d, Control: %d" % (n1, n0))

# Fisher exact p-value (permutation test)
n_perms = 5000
perm_stats = np.zeros(n_perms)
for b in range(n_perms):
    W_perm = np.random.permutation(W)
    perm_stats[b] = np.mean(Y_obs[W_perm==1]) - np.mean(Y_obs[W_perm==0])
p_fisher = np.mean(np.abs(perm_stats) >= np.abs(ate_hat))
print("Fisher p-value:        %.4f" % p_fisher)

# =============================================
# 2. Propensity Score Matching (Observational)
# =============================================
print("")
print("=== Observational Study: Propensity Score Matching ===")
n_obs = 800
X1 = np.random.normal(0, 1, n_obs)
X2 = np.random.normal(0, 1, n_obs)

# Treatment depends on covariates (confounding)
logit_ps = 0.5 + 0.8*X1 + 0.4*X2
true_ps = 1 / (1 + np.exp(-logit_ps))
W_obs = np.random.binomial(1, true_ps)

# Outcome depends on covariates AND treatment
true_effect = 2.5
Y0_obs = 3 + 1.5*X1 + X2 + np.random.normal(0, 1, n_obs)
Y1_obs = Y0_obs + true_effect
Y_obs_data = W_obs * Y1_obs + (1 - W_obs) * Y0_obs

# Naive estimate (biased)
naive_ate = np.mean(Y_obs_data[W_obs==1]) - np.mean(Y_obs_data[W_obs==0])
print("True ATE:          %.4f" % true_effect)
print("Naive estimate:    %.4f (biased due to confounding)" % naive_ate)

# Estimate propensity score via logistic regression
# Simple manual implementation
def logistic_fit(X, y, lr=0.01, n_iter=1000):
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        z = X @ beta
        prob = 1 / (1 + np.exp(-np.clip(z, -20, 20)))
        grad = X.T @ (y - prob) / n
        beta += lr * grad
    return beta

X_design = np.column_stack([np.ones(n_obs), X1, X2])
beta_hat = logistic_fit(X_design, W_obs)
ps_hat = 1 / (1 + np.exp(-X_design @ beta_hat))

print("Estimated PS range: (%.4f, %.4f)" % (ps_hat.min(), ps_hat.max()))

# Nearest-neighbor matching on propensity score
treated_idx = np.where(W_obs == 1)[0]
control_idx = np.where(W_obs == 0)[0]

matched_effects = []
for i in treated_idx:
    dists = np.abs(ps_hat[i] - ps_hat[control_idx])
    j = control_idx[np.argmin(dists)]
    matched_effects.append(Y_obs_data[i] - Y_obs_data[j])

ate_matched = np.mean(matched_effects)
se_matched = np.std(matched_effects) / np.sqrt(len(matched_effects))
print("PS Matched ATE:    %.4f (SE: %.4f)" % (ate_matched, se_matched))

# =============================================
# 3. Inverse Probability Weighting
# =============================================
print("")
print("=== Inverse Probability Weighting ===")

# Trim extreme propensity scores
trim = 0.05
mask_trim = (ps_hat > trim) & (ps_hat < 1 - trim)
n_trimmed = np.sum(~mask_trim)

# IPW estimator
ipw_treated = np.sum(W_obs[mask_trim] * Y_obs_data[mask_trim] / ps_hat[mask_trim]) / np.sum(W_obs[mask_trim] / ps_hat[mask_trim])
ipw_control = np.sum((1-W_obs[mask_trim]) * Y_obs_data[mask_trim] / (1-ps_hat[mask_trim])) / np.sum((1-W_obs[mask_trim]) / (1-ps_hat[mask_trim]))
ate_ipw = ipw_treated - ipw_control

# Horvitz-Thompson (unnormalized)
ate_ht = np.mean(W_obs[mask_trim] * Y_obs_data[mask_trim] / ps_hat[mask_trim] -
                 (1-W_obs[mask_trim]) * Y_obs_data[mask_trim] / (1-ps_hat[mask_trim]))

print("Trimmed observations:  %d" % n_trimmed)
print("IPW (Hajek) ATE:       %.4f" % ate_ipw)
print("IPW (Horvitz-Thompson): %.4f" % ate_ht)

# Check covariate balance after weighting
w_t = W_obs[mask_trim] / ps_hat[mask_trim]
w_c = (1-W_obs[mask_trim]) / (1-ps_hat[mask_trim])
x1_trim = X1[mask_trim]
x2_trim = X2[mask_trim]

bal_x1_before = np.mean(X1[W_obs==1]) - np.mean(X1[W_obs==0])
bal_x1_after = np.average(x1_trim, weights=w_t*W_obs[mask_trim]) - np.average(x1_trim, weights=w_c*(1-W_obs[mask_trim]))

print("")
print("Covariate balance (X1):")
print("  Before weighting: %.4f" % bal_x1_before)
print("  After weighting:  %.4f" % bal_x1_after)

# =============================================
# 4. Regression Discontinuity Design
# =============================================
print("")
print("=== Regression Discontinuity Design ===")
n_rdd = 1000
cutoff = 0.0
true_rdd_effect = 4.0

# Running variable
R = np.random.uniform(-3, 3, n_rdd)
# Treatment: sharp RDD
W_rdd = (R >= cutoff).astype(float)
# Outcome: smooth function of R plus treatment effect
Y_rdd = 2 + 0.5*R + 0.1*R**2 + true_rdd_effect * W_rdd + np.random.normal(0, 1, n_rdd)

# Local linear regression near cutoff
def local_linear_rdd(R, Y, W, cutoff, bandwidth):
    mask = np.abs(R - cutoff) <= bandwidth
    R_local = R[mask]
    Y_local = Y[mask]
    W_local = W[mask]

if np.sum(R_local < cutoff) < 5 or np.sum(R_local >= cutoff) < 5:
        return None, None

left = R_local < cutoff
    right = ~left

# Left side: Y = a0 + a1*(R-c)
    R_left = R_local[left] - cutoff
    Y_left = Y_local[left]
    X_left = np.column_stack([np.ones(np.sum(left)), R_left])
    beta_left = np.linalg.lstsq(X_left, Y_left, rcond=None)[0]

# Right side: Y = b0 + b1*(R-c)
    R_right = R_local[right] - cutoff
    Y_right = Y_local[right]
    X_right = np.column_stack([np.ones(np.sum(right)), R_right])
    beta_right = np.linalg.lstsq(X_right, Y_right, rcond=None)[0]

tau = beta_right[0] - beta_left[0]

res_left = Y_left - X_left @ beta_left
    res_right = Y_right - X_right @ beta_right
    se_left = np.sqrt(np.sum(res_left**2) / (np.sum(left)-2)) / np.sqrt(np.sum(left))
    se_right = np.sqrt(np.sum(res_right**2) / (np.sum(right)-2)) / np.sqrt(np.sum(right))
    se = np.sqrt(se_left**2 + se_right**2)

return tau, se

print("True RDD effect:  %.4f" % true_rdd_effect)
print("")
print("Bandwidth   Estimate   SE       n_local")
for bw in [0.5, 1.0, 1.5, 2.0, 2.5]:
    tau_rdd, se_rdd = local_linear_rdd(R, Y_rdd, W_rdd, cutoff, bw)
    n_local = np.sum(np.abs(R - cutoff) <= bw)
    if tau_rdd is not None:
        print("  %.1f       %7.4f   %.4f   %d" % (bw, tau_rdd, se_rdd, n_local))

# Global polynomial (biased but for comparison)
X_global = np.column_stack([np.ones(n_rdd), R, R**2, W_rdd])
beta_global = np.linalg.lstsq(X_global, Y_rdd, rcond=None)[0]
print("")
print("Global quadratic estimate: %.4f" % beta_global[3])
print("(Intercept=%.3f, R=%.3f, R^2=%.3f)" % (beta_global[0], beta_global[1], beta_global[2]))

Click Run to execute the Python code

Code will be executed with Python 3 on the server

4.7 Summary and Key Takeaways

Potential Outcomes

Causal effects are defined as comparisons of potential outcomes. The fundamental problem is that only one potential outcome is observed per unit. SUTVA and consistency assumptions connect potential and observed outcomes.

Randomized Experiments

Randomization eliminates selection bias and enables unbiased estimation of the ATE. Fisher's randomization inference provides exact p-values under the sharp null.

Propensity Scores and IPW

Under strong ignorability, propensity score methods reduce confounding adjustment to a single dimension. IPW reweights to create balance; doubly robust estimators provide extra protection against misspecification.

Structural Causal Models

DAGs encode causal assumptions graphically. The do-operator formalizes intervention, the backdoor criterion identifies valid adjustment sets, and do-calculus provides a complete identification algorithm.

IV and RDD

Instrumental variables identify LATE among compliers when ignorability fails. Regression discontinuity exploits treatment assignment cutoffs, requiring only local continuity for identification.

← Nonparametric Methods Course Overview →

Share:X Reddit LinkedIn