Hypothesis Testing
Neyman–Pearson theory, likelihood ratio tests, common tests, multiple testing, and asymptotic methods
Historical Context
The modern theory of hypothesis testing was formalized by Jerzy Neyman and Egon Pearson in their landmark 1933 paper, building on earlier work by Karl Pearson (the chi-squared test, 1900) and Ronald Fisher (significance testing and the p-value concept, 1920s). The Neyman–Pearson lemma provided the first optimality result for testing. Student (William Sealy Gosset) introduced the t-test in 1908 while working at the Guinness brewery. The problem of multiple comparisons was addressed by Carlo Emilio Bonferroni (1936), and the modern False Discovery Rate (FDR) framework was introduced by Yoav Benjamini and Yosef Hochberg in their influential 1995 paper, which has become one of the most cited papers in statistics.
4.1 The Neyman–Pearson Framework
A hypothesis test involves a null hypothesis $H_0$ and an alternative $H_1$. A test $\phi(\mathbf{x}) \in \{0, 1\}$ specifies whether to reject $H_0$ for each data outcome.
Error Types and Power
- Type I error (size): $\alpha = P(\text{reject } H_0 \mid H_0 \text{ true})$
- Type II error: $\beta = P(\text{fail to reject } H_0 \mid H_1 \text{ true})$
- Power: $1 - \beta = P(\text{reject } H_0 \mid H_1 \text{ true})$
The power function is $\pi(\theta) = P_\theta(\text{reject } H_0)$. A test has level $\alpha$ if $\sup_{\theta \in \Theta_0} \pi(\theta) \le \alpha$.
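For a one-sided z-test the power function has a closed form, so it can be evaluated directly. A minimal standard-library sketch (the defaults for $n$, $\sigma$, and $\alpha$ are illustrative choices, not from the text):

```python
from statistics import NormalDist

def z_test_power(mu, mu0=0.0, sigma=1.0, n=25, alpha=0.05):
    """Power pi(mu) of the one-sided z-test of H0: mu = mu0 vs H1: mu > mu0."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha)          # rejection threshold for Z
    shift = (mu - mu0) / (sigma / n**0.5)  # standardized effect size
    return 1 - z.cdf(z_crit - shift)       # P(reject | true mean mu)

print(round(z_test_power(0.0), 3))  # 0.05  (at the null, power equals the size)
print(round(z_test_power(0.5), 3))  # power grows with the effect size
```

Evaluating `z_test_power` over a grid of `mu` values traces out the power curve $\pi(\theta)$.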
Neyman–Pearson Lemma
For testing $H_0: \theta = \theta_0$ vs. $H_1: \theta = \theta_1$ (simple vs. simple), the most powerful level-$\alpha$ test rejects when the likelihood ratio exceeds a threshold:

$$\frac{L(\theta_1 \mid \mathbf{x})}{L(\theta_0 \mid \mathbf{x})} > k,$$

where $k$ is chosen so that $P_{\theta_0}(\text{reject}) = \alpha$. This test is the most powerful among all level-$\alpha$ tests of these simple hypotheses. For one-sided alternatives in exponential families, uniformly most powerful (UMP) tests exist; for two-sided alternatives, they generally do not.
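A quick Monte Carlo check of the lemma for one observation from $N(0,1)$ vs. $N(1,1)$ (an illustrative pair of simple hypotheses): here the likelihood ratio is monotone in $x$, so "LR $> k$" is equivalent to "$x > c$" with $c$ set for size $\alpha$.

```python
import random
from statistics import NormalDist

random.seed(0)
alpha = 0.05
c = NormalDist().inv_cdf(1 - alpha)   # threshold giving size alpha

def reject(x):
    # Neyman-Pearson test for H0: N(0,1) vs H1: N(1,1), single observation
    return x > c

trials = 100_000
size  = sum(reject(random.gauss(0, 1)) for _ in range(trials)) / trials
power = sum(reject(random.gauss(1, 1)) for _ in range(trials)) / trials
print(size, power)   # size ~ 0.05, power ~ 0.26
```

The simulated size matches $\alpha$, and the power agrees with the closed form $1 - \Phi(c - 1)$.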
4.2 Likelihood Ratio Tests
Generalized Likelihood Ratio
For composite hypotheses $H_0: \theta \in \Theta_0$ vs. $H_1: \theta \in \Theta_0^c$, define

$$\Lambda(\mathbf{x}) = \frac{\sup_{\theta \in \Theta_0} L(\theta \mid \mathbf{x})}{\sup_{\theta \in \Theta} L(\theta \mid \mathbf{x})}.$$

Reject $H_0$ when $\Lambda$ is small (equivalently, when $-2\log\Lambda$ is large).
Wilks' Theorem
Under regularity conditions and under $H_0$:

$$-2\log\Lambda \xrightarrow{d} \chi^2_r,$$
where $r = \dim(\Theta) - \dim(\Theta_0)$ is the number of constrained parameters. This provides the asymptotic null distribution of the LRT statistic, enabling approximate p-value computation.
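As a concrete case, consider testing $H_0: \lambda = \lambda_0$ in a Poisson($\lambda$) model, where $r = 1$. The sketch below (illustrative data; standard-library only, using the identity $P(\chi^2_1 \le x) = 2\Phi(\sqrt{x}) - 1$ for the p-value) computes $-2\log\Lambda$ from the log-likelihood $\ell(\lambda) = s\log\lambda - n\lambda + \text{const}$:

```python
import math
from statistics import NormalDist

def poisson_lrt(xs, lam0):
    """LRT statistic and asymptotic p-value for H0: lambda = lam0."""
    n, s = len(xs), sum(xs)
    lam_hat = s / n                                    # unrestricted MLE
    # -2 log Lambda = 2 [ l(lam_hat) - l(lam0) ]; the constant term cancels
    stat = 2 * ((s * math.log(lam_hat) - n * lam_hat)
              - (s * math.log(lam0)    - n * lam0))
    # Wilks: chi-squared with 1 df; P(chi2_1 <= x) = 2*Phi(sqrt(x)) - 1
    p = 2 * (1 - NormalDist().cdf(math.sqrt(stat)))
    return stat, p

stat, p = poisson_lrt([3, 5, 2, 4, 6, 3, 4, 5], lam0=2.0)
print(round(stat, 3), round(p, 5))
```

Here the sample mean is 4, far from $\lambda_0 = 2$, so the statistic is large and the p-value small.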
4.3 Common Hypothesis Tests
Z-Test (Known Variance)
Testing $H_0: \mu = \mu_0$ with known $\sigma^2$:

$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1) \quad \text{under } H_0.$$
Student's T-Test (Unknown Variance)
Testing $H_0: \mu = \mu_0$ with unknown $\sigma^2$:

$$T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t_{n-1} \quad \text{under } H_0,$$
where $S^2 = \frac{1}{n-1}\sum(X_i - \bar{X})^2$. The two-sample t-test compares means of two groups. For unequal variances, the Welch approximation adjusts the degrees of freedom.
Chi-Squared and F Tests
Chi-squared goodness-of-fit: For testing observed counts $O_i$ against expected counts $E_i$ over $k$ categories:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \xrightarrow{d} \chi^2_{k - 1 - p},$$

where $p$ is the number of estimated parameters.
F-test: For comparing two variances or in ANOVA, $F = \frac{S_1^2 / \sigma_1^2}{S_2^2 / \sigma_2^2} \sim F_{n_1 - 1, n_2 - 1}$ under $H_0: \sigma_1^2 = \sigma_2^2$.
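A goodness-of-fit example with SciPy (assuming it is available), using invented counts for a die suspected of being loaded:

```python
from scipy import stats

observed = [18, 22, 16, 14, 12, 38]   # 120 rolls (illustrative data)
expected = [20] * 6                   # fair-die expectation, E_i = 120/6

# No parameters are estimated here, so df = k - 1 = 5
chi2, p = stats.chisquare(observed, f_exp=expected)
print(round(chi2, 2), round(p, 5))    # 22.4 and a very small p-value
```

The excess of sixes drives the statistic well past the $\chi^2_5$ critical value, so the fair-die hypothesis is rejected.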
4.4 Multiple Testing Corrections
When performing $m$ simultaneous tests, the probability of at least one false rejection increases dramatically. Two key error rates:
Family-Wise Error Rate (FWER)
$\text{FWER} = P(\text{at least one false rejection})$.
Bonferroni correction: reject the $i$-th hypothesis if $p_i \le \alpha / m$. This controls FWER at level $\alpha$ by the union bound, but is conservative when tests are positively correlated.
Holm–Bonferroni: a step-down procedure that is uniformly more powerful than Bonferroni. Order the p-values $p_{(1)} \le \cdots \le p_{(m)}$ and reject $H_{(j)}$ for all $j < k$, where $k = \min\{j : p_{(j)} > \alpha / (m - j + 1)\}$ (reject all hypotheses if no such $j$ exists). Like Bonferroni, it controls FWER without independence assumptions.
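A minimal pure-Python sketch of the Holm step-down procedure (the p-values are invented for illustration):

```python
def holm_reject(pvals, alpha=0.05):
    """Return a boolean rejection decision for each hypothesis (Holm step-down)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices sorted by p-value
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):   # thresholds alpha/m, alpha/(m-1), ...
            reject[i] = True
        else:
            break                            # step-down: stop at the first failure
    return reject

pvals = [0.001, 0.011, 0.021, 0.04, 0.3]
print(holm_reject(pvals))   # [True, True, False, False, False]
```

Plain Bonferroni at $\alpha/m = 0.01$ would reject only the first hypothesis here; Holm also rejects the second, illustrating its extra power.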
False Discovery Rate (FDR)
$\text{FDR} = E\!\left[\frac{V}{R \vee 1}\right]$ where $V$ is the number of false discoveries and $R$ is total rejections.
Benjamini–Hochberg (BH) procedure: Order p-values and find $k = \max\{j : p_{(j)} \le \frac{j}{m}\alpha\}$; reject all $H_{(1)}, \ldots, H_{(k)}$. The BH procedure controls FDR at level $\alpha$ under independence (and under certain positive dependence conditions). FDR control is more appropriate than FWER control in large-scale testing scenarios such as genomics.
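The BH step-up rule is also a few lines of pure Python (illustrative p-values):

```python
def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up: reject all hypotheses with the k smallest
    p-values, where k is the largest j with p_(j) <= (j/m) * alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for j, i in enumerate(order, start=1):
        if pvals[i] <= (j / m) * alpha:
            k = j                          # step-up: remember the largest passing j
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(bh_reject(pvals)))   # number of discoveries at FDR level 0.05
```

Note the step-up scan keeps the *largest* passing index, so a hypothesis can be rejected even if some smaller p-value's comparison failed earlier thresholds.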
4.5 Asymptotic Tests: Wald, Score, and LRT
For testing $H_0: \theta = \theta_0$ with MLE $\hat{\theta}_n$ and Fisher information $I(\theta)$, three asymptotically equivalent tests arise:
Wald Test
$$W = n I(\hat{\theta}_n)(\hat{\theta}_n - \theta_0)^2 \xrightarrow{d} \chi^2_1 \quad \text{under } H_0.$$

Evaluated at the MLE; does not require computation under $H_0$.
Score (Rao) Test
$$S = \frac{U(\theta_0)^2}{n I(\theta_0)} \xrightarrow{d} \chi^2_1 \quad \text{under } H_0,$$

where $U(\theta) = \frac{\partial}{\partial\theta}\ell(\theta)$ is the score function. Evaluated at $\theta_0$; does not require the MLE. Useful when the MLE is difficult to compute.
Likelihood Ratio Test (LRT)
$$-2\log\Lambda = 2\left[\ell(\hat{\theta}_n) - \ell(\theta_0)\right] \xrightarrow{d} \chi^2_1 \quad \text{under } H_0.$$

Requires both the restricted and unrestricted MLEs. All three statistics are asymptotically equivalent under $H_0$ and under local alternatives, but they may differ in finite samples. The LRT is generally preferred for its invariance to reparameterization.
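The three statistics can be compared directly in the Bernoulli model, where all have closed forms. A sketch for $H_0: p = p_0$ with $s$ successes in $n$ trials (the counts are illustrative):

```python
import math

def three_tests(s, n, p0):
    """Wald, score, and LRT statistics for H0: p = p0 in a Bernoulli model.
    All three are asymptotically chi-squared with 1 df under H0."""
    p_hat = s / n
    wald  = n * (p_hat - p0)**2 / (p_hat * (1 - p_hat))   # variance at the MLE
    score = n * (p_hat - p0)**2 / (p0 * (1 - p0))          # variance at the null
    lrt   = 2 * (s * math.log(p_hat / p0)
               + (n - s) * math.log((1 - p_hat) / (1 - p0)))
    return wald, score, lrt

w, sc, l = three_tests(s=62, n=100, p0=0.5)
print(round(w, 3), round(sc, 3), round(l, 3))   # close, but not identical
```

With 62 successes in 100 trials, the three statistics agree to first order yet differ in the second decimal place, illustrating the finite-sample discrepancy noted above.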
Computational Laboratory
We perform power analysis, demonstrate the t-test and chi-squared test, and compare multiple testing correction methods.
Power Analysis, t-Test, Chi-Squared & Multiple Testing
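A compact simulation in the spirit of this lab (all settings are illustrative and assume SciPy is available): run many one-sample t-tests, a minority with a genuine effect, and compare how many discoveries Bonferroni and Benjamini–Hochberg make.

```python
import random
from scipy import stats

random.seed(1)
m, n, alpha = 200, 30, 0.05
effects = [0.8] * 40 + [0.0] * 160     # true means: 40 signals, 160 nulls

# p-values from one-sample t-tests of H0: mu = 0
pvals = []
for mu in effects:
    sample = [random.gauss(mu, 1) for _ in range(n)]
    pvals.append(stats.ttest_1samp(sample, popmean=0.0).pvalue)

# Bonferroni: fixed per-test threshold alpha/m
bonf = sum(p <= alpha / m for p in pvals)

# Benjamini-Hochberg: largest j with p_(j) <= (j/m) * alpha
ranked = sorted(range(m), key=lambda i: pvals[i])
k = max((j for j in range(1, m + 1) if pvals[ranked[j - 1]] <= j / m * alpha),
        default=0)
print(f"Bonferroni discoveries: {bonf}, BH discoveries: {k}")
```

BH typically recovers noticeably more of the 40 true signals than Bonferroni at the same nominal level, which is the practical argument for FDR control in large-scale testing.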