Hypothesis Testing
Neyman–Pearson theory, likelihood ratio tests, common tests, multiple testing, and asymptotic methods
Historical Context
The modern theory of hypothesis testing was formalized by Jerzy Neyman and Egon Pearson in their landmark 1933 paper, building on earlier work by Karl Pearson (the chi-squared test, 1900) and Ronald Fisher (significance testing and the p-value concept, 1920s). The Neyman–Pearson lemma provided the first optimality result for testing. Student (William Sealy Gosset) introduced the t-test in 1908 while working at the Guinness brewery. The problem of multiple comparisons was addressed by Carlo Emilio Bonferroni (1936), and the modern False Discovery Rate (FDR) framework was introduced by Yoav Benjamini and Yosef Hochberg in their influential 1995 paper, which has become one of the most cited papers in statistics.
4.1 The Neyman–Pearson Framework
A hypothesis test involves a null hypothesis $H_0$ and an alternative $H_1$. A test $\phi(\mathbf{x}) \in \{0, 1\}$ specifies whether to reject $H_0$ for each data outcome.
Error Types and Power
- Type I error (size): $\alpha = P(\text{reject } H_0 \mid H_0 \text{ true})$
- Type II error: $\beta = P(\text{fail to reject } H_0 \mid H_1 \text{ true})$
- Power: $1 - \beta = P(\text{reject } H_0 \mid H_1 \text{ true})$
The power function is $\pi(\theta) = P_\theta(\text{reject } H_0)$. A test has level $\alpha$ if $\sup_{\theta \in \Theta_0} \pi(\theta) \le \alpha$.
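For a one-sided z-test the power function has a closed form, so it can be evaluated directly. A minimal standard-library sketch (the defaults for $n$, $\sigma$, and $\alpha$ are illustrative choices, not from the text):

```python
from statistics import NormalDist

def z_test_power(mu, mu0=0.0, sigma=1.0, n=25, alpha=0.05):
    """Power pi(mu) of the one-sided z-test of H0: mu = mu0 vs H1: mu > mu0."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha)          # rejection threshold for Z
    shift = (mu - mu0) / (sigma / n**0.5)  # standardized effect size
    return 1 - z.cdf(z_crit - shift)       # P(reject | true mean mu)

print(round(z_test_power(0.0), 3))  # 0.05  (at the null, power equals the size)
print(round(z_test_power(0.5), 3))  # power grows with the effect size
```

Evaluating `z_test_power` over a grid of `mu` values traces out the power curve $\pi(\theta)$.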
Neyman–Pearson Lemma
For testing $H_0: \theta = \theta_0$ vs. $H_1: \theta = \theta_1$ (simple vs. simple), the most powerful level-$\alpha$ test rejects when the likelihood ratio exceeds a threshold:

$$\frac{L(\theta_1 \mid \mathbf{x})}{L(\theta_0 \mid \mathbf{x})} > k,$$

where $k$ is chosen so that $P_{\theta_0}(\text{reject}) = \alpha$. This test is the most powerful among all level-$\alpha$ tests of these simple hypotheses. For one-sided alternatives in exponential families, uniformly most powerful (UMP) tests exist; for two-sided alternatives, they generally do not.
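A quick Monte Carlo check of the lemma for one observation from $N(0,1)$ vs. $N(1,1)$ (an illustrative pair of simple hypotheses): here the likelihood ratio is monotone in $x$, so "LR $> k$" is equivalent to "$x > c$" with $c$ set for size $\alpha$.

```python
import random
from statistics import NormalDist

random.seed(0)
alpha = 0.05
c = NormalDist().inv_cdf(1 - alpha)   # threshold giving size alpha

def reject(x):
    # Neyman-Pearson test for H0: N(0,1) vs H1: N(1,1), single observation
    return x > c

trials = 100_000
size  = sum(reject(random.gauss(0, 1)) for _ in range(trials)) / trials
power = sum(reject(random.gauss(1, 1)) for _ in range(trials)) / trials
print(size, power)   # size ~ 0.05, power ~ 0.26
```

The simulated size matches $\alpha$, and the power agrees with the closed form $1 - \Phi(c - 1)$.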
4.2 Likelihood Ratio Tests
Generalized Likelihood Ratio
For composite hypotheses $H_0: \theta \in \Theta_0$ vs. $H_1: \theta \in \Theta_0^c$, define

$$\Lambda(\mathbf{x}) = \frac{\sup_{\theta \in \Theta_0} L(\theta \mid \mathbf{x})}{\sup_{\theta \in \Theta} L(\theta \mid \mathbf{x})}.$$

Reject $H_0$ when $\Lambda$ is small (equivalently, when $-2\log\Lambda$ is large).
Wilks' Theorem
Under regularity conditions and under $H_0$:

$$-2\log\Lambda \xrightarrow{d} \chi^2_r,$$
where $r = \dim(\Theta) - \dim(\Theta_0)$ is the number of constrained parameters. This provides the asymptotic null distribution of the LRT statistic, enabling approximate p-value computation.
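As a concrete case, consider testing $H_0: \lambda = \lambda_0$ in a Poisson($\lambda$) model, where $r = 1$. The sketch below (illustrative data; standard-library only, using the identity $P(\chi^2_1 \le x) = 2\Phi(\sqrt{x}) - 1$ for the p-value) computes $-2\log\Lambda$ from the log-likelihood $\ell(\lambda) = s\log\lambda - n\lambda + \text{const}$:

```python
import math
from statistics import NormalDist

def poisson_lrt(xs, lam0):
    """LRT statistic and asymptotic p-value for H0: lambda = lam0."""
    n, s = len(xs), sum(xs)
    lam_hat = s / n                                    # unrestricted MLE
    # -2 log Lambda = 2 [ l(lam_hat) - l(lam0) ]; the constant term cancels
    stat = 2 * ((s * math.log(lam_hat) - n * lam_hat)
              - (s * math.log(lam0)    - n * lam0))
    # Wilks: chi-squared with 1 df; P(chi2_1 <= x) = 2*Phi(sqrt(x)) - 1
    p = 2 * (1 - NormalDist().cdf(math.sqrt(stat)))
    return stat, p

stat, p = poisson_lrt([3, 5, 2, 4, 6, 3, 4, 5], lam0=2.0)
print(round(stat, 3), round(p, 5))
```

Here the sample mean is 4, far from $\lambda_0 = 2$, so the statistic is large and the p-value small.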
4.3 Common Hypothesis Tests
Z-Test (Known Variance)
Testing $H_0: \mu = \mu_0$ with known $\sigma^2$:

$$Z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}} \sim N(0, 1) \quad \text{under } H_0.$$
Student's T-Test (Unknown Variance)
Testing $H_0: \mu = \mu_0$ with unknown $\sigma^2$:

$$T = \frac{\bar{X} - \mu_0}{S/\sqrt{n}} \sim t_{n-1} \quad \text{under } H_0,$$
where $S^2 = \frac{1}{n-1}\sum(X_i - \bar{X})^2$. The two-sample t-test compares means of two groups. For unequal variances, the Welch approximation adjusts the degrees of freedom.
Chi-Squared and F Tests
Chi-squared goodness-of-fit: For testing observed counts $O_i$ against expected counts $E_i$ over $k$ categories:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \xrightarrow{d} \chi^2_{k - 1 - p},$$

where $p$ is the number of estimated parameters.
F-test: For comparing two variances or in ANOVA, $F = \frac{S_1^2 / \sigma_1^2}{S_2^2 / \sigma_2^2} \sim F_{n_1 - 1, n_2 - 1}$ under $H_0: \sigma_1^2 = \sigma_2^2$.
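A goodness-of-fit example with SciPy (assuming it is available), using invented counts for a die suspected of being loaded:

```python
from scipy import stats

observed = [18, 22, 16, 14, 12, 38]   # 120 rolls (illustrative data)
expected = [20] * 6                   # fair-die expectation, E_i = 120/6

# No parameters are estimated here, so df = k - 1 = 5
chi2, p = stats.chisquare(observed, f_exp=expected)
print(round(chi2, 2), round(p, 5))    # 22.4 and a very small p-value
```

The excess of sixes drives the statistic well past the $\chi^2_5$ critical value, so the fair-die hypothesis is rejected.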
4.4 Multiple Testing Corrections
When performing $m$ simultaneous tests, the probability of at least one false rejection increases dramatically. Two key error rates:
Family-Wise Error Rate (FWER)
$\text{FWER} = P(\text{at least one false rejection})$.
Bonferroni correction: reject the $i$-th hypothesis if $p_i \le \alpha / m$. This controls FWER at level $\alpha$ by the union bound, but is conservative when tests are positively correlated.
Holm–Bonferroni: a step-down procedure that is uniformly more powerful than Bonferroni. Order the p-values $p_{(1)} \le \cdots \le p_{(m)}$ and reject $H_{(j)}$ for all $j < k$, where $k = \min\{j : p_{(j)} > \alpha / (m - j + 1)\}$ (reject all hypotheses if no such $j$ exists). Like Bonferroni, it controls FWER without independence assumptions.
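A minimal pure-Python sketch of the Holm step-down procedure (the p-values are invented for illustration):

```python
def holm_reject(pvals, alpha=0.05):
    """Return a boolean rejection decision for each hypothesis (Holm step-down)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices sorted by p-value
    reject = [False] * m
    for rank, i in enumerate(order):
        if pvals[i] <= alpha / (m - rank):   # thresholds alpha/m, alpha/(m-1), ...
            reject[i] = True
        else:
            break                            # step-down: stop at the first failure
    return reject

pvals = [0.001, 0.011, 0.021, 0.04, 0.3]
print(holm_reject(pvals))   # [True, True, False, False, False]
```

Plain Bonferroni at $\alpha/m = 0.01$ would reject only the first hypothesis here; Holm also rejects the second, illustrating its extra power.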
False Discovery Rate (FDR)
$\text{FDR} = E\!\left[\frac{V}{R \vee 1}\right]$ where $V$ is the number of false discoveries and $R$ is total rejections.
Benjamini–Hochberg (BH) procedure: Order p-values and find $k = \max\{j : p_{(j)} \le \frac{j}{m}\alpha\}$; reject all $H_{(1)}, \ldots, H_{(k)}$. The BH procedure controls FDR at level $\alpha$ under independence (and under certain positive dependence conditions). FDR control is more appropriate than FWER control in large-scale testing scenarios such as genomics.
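The BH step-up rule is also a few lines of pure Python (illustrative p-values):

```python
def bh_reject(pvals, alpha=0.05):
    """Benjamini-Hochberg step-up: reject all hypotheses with the k smallest
    p-values, where k is the largest j with p_(j) <= (j/m) * alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0
    for j, i in enumerate(order, start=1):
        if pvals[i] <= (j / m) * alpha:
            k = j                          # step-up: remember the largest passing j
    reject = [False] * m
    for i in order[:k]:
        reject[i] = True
    return reject

pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(sum(bh_reject(pvals)))   # number of discoveries at FDR level 0.05
```

Note the step-up scan keeps the *largest* passing index, so a hypothesis can be rejected even if some smaller p-value's comparison failed earlier thresholds.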
4.5 Asymptotic Tests: Wald, Score, and LRT
For testing $H_0: \theta = \theta_0$ with MLE $\hat{\theta}_n$ and Fisher information $I(\theta)$, three asymptotically equivalent tests arise:
Wald Test
$$W = n I(\hat{\theta}_n)(\hat{\theta}_n - \theta_0)^2 \xrightarrow{d} \chi^2_1 \quad \text{under } H_0.$$

Evaluated at the MLE; does not require computation under $H_0$.
Score (Rao) Test
$$S = \frac{U(\theta_0)^2}{n I(\theta_0)} \xrightarrow{d} \chi^2_1 \quad \text{under } H_0,$$

where $U(\theta) = \frac{\partial}{\partial\theta}\ell(\theta)$ is the score function. Evaluated at $\theta_0$; does not require the MLE. Useful when the MLE is difficult to compute.
Likelihood Ratio Test (LRT)
$$-2\log\Lambda = 2\left[\ell(\hat{\theta}_n) - \ell(\theta_0)\right] \xrightarrow{d} \chi^2_1 \quad \text{under } H_0.$$

Requires both the restricted and unrestricted MLEs. All three statistics are asymptotically equivalent under $H_0$ and under local alternatives, but they may differ in finite samples. The LRT is generally preferred for its invariance to reparameterization.
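The three statistics can be compared directly in the Bernoulli model, where all have closed forms. A sketch for $H_0: p = p_0$ with $s$ successes in $n$ trials (the counts are illustrative):

```python
import math

def three_tests(s, n, p0):
    """Wald, score, and LRT statistics for H0: p = p0 in a Bernoulli model.
    All three are asymptotically chi-squared with 1 df under H0."""
    p_hat = s / n
    wald  = n * (p_hat - p0)**2 / (p_hat * (1 - p_hat))   # variance at the MLE
    score = n * (p_hat - p0)**2 / (p0 * (1 - p0))          # variance at the null
    lrt   = 2 * (s * math.log(p_hat / p0)
               + (n - s) * math.log((1 - p_hat) / (1 - p0)))
    return wald, score, lrt

w, sc, l = three_tests(s=62, n=100, p0=0.5)
print(round(w, 3), round(sc, 3), round(l, 3))   # close, but not identical
```

With 62 successes in 100 trials, the three statistics agree to first order yet differ in the second decimal place, illustrating the finite-sample discrepancy noted above.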
Computational Laboratory
We perform power analysis, demonstrate the t-test and chi-squared test, and compare multiple testing correction methods.
Power Analysis, t-Test, Chi-Squared & Multiple Testing
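A compact simulation in the spirit of this lab (all settings are illustrative and assume SciPy is available): run many one-sample t-tests, a minority with a genuine effect, and compare how many discoveries Bonferroni and Benjamini–Hochberg make.

```python
import random
from scipy import stats

random.seed(1)
m, n, alpha = 200, 30, 0.05
effects = [0.8] * 40 + [0.0] * 160     # true means: 40 signals, 160 nulls

# p-values from one-sample t-tests of H0: mu = 0
pvals = []
for mu in effects:
    sample = [random.gauss(mu, 1) for _ in range(n)]
    pvals.append(stats.ttest_1samp(sample, popmean=0.0).pvalue)

# Bonferroni: fixed per-test threshold alpha/m
bonf = sum(p <= alpha / m for p in pvals)

# Benjamini-Hochberg: largest j with p_(j) <= (j/m) * alpha
ranked = sorted(range(m), key=lambda i: pvals[i])
k = max((j for j in range(1, m + 1) if pvals[ranked[j - 1]] <= j / m * alpha),
        default=0)
print(f"Bonferroni discoveries: {bonf}, BH discoveries: {k}")
```

BH typically recovers noticeably more of the 40 true signals than Bonferroni at the same nominal level, which is the practical argument for FDR control in large-scale testing.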