Chapter 21: Statistical Inference in Science
The philosophical foundations of the statistical methods that scientists actually use — and what the replication crisis reveals about their misuse.
The abstract philosophical debates about confirmation and induction acquire urgent practical significance when they are translated into the statistical methods that scientists use to analyse data. In the modern sciences — from particle physics to psychology, from genomics to economics — statistical inference is the principal tool by which hypotheses are evaluated and claims are adjudicated. Yet the philosophical foundations of these methods are far less secure than most practising scientists realise.
The dominant statistical paradigm in 20th-century science was a hybrid of two distinct frequentist approaches: R.A. Fisher’s significance testing and the Neyman-Pearson framework of hypothesis testing. These were developed independently, are based on different philosophical assumptions, and are in important respects incompatible. Yet they were merged into a single framework that most scientists learn as “the” method of statistical testing — a hybrid that neither Fisher nor Neyman would have endorsed.
The ongoing replication crisis — the discovery that a disturbing proportion of published scientific findings fail to replicate — has brought these philosophical issues to the forefront. Understanding the logic (and the illogic) of statistical inference is no longer a luxury for philosophers; it is a necessity for anyone who cares about the integrity of science.
Frequentist Statistics: Neyman-Pearson Hypothesis Testing
Jerzy Neyman and Egon Pearson (1933) developed a decision-theoretic framework for choosing between two hypotheses: the null hypothesis $H_0$ and the alternative hypothesis $H_1$. Their approach is based on controlling two types of error:
Type I error ($\alpha$): Rejecting $H_0$ when it is true (a “false positive”).
Type II error ($\beta$): Failing to reject $H_0$ when $H_1$ is true (a “false negative”).
The statistician chooses a significance level $\alpha$ (conventionally 0.05) and constructs a test that maximises the power $1 - \beta$ — the probability of correctly rejecting $H_0$ when $H_1$ is true. The Neyman-Pearson lemma shows that, for simple hypotheses, the most powerful test at a given significance level is based on the likelihood ratio:
$$\Lambda = \frac{P(\text{data} | H_1)}{P(\text{data} | H_0)}$$
Philosophically, the Neyman-Pearson framework is a behavioural theory, not an epistemic one. It does not tell you what to believe about the hypotheses; it prescribes a procedure that controls error rates in the long run. If you always reject $H_0$ when $p < 0.05$, then in the long run you will reject a true null hypothesis at most 5% of the time. (Note that this is not the same as saying that at most 5% of your rejections are false positives; that proportion depends also on how often $H_0$ is in fact true.)
“No test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of a hypothesis. But we may look at the purpose of tests from another viewpoint. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behaviour.” — Jerzy Neyman and Egon Pearson (1933)
This behavioural interpretation is both a strength and a limitation. It frees the framework from controversial assumptions about prior probabilities, but it also means that the result of a single test does not, strictly speaking, tell you anything about the truth of the hypotheses. The significance of this limitation is often lost on practising scientists.
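The long-run character of these guarantees can be made vivid with a small simulation (a sketch only; the scenario — a one-sided z-test on a normal mean with known variance, true effect 0.5, and sample size 25 — is invented for illustration):

```python
import math
import random

def z_test_rejects(sample, mu0=0.0, sigma=1.0):
    """One-sided z-test of H0: mu = mu0 against H1: mu > mu0 at alpha = 0.05."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    return z > 1.645  # upper critical value for alpha = 0.05

random.seed(0)
trials, n = 20_000, 25

# Under H0 (true mean 0) every rejection is a Type I error.
type1 = sum(z_test_rejects([random.gauss(0.0, 1.0) for _ in range(n)])
            for _ in range(trials)) / trials

# Under H1 (true mean 0.5) the rejection rate estimates the power 1 - beta.
power = sum(z_test_rejects([random.gauss(0.5, 1.0) for _ in range(n)])
            for _ in range(trials)) / trials

print(f"Type I error rate ~ {type1:.3f} (nominal 0.05)")
print(f"Power ~ {power:.3f}")
```

The simulated Type I error rate hovers near the nominal 0.05 — a property of the procedure over many repetitions, which is exactly what the behavioural interpretation claims, and all that it claims.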
Fisher’s Significance Testing: P-values and Their Interpretation
R.A. Fisher (1925, 1935) developed a conceptually different approach. For Fisher, statistical testing is not a decision procedure but a method for assessing the strength of evidence against a null hypothesis. The central concept is the p-value:
$$p\text{-value} = P(\text{data as extreme or more extreme} \mid H_0)$$
A small p-value indicates that the observed data would be improbable under the null hypothesis, providing evidence against it. Fisher did not advocate a fixed threshold; rather, he proposed that p-values be interpreted as a continuous measure of evidential strength.
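As a concrete illustration (the coin-flip scenario is invented for the purpose), the one-sided p-value for observing 15 or more heads in 20 flips of a putatively fair coin can be computed exactly from the binomial distribution:

```python
from math import comb

def binomial_p_value(k, n, p0=0.5):
    """P(K >= k | H0: success probability is p0) — the one-sided p-value."""
    return sum(comb(n, j) * p0**j * (1 - p0)**(n - j) for j in range(k, n + 1))

# 15 or more heads in 20 flips, under H0: the coin is fair
p = binomial_p_value(15, 20)
print(f"p-value = {p:.4f}")  # roughly 0.0207
```

Note that the sum runs over all outcomes *as extreme or more extreme* than the one observed, exactly as the definition above requires.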
Crucially, the p-value is not the probability that the null hypothesis is true. This is perhaps the single most common misinterpretation in all of science. The p-value is $P(\text{data} | H_0)$, not $P(H_0 | \text{data})$. To obtain the latter, one would need Bayes’ theorem and a prior probability for $H_0$ — exactly what the frequentist rejects.
“The p-value is not the probability that the null hypothesis is true, or the probability that the alternative hypothesis is false. It is not the probability that the observed effects were produced by random chance alone.” — American Statistical Association Statement on P-values (2016)
Fisher’s approach differs from Neyman-Pearson in crucial ways: Fisher does not require a specified alternative hypothesis, does not speak of Type II errors or power, and treats the p-value as evidence rather than as input to a binary decision. The textbook procedure of “reject if $p < 0.05$” is a bastardisation of both approaches — it combines Fisher’s p-value with Neyman-Pearson’s decision threshold in a way that neither statistician intended.
Gigerenzer (2004) has documented how this hybrid was propagated through statistics textbooks as though it were a single coherent method, creating what he calls the “null ritual” — a mindless procedure that often substitutes for genuine scientific reasoning.
The P-value Crisis and the Replication Crisis
The replication crisis refers to the discovery, beginning around 2011, that a disturbingly large proportion of published scientific findings — particularly in psychology, but also in biomedicine, economics, and other fields — fail to replicate when independent researchers attempt to reproduce them.
The Open Science Collaboration (2015) attempted to replicate 100 psychology studies published in leading journals. Only 36% produced statistically significant results in the same direction as the original, and the mean effect size was roughly half the original. Similar findings have emerged in cancer biology (Begley and Ellis, 2012) and economics (Camerer et al., 2016).
Several statistical pathologies contribute to this crisis:
- P-hacking: Trying multiple analyses and reporting only the one that produces $p < 0.05$. If you run 20 independent tests of true null hypotheses at $\alpha = 0.05$, you expect one false positive on average.
- HARKing: Hypothesising After Results are Known — presenting post hoc hypotheses as if they were a priori predictions.
- Publication bias: Journals preferentially publish statistically significant results, creating a “file drawer” of null results that never see the light of day.
- Low statistical power: Many studies are too small to reliably detect the effects they seek. When combined with publication bias, low power paradoxically increases the rate of false positives among published results (Ioannidis, 2005).
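The multiple-testing arithmetic behind p-hacking is elementary. Assuming independent tests of true null hypotheses, the probability of at least one false positive grows rapidly with the number of tests:

```python
def family_wise_error(m, alpha=0.05):
    """P(at least one false positive) across m independent tests of true nulls."""
    return 1 - (1 - alpha) ** m

for m in (1, 5, 20, 100):
    print(f"{m:>3} tests: P(at least one false positive) = {family_wise_error(m):.2f}")
```

With 20 tests the chance of at least one spurious “significant” result is about 64% — which is why an analysis selected after the fact from many attempts carries almost no evidential weight.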
“It can be proven that most claimed research findings are false.” — John Ioannidis, “Why Most Published Research Findings Are False” (2005)
Ioannidis’s provocative argument uses a Bayesian framework to show that when the prior probability of a hypothesis being true is low (as in exploratory research), even a test with $p < 0.05$ is more likely to be a false positive than a true positive. This is a straightforward consequence of Bayes’ theorem and the base rate fallacy — yet it was a revelation to much of the scientific community.
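The core of Ioannidis’s argument is a base-rate calculation, sketched below with illustrative numbers (the conventional $\alpha = 0.05$ and a typical-of-the-literature power of 0.8; the priors are hypothetical):

```python
def positive_predictive_value(prior, alpha=0.05, power=0.8):
    """P(H1 is true | the test came out significant), by Bayes' theorem."""
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

for prior in (0.5, 0.1, 0.01):
    ppv = positive_predictive_value(prior)
    print(f"prior P(H1) = {prior:>4}: P(H1 | significant) = {ppv:.2f}")
```

When the prior probability of the hypothesis is 1% — not unreasonable for exploratory research — only about 14% of significant results correspond to true effects, even with respectable power and no p-hacking at all.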
Bayesian Statistics in Scientific Practice
Bayesian methods offer an alternative to frequentist testing that many philosophers and statisticians consider more principled. Instead of asking “How surprising is the data if the null is true?”, the Bayesian asks “How do my beliefs about the hypotheses change in light of the data?”
In practice, Bayesian hypothesis testing often uses the Bayes factor — the ratio of the marginal likelihoods of the data under the two hypotheses:
$$BF_{10} = \frac{P(\text{data} | H_1)}{P(\text{data} | H_0)} = \frac{\int P(\text{data} | \theta, H_1) P(\theta | H_1) d\theta}{\int P(\text{data} | \theta, H_0) P(\theta | H_0) d\theta}$$
The Bayes factor has several philosophical advantages over p-values:
- It can provide evidence for the null hypothesis, not merely fail to reject it.
- It is not affected by the stopping rule — evidence can be accumulated until a clear conclusion emerges.
- It automatically penalises overly complex models through the marginal likelihood integral (an “Ockham factor”).
- It provides a continuous measure of evidence, not a binary accept/reject decision.
However, Bayesian methods face their own challenges. The choice of prior can influence results substantially, especially with small samples. The computation of marginal likelihoods can be technically demanding. And the philosophical objections to subjective priors remain relevant. Nevertheless, Bayesian methods are increasingly adopted in fields from genomics to astrophysics to clinical trials.
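For simple models the marginal likelihood integral above has a closed form, which makes the Bayes factor easy to illustrate. A sketch, using an invented binomial example with $H_0: \theta = 0.5$ against $H_1: \theta \sim \mathrm{Uniform}(0, 1)$ (for which the integral evaluates exactly to $1/(n+1)$):

```python
from math import comb

def bayes_factor_10(k, n):
    """BF10 for k successes in n Bernoulli trials, comparing
    H0: theta = 0.5 against H1: theta ~ Uniform(0, 1).
    Under the uniform prior the marginal likelihood is exactly 1 / (n + 1)."""
    marginal_h1 = 1 / (n + 1)
    likelihood_h0 = comb(n, k) * 0.5 ** n
    return marginal_h1 / likelihood_h0

print(f"BF10 for 15/20 successes: {bayes_factor_10(15, 20):.2f}")  # favours H1
print(f"BF10 for 10/20 successes: {bayes_factor_10(10, 20):.2f}")  # favours H0
```

The second case shows the first advantage listed above in action: with 10 successes in 20 trials the Bayes factor falls below 1, positively favouring the null rather than merely failing to reject it.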
Likelihoodism
A third approach, likelihoodism, championed by Richard Royall (1997) and Elliott Sober (2008), attempts to steer a middle course between Bayesianism and frequentism. The likelihoodist holds that the evidential meaning of data is fully captured by the likelihood ratio:
$$\frac{P(E | H_1)}{P(E | H_0)}$$
This is codified in the Law of Likelihood: evidence $E$ favours $H_1$ over $H_0$ if and only if $P(E|H_1) > P(E|H_0)$, and the degree of favouring is measured by the likelihood ratio.
Royall argued that likelihoodism cleanly separates three distinct questions that Bayesianism and frequentism conflate:
1. What does the evidence say? (Answered by the likelihood ratio.)
2. What should I believe? (Requires priors — a Bayesian question.)
3. What should I do? (Requires utilities — a decision-theoretic question.)
The likelihoodist insists that only question 1 is a purely evidential question, and the likelihood ratio answers it without requiring prior probabilities. Critics object that likelihoodism is too austere: it cannot tell us how much to believe a hypothesis, only which of two hypotheses the evidence favours. In cases involving more than two hypotheses, or when we need to make decisions, the likelihoodist framework must be supplemented.
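The Law of Likelihood applies most cleanly to simple (fully specified) hypotheses. A sketch with invented numbers — 15 successes in 20 binomial trials, comparing $H_1: \theta = 0.75$ against $H_0: \theta = 0.5$:

```python
from math import comb

def binomial_likelihood(k, n, theta):
    """P(k successes in n trials | theta)."""
    return comb(n, k) * theta ** k * (1 - theta) ** (n - k)

k, n = 15, 20
lr = binomial_likelihood(k, n, 0.75) / binomial_likelihood(k, n, 0.5)
print(f"Likelihood ratio, H1: theta = 0.75 vs H0: theta = 0.5 -> {lr:.1f}")
```

The data favour $H_1$ by a factor of about 14. Royall proposed ratios of 8 and 32 as rough benchmarks for “fairly strong” and “strong” evidence, though the likelihoodist treats the ratio itself, not any threshold, as the evidential bottom line.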
Model Selection: AIC, BIC, and Bayesian Model Comparison
Much of scientific inference involves choosing among competing models — mathematical representations of data-generating processes. The challenge is to balance goodness-of-fit (how well the model fits the observed data) against complexity (the number of adjustable parameters). A model with enough free parameters can fit any data perfectly, but it will not generalise well to new data.
The Akaike Information Criterion (AIC), introduced by Hirotugu Akaike in 1973, provides a principled trade-off:
$$\text{AIC} = -2 \ln \hat{L} + 2k$$
where $\hat{L}$ is the maximum likelihood and $k$ is the number of free parameters. The first term rewards fit; the second penalises complexity. The model with the lowest AIC is preferred.
The Bayesian Information Criterion (BIC), developed by Gideon Schwarz (1978), imposes a stronger penalty on complexity:
$$\text{BIC} = -2 \ln \hat{L} + k \ln n$$
where $n$ is the sample size. The BIC penalty grows with sample size, making it more parsimonious than AIC for large samples. Philosophically, AIC and BIC embody different goals: AIC estimates predictive accuracy, while BIC approximates the Bayesian posterior probability of the model.
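The two criteria can disagree, and the disagreement tracks their different penalties. A sketch with hypothetical maximised log-likelihoods for two fits of the same $n = 1000$ data points (the numbers are invented to make the contrast visible):

```python
import math

def aic(log_l, k):
    """Akaike Information Criterion: -2 ln L_hat + 2k."""
    return -2 * log_l + 2 * k

def bic(log_l, k, n):
    """Bayesian Information Criterion: -2 ln L_hat + k ln n."""
    return -2 * log_l + k * math.log(n)

# A simpler model and a richer model: the richer one fits slightly
# better (higher log-likelihood) but spends four extra parameters.
n = 1000
fits = {"simple (k=2)": (-520.0, 2), "rich (k=6)": (-515.0, 6)}

for name, (log_l, k) in fits.items():
    print(f"{name:>12}: AIC = {aic(log_l, k):.1f}, BIC = {bic(log_l, k, n):.1f}")
```

Here AIC prefers the richer model (the fit improvement outweighs $2k$), while BIC, whose penalty $k \ln n \approx 6.9k$ at this sample size, prefers the simpler one — a concrete instance of the different goals the two criteria embody.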
Elliott Sober (2015) has argued that the philosophical significance of model selection criteria lies in their demonstration that simplicity is not merely an aesthetic virtue but has a precise epistemic justification — simpler models are preferred because they generalise better, not because the world is metaphysically simple. This connects model selection to deep philosophical questions about Ockham’s razor and the relationship between simplicity and truth.
Mayo’s Error Statistics and Severe Testing
Deborah Mayo (1996, 2018) has developed a sophisticated philosophy of statistics that she calls error statistics. Mayo defends a reformed version of frequentist testing centred on the concept of severe testing:
Severity Principle: Data $x$ provide good evidence for hypothesis $H$ just to the extent that $H$ passes a severe test with $x$ — a test that would, with high probability, have produced a result that accords less well with $H$ if $H$ were false.
Formally, the severity of a test is:
$$\text{SEV}(H, x) = P(\text{test would produce a result less supportive of } H \text{ than } x \mid H \text{ is false})$$
“For a claim to pass a severe test is to survive a procedure that would, with very high probability, have found it to be in error if it were in error.” — Deborah Mayo, Statistical Inference as Severe Testing (2018)
Mayo’s framework addresses many of the pathologies that produce the replication crisis. A result produced by p-hacking or optional stopping does not pass a severe test, because the test procedure was not one that would have reliably detected the error if the hypothesis were false. Severity thus provides a principled criterion for distinguishing genuine evidence from statistical artefacts.
Mayo explicitly positions her approach against both Bayesianism and the crude hybrid of Fisher and Neyman-Pearson. Against the Bayesians, she argues that the severity of a test is determined by the error-probing capacity of the testing procedure, not by prior probabilities. Against the hybrid, she argues that the mechanical application of significance thresholds without attention to the severity of the test is precisely what has led to the replication crisis.
Statistical Explanation and Causal Inference
A final set of philosophical issues concerns the relationship between statistical association and causal explanation. The dictum “correlation does not imply causation” is well known, but the positive question — under what conditions can we infer causation from statistical data? — is far more difficult.
The gold standard for causal inference is the randomised controlled trial (RCT), in which subjects are randomly assigned to treatment and control groups. Randomisation ensures that, in expectation, the two groups differ only in their exposure to the treatment, allowing a causal interpretation of any observed difference in outcomes.
When RCTs are impossible (as in much of epidemiology, economics, and the social sciences), researchers must rely on observational data and employ techniques such as:
- Regression analysis with control variables
- Instrumental variables exploiting natural experiments
- Directed acyclic graphs (DAGs) formalising causal structure (Pearl, 2000)
- Difference-in-differences and regression discontinuity designs
Judea Pearl’s influential work on causal inference using structural causal models and the do-calculus has provided a rigorous mathematical framework for reasoning about causation from observational data. Pearl distinguishes between seeing ($P(Y|X)$) and doing ($P(Y|do(X))$):
$$P(Y | do(X)) \neq P(Y | X) \text{ in general}$$
The philosophical significance of Pearl’s work is that it shows causation cannot be reduced to statistical association — causal reasoning requires additional assumptions about the structure of the data-generating process that go beyond what the data alone can tell us. This reinforces the broader lesson of this chapter: statistical inference is never purely mechanical; it always rests on philosophical assumptions.
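The gap between seeing and doing can be demonstrated with a toy simulation (the causal structure and all its probabilities are invented): a binary confounder $Z$ influences both $X$ and $Y$, so the observational conditional $P(Y|X=1)$ overstates the interventional quantity $P(Y|do(X=1))$:

```python
import random

random.seed(1)

def simulate(n=100_000, intervene_x=None):
    """Binary system with confounding: Z -> X, Z -> Y, and X -> Y.
    Setting intervene_x forces X, severing the Z -> X arrow: this is do(X)."""
    draws = []
    for _ in range(n):
        z = random.random() < 0.5
        if intervene_x is None:
            x = random.random() < (0.8 if z else 0.2)  # Z strongly drives X
        else:
            x = intervene_x
        y = random.random() < (0.1 + 0.3 * x + 0.4 * z)  # X and Z both raise Y
        draws.append((x, y))
    return draws

obs = simulate()
p_y_given_x1 = sum(y for x, y in obs if x) / sum(1 for x, y in obs if x)

do = simulate(intervene_x=True)
p_y_do_x1 = sum(y for _, y in do) / len(do)

print(f"P(Y=1 | X=1)     ~ {p_y_given_x1:.3f}")  # inflated by the confounder Z
print(f"P(Y=1 | do(X=1)) ~ {p_y_do_x1:.3f}")
```

Conditioning on $X = 1$ selects units likely to have $Z = 1$, which independently raises $Y$; intervening on $X$ does not. The simulated gap (roughly 0.72 versus 0.60 in this contrived setup) is precisely the inequality displayed above, and no amount of observational data on $X$ and $Y$ alone could reveal it.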