Maximum Likelihood Estimation
Deriving estimators by maximizing the probability of observed data under a parametric model
Historical Context
The principle of maximum likelihood was first articulated by R.A. Fisher in a series of papers beginning in 1912 and culminating in his landmark 1922 paper “On the mathematical foundations of theoretical statistics.” Fisher introduced the term “likelihood” to distinguish between the probability of data given parameters and the plausibility of parameters given data—a conceptual distinction that remains fundamental to statistical inference. His insight was that the best estimator of a parameter is the value that makes the observed data most probable.
Fisher also established the key asymptotic properties of MLEs: consistency, asymptotic normality, and efficiency. He introduced Fisher information as a measure of the amount of information a sample carries about a parameter. The Cramér-Rao lower bound, proved independently by C.R. Rao in 1945 and Harald Cramér in 1946, shows that no unbiased estimator can have variance smaller than the reciprocal of the Fisher information; the MLE attains this bound asymptotically. The EM algorithm, developed by Dempster, Laird, and Rubin in 1977, extended the reach of maximum likelihood to incomplete-data problems, making it one of the most widely used algorithms in modern statistics and machine learning.
1.1 The Likelihood Function
Given independent observations $x_1, \ldots, x_n$ from a distribution with density $f(x \mid \theta)$, the likelihood function is defined as
$$L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta),$$
viewed as a function of $\theta$ for fixed data. The likelihood is not a probability density in $\theta$—it need not integrate to one. It measures relative plausibility: if $L(\theta_1) > L(\theta_2)$, then $\theta_1$ is better supported by the data than $\theta_2$.
Definition: Log-Likelihood and Score Function
The log-likelihood is $\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$. It transforms products into sums, which are both numerically stable and analytically tractable.
The score function is the gradient of the log-likelihood:
$$S(\theta) = \nabla_\theta\, \ell(\theta) = \sum_{i=1}^{n} \nabla_\theta \log f(x_i \mid \theta).$$
Under regularity conditions, $E_\theta[S(\theta)] = 0$—the score has mean zero at the true parameter value. The MLE satisfies $S(\hat{\theta}) = 0$.
Example (Normal distribution): For $X_i \sim N(\mu, \sigma^2)$ with known $\sigma^2$, the log-likelihood is
$$\ell(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2.$$
Setting $S(\mu) = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0$ yields the MLE $\hat{\mu} = \bar{x}$, the sample mean.
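The same MLE can be recovered by direct numerical optimization, which is useful as a sanity check. A minimal sketch using SciPy, with illustrative values for $\mu$, $\sigma$, and the sample size:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
sigma = 2.0                          # known sigma (illustrative)
x = rng.normal(loc=5.0, scale=sigma, size=500)

def neg_log_lik(mu):
    # negative log-likelihood for N(mu, sigma^2), up to an additive constant
    return 0.5 * np.sum((x - mu) ** 2) / sigma**2

res = minimize_scalar(neg_log_lik)
print(res.x, x.mean())               # numerical optimum matches the sample mean
```

Minimizing the negative log-likelihood is equivalent to maximizing the likelihood; the optimizer lands on $\bar{x}$ to numerical precision.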
1.2 Properties of Maximum Likelihood Estimators
Under standard regularity conditions (the support of $f(x \mid \theta)$ does not depend on $\theta$, the parameter space is open, and $f$ is sufficiently smooth), the MLE possesses remarkable asymptotic properties.
Theorem: Asymptotic Properties of the MLE
Let $\theta_0$ be the true parameter value. Then:
1. Consistency: $\hat{\theta}_n \xrightarrow{P} \theta_0$ as $n \to \infty$. The MLE converges in probability to the true parameter.
2. Asymptotic Normality:
$$\sqrt{n}\,\big(\hat{\theta}_n - \theta_0\big) \xrightarrow{d} N\!\big(0,\, I(\theta_0)^{-1}\big),$$
where $I(\theta_0)$ is the Fisher information.
3. Asymptotic Efficiency: The MLE achieves the Cramér-Rao lower bound asymptotically, making it the best among regular estimators.
4. Invariance: If $\hat{\theta}$ is the MLE of $\theta$, then $g(\hat{\theta})$ is the MLE of $g(\theta)$ for any function $g$.
Consistency follows from the fact that the Kullback-Leibler divergence $D_{\text{KL}}(f_{\theta_0} \| f_\theta) \geq 0$ with equality only at $\theta = \theta_0$, so maximizing the log-likelihood is equivalent to minimizing KL divergence from the true distribution. Asymptotic normality follows from a Taylor expansion of the score function around $\theta_0$ and the central limit theorem.
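Asymptotic normality is easy to check by simulation. For the normal-mean example above, the MLE is the sample mean and $I(\mu) = 1/\sigma^2$, so $\sqrt{n}(\hat{\mu}_n - \mu_0)$ should be approximately $N(0, \sigma^2)$. A sketch with illustrative settings:

```python
import numpy as np

rng = np.random.default_rng(1)
mu0, sigma, n, reps = 5.0, 2.0, 200, 50_000
# MLE of mu is the sample mean; I(mu) = 1/sigma^2, so
# sqrt(n) * (mle - mu0) should be approximately N(0, sigma^2)
samples = rng.normal(mu0, sigma, size=(reps, n))
z = np.sqrt(n) * (samples.mean(axis=1) - mu0)
print(z.mean(), z.var())             # close to 0 and sigma^2 = 4
```

The empirical mean and variance of the standardized estimates match the theoretical limit.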
1.3 Fisher Information and the Cramér-Rao Bound
Definition: Fisher Information
The Fisher information for a single observation is
$$I(\theta) = E_\theta\!\left[\left(\frac{\partial}{\partial\theta} \log f(X \mid \theta)\right)^{\!2}\right] = -E_\theta\!\left[\frac{\partial^2}{\partial\theta^2} \log f(X \mid \theta)\right].$$
For $n$ i.i.d. observations, the total information is $I_n(\theta) = n \cdot I(\theta)$. Fisher information measures the curvature of the log-likelihood at the true parameter: higher curvature means the data are more informative about $\theta$.
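Fisher information can also be estimated numerically as $E_\theta[S^2]$, which is handy when derivatives are awkward. A Monte Carlo sketch for the Poisson model (where $I(\lambda) = 1/\lambda$), using a central finite difference for the score; the value of $\lambda$ and the sample size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 3.0

def log_pmf(x, lam):
    # Poisson log-density, up to a term not involving lambda
    return x * np.log(lam) - lam

x = rng.poisson(lam, size=200_000).astype(float)
h = 1e-5
# central finite difference approximates the score d/dlam log f(x | lam)
score = (log_pmf(x, lam + h) - log_pmf(x, lam - h)) / (2 * h)
info = np.mean(score ** 2)           # Monte Carlo estimate of E[score^2]
print(info)                          # close to 1/lam = 0.333...
```

The $\log(x!)$ term is omitted because it does not depend on $\lambda$ and cancels in the finite difference.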
Theorem: Cramér-Rao Lower Bound
For any unbiased estimator $T$ of $\theta$,
$$\operatorname{Var}(T) \geq \frac{1}{I_n(\theta)}.$$
An estimator that achieves this bound is called efficient. The MLE is asymptotically efficient: its variance approaches $1/I_n(\theta_0)$ as $n \to \infty$.
In the multiparameter case, $\theta \in \mathbb{R}^p$, the Fisher information becomes a $p \times p$ matrix with entries
$$[I(\theta)]_{jk} = E_\theta\!\left[\frac{\partial \log f(X \mid \theta)}{\partial \theta_j}\,\frac{\partial \log f(X \mid \theta)}{\partial \theta_k}\right],$$
and the Cramér-Rao bound generalizes to $\text{Cov}(T) \succeq I_n(\theta)^{-1}$ in the matrix (Loewner) ordering.
Example (Bernoulli): For $X \sim \text{Bernoulli}(p)$, we have $I(p) = \frac{1}{p(1-p)}$. The MLE $\hat{p} = \bar{X}$ has variance $\frac{p(1-p)}{n} = \frac{1}{nI(p)}$, exactly matching the Cramér-Rao bound—so the sample proportion is efficient for all $n$, not just asymptotically.
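The exact efficiency of the sample proportion can be verified by simulation. A sketch with illustrative settings for $p$, $n$, and the number of replications:

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, reps = 0.3, 100, 100_000       # illustrative settings
# the MLE p-hat is the sample proportion; the Cramer-Rao bound is p(1-p)/n
phat = rng.binomial(n, p, size=reps) / n
print(phat.var(), p * (1 - p) / n)   # both near 0.0021
```

The empirical variance of $\hat{p}$ matches the Cramér-Rao bound, confirming exact (not merely asymptotic) efficiency here.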
1.4 Computing MLEs: Newton-Raphson and the EM Algorithm
When closed-form solutions are unavailable, we resort to iterative optimization. The Newton-Raphson method uses the score and observed information to iterate:
$$\theta^{(t+1)} = \theta^{(t)} + J\big(\theta^{(t)}\big)^{-1} S\big(\theta^{(t)}\big),$$
where $J(\theta) = -\nabla^2 \ell(\theta)$ is the observed information (the negative Hessian of the log-likelihood).
Replacing the observed Hessian with its expectation $-I_n(\theta^{(t)})$ gives the Fisher scoring algorithm, which has the same asymptotic convergence rate but can be more stable.
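As a concrete sketch, consider Newton-Raphson for the gamma shape parameter, whose MLE has no closed form. Profiling out the rate ($\hat{\beta} = \alpha/\bar{x}$) reduces the score equation to $\log\alpha - \psi(\alpha) = s$ with $s = \log\bar{x} - \overline{\log x}$. The data settings and the starting value $\alpha_0 = 1/(2s)$ are illustrative:

```python
import numpy as np
from scipy.special import digamma, polygamma

rng = np.random.default_rng(4)
# simulated sample: true shape alpha = 2.5, rate beta = 1.5 (illustrative)
x = rng.gamma(shape=2.5, scale=1.0 / 1.5, size=2000)

# profile score equation: log(alpha) - digamma(alpha) = s, where
s = np.log(x.mean()) - np.mean(np.log(x))

alpha = 1.0 / (2 * s)                # rough starting value
for _ in range(50):                  # Newton-Raphson on g(a) = log a - psi(a) - s
    g = np.log(alpha) - digamma(alpha) - s
    gprime = 1.0 / alpha - polygamma(1, alpha)
    step = g / gprime
    alpha = max(alpha - step, 1e-8)  # guard against overshooting below zero
    if abs(step) < 1e-10:
        break
beta = alpha / x.mean()
print(alpha, beta)                   # near the true 2.5 and 1.5
```

Convergence is typically reached in a handful of iterations because $g$ is monotone in $\alpha$.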
The EM Algorithm
For models with latent variables $Z$, the Expectation-Maximization (EM) algorithm iterates between two steps:
E-step: Compute the expected complete-data log-likelihood:
$$Q\big(\theta \mid \theta^{(t)}\big) = E_{Z \mid X,\, \theta^{(t)}}\!\big[\log f(X, Z \mid \theta)\big].$$
M-step: Maximize over $\theta$:
$$\theta^{(t+1)} = \operatorname*{arg\,max}_{\theta}\, Q\big(\theta \mid \theta^{(t)}\big).$$
The EM algorithm guarantees $\ell(\theta^{(t+1)}) \geq \ell(\theta^{(t)})$ at every iteration and converges to a local maximum (or saddle point) of the observed-data log-likelihood.
The EM algorithm is particularly useful for mixture models, where the latent variable indicates component membership. For a Gaussian mixture with $K$ components, $f(x \mid \theta) = \sum_{k=1}^{K} \pi_k \phi(x \mid \mu_k, \sigma_k^2)$, the E-step computes posterior responsibilities $\gamma_{ik} = P(Z_i = k \mid x_i, \theta^{(t)})$ and the M-step updates each component’s parameters as weighted averages.
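A minimal EM sketch for a two-component univariate Gaussian mixture; the mixture parameters, sample size, initial guesses, and iteration count are all illustrative:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
# two-component mixture: 40% N(-2, 1) and 60% N(3, 1.5^2) (illustrative)
z = rng.random(3000) < 0.4
x = np.where(z, rng.normal(-2.0, 1.0, 3000), rng.normal(3.0, 1.5, 3000))

# initial guesses for weights, means, and standard deviations
pi_, mu, sd = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(200):
    # E-step: posterior responsibilities gamma_ik
    dens = pi_ * norm.pdf(x[:, None], mu, sd)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: weighted-average updates of each component's parameters
    nk = gamma.sum(axis=0)
    pi_ = nk / len(x)
    mu = (gamma * x[:, None]).sum(axis=0) / nk
    sd = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
print(pi_, mu, sd)                   # near (0.4, 0.6), (-2, 3), (1, 1.5)
```

Each iteration provably does not decrease the observed-data log-likelihood; with well-separated components, the fit recovers the generating parameters.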
1.5 MLE for Common Models
For exponential family distributions $f(x \mid \theta) = h(x)\exp\!\big(\eta(\theta)^\top T(x) - A(\theta)\big)$, the MLE equations reduce to matching sufficient statistics: the MLE satisfies
$$E_{\hat{\theta}}[T(X)] = \frac{1}{n}\sum_{i=1}^{n} T(x_i).$$
Summary of Common MLEs
Normal($\mu, \sigma^2$): $\hat{\mu} = \bar{X}$, $\hat{\sigma}^2 = \frac{1}{n}\sum(X_i - \bar{X})^2$ (biased for $\sigma^2$)
Exponential($\lambda$): $\hat{\lambda} = 1/\bar{X}$
Poisson($\lambda$): $\hat{\lambda} = \bar{X}$
Gamma($\alpha, \beta$): No closed form; requires numerical optimization
Uniform(0, $\theta$): $\hat{\theta} = X_{(n)}$ (maximum order statistic)—note this violates regularity conditions, so standard asymptotics fail. The MLE converges at rate $n$ rather than $\sqrt{n}$.
The normal MLE for $\sigma^2$ illustrates an important subtlety: the MLE divides by $n$, producing a biased estimator, while the unbiased sample variance divides by $n-1$. Although biased, the MLE has smaller mean squared error for $n \geq 2$ because the bias-variance tradeoff favors it. As $n \to \infty$ the difference vanishes.
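The MSE comparison is easy to verify by simulation. A sketch with illustrative settings ($\sigma^2 = 4$, $n = 10$):

```python
import numpy as np

rng = np.random.default_rng(6)
sigma2, n, reps = 4.0, 10, 200_000
x = rng.normal(0.0, 2.0, size=(reps, n))
mle = x.var(axis=1, ddof=0)          # divides by n (the MLE)
s2 = x.var(axis=1, ddof=1)           # divides by n-1 (unbiased)
mse_mle = np.mean((mle - sigma2) ** 2)
mse_s2 = np.mean((s2 - sigma2) ** 2)
print(mse_mle, mse_s2)               # the MLE has the smaller MSE
```

For these settings the theoretical values are $\frac{2n-1}{n^2}\sigma^4 = 3.04$ for the MLE versus $\frac{2\sigma^4}{n-1} \approx 3.56$ for $s^2$.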
Computational Lab: Maximum Likelihood in Practice
We visualize the log-likelihood surface for a normal model, implement Newton-Raphson to find the MLE of a gamma distribution, compute Fisher information numerically, and apply the EM algorithm to fit a Gaussian mixture.
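A sketch of the first lab task: evaluating the normal log-likelihood on a $(\mu, \sigma)$ grid and locating its maximum (a plotting step would draw `ll` as a contour surface). The sample and grid ranges are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(1.0, 2.0, size=400)   # true mu = 1, sigma = 2 (illustrative)

# evaluate the normal log-likelihood (up to a constant) on a (mu, sigma) grid
mus = np.linspace(0.0, 2.0, 101)
sigmas = np.linspace(1.0, 3.0, 101)
M, S = np.meshgrid(mus, sigmas, indexing="ij")
ll = (-len(x) * np.log(S)
      - ((x[:, None, None] - M) ** 2).sum(axis=0) / (2 * S**2))
i, j = np.unravel_index(np.argmax(ll), ll.shape)
print(mus[i], sigmas[j])             # near the sample mean and MLE sd
```

The grid maximum sits at the grid points closest to $\bar{x}$ and the MLE standard deviation, as the closed-form solutions predict.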
Practice Problems
Problem 1: Derive the MLE for both parameters $\mu$ and $\sigma^2$ of a normal distribution from a sample $x_1, \ldots, x_n$.
Solution:
1. Log-likelihood: $\ell(\mu, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2$.
2. Differentiate w.r.t. $\mu$: $\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu) = 0$.
3. Solve: $\hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}$ (the sample mean).
4. Differentiate w.r.t. $\sigma^2$: $\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (x_i - \mu)^2 = 0$.
5. Solve: $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2$ (the biased sample variance, dividing by $n$ not $n-1$).
6. The MLE $\hat{\sigma}^2$ is biased: $E[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2$. The unbiased estimator $s^2 = \frac{n}{n-1}\hat{\sigma}^2$ divides by $n-1$. Despite being biased, the MLE has smaller MSE for $n \geq 2$.
Problem 2: Derive the MLE for the rate parameter $\lambda$ of a Poisson distribution from observations $x_1, \ldots, x_n$.
Solution:
1. Poisson PMF: $P(X = x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}$.
2. Log-likelihood: $\ell(\lambda) = \sum_{i=1}^n [x_i \ln\lambda - \lambda - \ln(x_i!)] = \ln\lambda \sum x_i - n\lambda - \sum\ln(x_i!)$.
3. Score: $\frac{d\ell}{d\lambda} = \frac{\sum x_i}{\lambda} - n = 0$.
4. Solve: $\hat{\lambda} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}$ (the sample mean).
5. Verify it is a maximum: $\frac{d^2\ell}{d\lambda^2} = -\frac{\sum x_i}{\lambda^2} < 0$ for $\lambda > 0$.
6. The MLE is unbiased ($E[\bar{X}] = \lambda$) and equals the sufficient statistic. Its variance is $\text{Var}(\hat{\lambda}) = \lambda/n$, which achieves the Cramér-Rao bound since $I(\lambda) = 1/\lambda$ and $1/(nI) = \lambda/n$.
Problem 3: Derive the MLE for the rate parameter $\lambda$ of an exponential distribution $f(x) = \lambda e^{-\lambda x}$ for $x \geq 0$.
Solution:
1. Log-likelihood: $\ell(\lambda) = \sum_{i=1}^n \ln(\lambda e^{-\lambda x_i}) = n\ln\lambda - \lambda\sum_{i=1}^n x_i$.
2. Score: $\frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum x_i = 0$.
3. Solve: $\hat{\lambda} = \frac{n}{\sum x_i} = \frac{1}{\bar{x}}$.
4. Second derivative: $\frac{d^2\ell}{d\lambda^2} = -\frac{n}{\lambda^2} < 0$. Confirmed maximum.
5. The MLE is biased: $E[1/\bar{X}] \neq \lambda$ (Jensen's inequality, since $1/x$ is convex). However, $\hat{\lambda}$ is consistent: $\hat{\lambda} \xrightarrow{P} \lambda$ as $n \to \infty$.
6. Fisher information: $I(\lambda) = 1/\lambda^2$. The Cramér-Rao bound is $\text{Var}(\hat{\lambda}) \geq \lambda^2/n$. The MLE achieves this bound asymptotically.
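The bias in step 5 can be made concrete: since $\bar{X}$ has a gamma distribution, $E[1/\bar{X}] = \frac{n\lambda}{n-1}$ exactly, so the MLE overshoots for small $n$. A simulation sketch with illustrative settings ($\lambda = 2$, $n = 5$):

```python
import numpy as np

rng = np.random.default_rng(8)
lam, n, reps = 2.0, 5, 400_000
x = rng.exponential(scale=1.0 / lam, size=(reps, n))
mle = 1.0 / x.mean(axis=1)
# Jensen's inequality: E[1/xbar] > 1/E[xbar] = lam;
# exactly, E[mle] = n*lam/(n-1) = 2.5 here
print(mle.mean())                    # near 2.5, not the true lam = 2
```

The overshoot shrinks as $n$ grows, consistent with the MLE being consistent but biased in finite samples.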
Problem 4: Compute the Fisher information $I(p)$ for a single observation from a Binomial($n, p$) distribution, and find the Cramér-Rao lower bound on the variance of any unbiased estimator of $p$.
Solution:
1. Binomial log-likelihood for one observation $X$: $\ell(p) = X\ln p + (n-X)\ln(1-p) + \ln\binom{n}{X}$.
2. Score: $S(p) = \frac{X}{p} - \frac{n-X}{1-p}$.
3. Fisher information: $I(p) = -E\!\left[\frac{d^2\ell}{dp^2}\right] = -E\!\left[-\frac{X}{p^2} - \frac{n-X}{(1-p)^2}\right]$.
4. Using $E[X] = np$: $I(p) = \frac{np}{p^2} + \frac{n - np}{(1-p)^2} = \frac{n}{p} + \frac{n}{1-p} = \frac{n}{p(1-p)}$.
5. Cramér-Rao bound: $\text{Var}(\hat{p}) \geq \frac{1}{I(p)} = \frac{p(1-p)}{n}$.
6. The MLE $\hat{p} = X/n$ has $\text{Var}(\hat{p}) = p(1-p)/n$, exactly matching the bound. So $\hat{p}$ is efficient (minimum variance unbiased) for all sample sizes, not just asymptotically. This is a special property of exponential family distributions.
Problem 5: For a Poisson distribution with parameter $\lambda$, find the Fisher information and use the Cramér-Rao bound to show that no unbiased estimator of $e^{-\lambda}$ (the probability of zero counts) can have variance less than $\frac{\lambda e^{-2\lambda}}{n}$.
Solution:
1. Fisher information for Poisson: $I(\lambda) = E\!\left[\left(\frac{X - \lambda}{\lambda}\right)^2\right] = \frac{\text{Var}(X)}{\lambda^2} = \frac{\lambda}{\lambda^2} = \frac{1}{\lambda}$.
2. For $n$ i.i.d. observations: $I_n(\lambda) = n/\lambda$.
3. To apply the Cramér-Rao bound for $g(\lambda) = e^{-\lambda}$, use the generalized form: $\text{Var}(T) \geq \frac{[g'(\lambda)]^2}{I_n(\lambda)}$.
4. $g'(\lambda) = -e^{-\lambda}$, so $[g'(\lambda)]^2 = e^{-2\lambda}$.
5. Cramér-Rao bound: $\text{Var}(T) \geq \frac{e^{-2\lambda}}{n/\lambda} = \frac{\lambda e^{-2\lambda}}{n}$.
6. The MLE of $e^{-\lambda}$ is $e^{-\hat{\lambda}} = e^{-\bar{X}}$ (by the invariance property). This is biased for finite $n$ but is asymptotically efficient, achieving the Cramer-Rao bound as $n \to \infty$. An exactly unbiased estimator is $T = (1 - 1/n)^{\sum X_i}$.
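Both estimators from step 6 can be compared by simulation: the invariance-based MLE $e^{-\bar{X}}$ carries a finite-sample bias, while $(1 - 1/n)^{\sum X_i}$ is exactly unbiased. A sketch with illustrative settings ($\lambda = 2$, $n = 10$):

```python
import numpy as np

rng = np.random.default_rng(9)
lam, n, reps = 2.0, 10, 400_000
x = rng.poisson(lam, size=(reps, n))
target = np.exp(-lam)                       # true P(X = 0), about 0.1353
mle = np.exp(-x.mean(axis=1))               # MLE via invariance; biased for finite n
unbiased = (1 - 1 / n) ** x.sum(axis=1)     # exactly unbiased estimator
print(mle.mean(), unbiased.mean(), target)
```

The MLE's mean sits visibly above the target at this sample size, while the unbiased estimator's mean matches it; as $n \to \infty$ the two estimators coincide.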