Maximum Likelihood Estimation
Deriving estimators by maximizing the probability of observed data under a parametric model
Historical Context
The principle of maximum likelihood was first articulated by R.A. Fisher in a series of papers beginning in 1912 and culminating in his landmark 1922 paper “On the mathematical foundations of theoretical statistics.” Fisher introduced the term “likelihood” to distinguish between the probability of data given parameters and the plausibility of parameters given data—a conceptual distinction that remains fundamental to statistical inference. His insight was that the best estimator of a parameter is the value that makes the observed data most probable.
Fisher also established the key asymptotic properties of MLEs: consistency, asymptotic normality, and efficiency. He introduced Fisher information as a measure of the amount of information a sample carries about a parameter. The Cramér-Rao lower bound, proved independently by C.R. Rao in 1945 and Harald Cramér in 1946, shows that no unbiased estimator can have variance smaller than the reciprocal of the Fisher information; the MLE attains this bound asymptotically. The EM algorithm, developed by Dempster, Laird, and Rubin in 1977, extended the reach of maximum likelihood to incomplete-data problems, making it one of the most widely used algorithms in modern statistics and machine learning.
1.1 The Likelihood Function
Given independent observations $x_1, \ldots, x_n$ from a distribution with density $f(x \mid \theta)$, the likelihood function is defined as
$$L(\theta) = \prod_{i=1}^{n} f(x_i \mid \theta),$$
viewed as a function of $\theta$ for fixed data. The likelihood is not a probability density in $\theta$—it need not integrate to one. It measures relative plausibility: if $L(\theta_1) > L(\theta_2)$, then $\theta_1$ is better supported by the data than $\theta_2$.
Definition: Log-Likelihood and Score Function
The log-likelihood is $\ell(\theta) = \log L(\theta) = \sum_{i=1}^{n} \log f(x_i \mid \theta)$. It transforms products into sums, which are both numerically stable and analytically tractable.
The score function is the gradient of the log-likelihood:
$$S(\theta) = \nabla_\theta\, \ell(\theta) = \sum_{i=1}^{n} \nabla_\theta \log f(x_i \mid \theta).$$
Under regularity conditions, $E_\theta[S(\theta)] = 0$—the score has mean zero at the true parameter value. The MLE satisfies $S(\hat{\theta}) = 0$.
Example (Normal distribution): For $X_i \sim N(\mu, \sigma^2)$ with known $\sigma^2$, the log-likelihood is
$$\ell(\mu) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2.$$
Setting $S(\mu) = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0$ yields the MLE $\hat{\mu} = \bar{x}$, the sample mean.
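The same MLE can be recovered by direct numerical optimization, which is useful as a sanity check. A minimal sketch using SciPy, with illustrative values for $\mu$, $\sigma$, and the sample size:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
sigma = 2.0                          # known sigma (illustrative)
x = rng.normal(loc=5.0, scale=sigma, size=500)

def neg_log_lik(mu):
    # negative log-likelihood for N(mu, sigma^2), up to an additive constant
    return 0.5 * np.sum((x - mu) ** 2) / sigma**2

res = minimize_scalar(neg_log_lik)
print(res.x, x.mean())               # numerical optimum matches the sample mean
```

Minimizing the negative log-likelihood is equivalent to maximizing the likelihood; the optimizer lands on $\bar{x}$ to numerical precision.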
1.2 Properties of Maximum Likelihood Estimators
Under standard regularity conditions (the support of $f(x \mid \theta)$ does not depend on $\theta$, the parameter space is open, and $f$ is sufficiently smooth), the MLE possesses remarkable asymptotic properties.
Theorem: Asymptotic Properties of the MLE
Let $\theta_0$ be the true parameter value. Then:
1. Consistency: $\hat{\theta}_n \xrightarrow{P} \theta_0$ as $n \to \infty$. The MLE converges in probability to the true parameter.
2. Asymptotic Normality:
$$\sqrt{n}\,\big(\hat{\theta}_n - \theta_0\big) \xrightarrow{d} N\!\big(0,\, I(\theta_0)^{-1}\big),$$
where $I(\theta_0)$ is the Fisher information.
3. Asymptotic Efficiency: The MLE achieves the Cramér-Rao lower bound asymptotically, making it the best among regular estimators.
4. Invariance: If $\hat{\theta}$ is the MLE of $\theta$, then $g(\hat{\theta})$ is the MLE of $g(\theta)$ for any function $g$.
Consistency follows from the fact that the Kullback-Leibler divergence $D_{\text{KL}}(f_{\theta_0} \| f_\theta) \geq 0$ with equality only at $\theta = \theta_0$, so maximizing the log-likelihood is equivalent to minimizing KL divergence from the true distribution. Asymptotic normality follows from a Taylor expansion of the score function around $\theta_0$ and the central limit theorem.
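Asymptotic normality is easy to check by simulation. For the normal-mean example above, the MLE is the sample mean and $I(\mu) = 1/\sigma^2$, so $\sqrt{n}(\hat{\mu}_n - \mu_0)$ should be approximately $N(0, \sigma^2)$. A sketch with illustrative settings:

```python
import numpy as np

rng = np.random.default_rng(1)
mu0, sigma, n, reps = 5.0, 2.0, 200, 50_000
# MLE of mu is the sample mean; I(mu) = 1/sigma^2, so
# sqrt(n) * (mle - mu0) should be approximately N(0, sigma^2)
samples = rng.normal(mu0, sigma, size=(reps, n))
z = np.sqrt(n) * (samples.mean(axis=1) - mu0)
print(z.mean(), z.var())             # close to 0 and sigma^2 = 4
```

The empirical mean and variance of the standardized estimates match the theoretical limit.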
1.3 Fisher Information and the Cramér-Rao Bound
Definition: Fisher Information
The Fisher information for a single observation is
$$I(\theta) = E_\theta\!\left[\left(\frac{\partial}{\partial\theta} \log f(X \mid \theta)\right)^{\!2}\right] = -E_\theta\!\left[\frac{\partial^2}{\partial\theta^2} \log f(X \mid \theta)\right].$$
For $n$ i.i.d. observations, the total information is $I_n(\theta) = n \cdot I(\theta)$. Fisher information measures the curvature of the log-likelihood at the true parameter: higher curvature means the data are more informative about $\theta$.
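Fisher information can also be estimated numerically as $E_\theta[S^2]$, which is handy when derivatives are awkward. A Monte Carlo sketch for the Poisson model (where $I(\lambda) = 1/\lambda$), using a central finite difference for the score; the value of $\lambda$ and the sample size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
lam = 3.0

def log_pmf(x, lam):
    # Poisson log-density, up to a term not involving lambda
    return x * np.log(lam) - lam

x = rng.poisson(lam, size=200_000).astype(float)
h = 1e-5
# central finite difference approximates the score d/dlam log f(x | lam)
score = (log_pmf(x, lam + h) - log_pmf(x, lam - h)) / (2 * h)
info = np.mean(score ** 2)           # Monte Carlo estimate of E[score^2]
print(info)                          # close to 1/lam = 0.333...
```

The $\log(x!)$ term is omitted because it does not depend on $\lambda$ and cancels in the finite difference.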
Theorem: Cramér-Rao Lower Bound
For any unbiased estimator $T$ of $\theta$,
$$\operatorname{Var}(T) \geq \frac{1}{I_n(\theta)}.$$
An estimator that achieves this bound is called efficient. The MLE is asymptotically efficient: its variance approaches $1/I_n(\theta_0)$ as $n \to \infty$.
In the multiparameter case, $\theta \in \mathbb{R}^p$, the Fisher information becomes a $p \times p$ matrix with entries
$$[I(\theta)]_{jk} = E_\theta\!\left[\frac{\partial \log f(X \mid \theta)}{\partial \theta_j}\,\frac{\partial \log f(X \mid \theta)}{\partial \theta_k}\right],$$
and the Cramér-Rao bound generalizes to $\text{Cov}(T) \succeq I_n(\theta)^{-1}$ in the matrix (Loewner) ordering.
Example (Bernoulli): For $X \sim \text{Bernoulli}(p)$, we have $I(p) = \frac{1}{p(1-p)}$. The MLE $\hat{p} = \bar{X}$ has variance $\frac{p(1-p)}{n} = \frac{1}{nI(p)}$, exactly matching the Cramér-Rao bound—so the sample proportion is efficient for all $n$, not just asymptotically.
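The exact efficiency of the sample proportion can be verified by simulation. A sketch with illustrative settings for $p$, $n$, and the number of replications:

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, reps = 0.3, 100, 100_000       # illustrative settings
# the MLE p-hat is the sample proportion; the Cramer-Rao bound is p(1-p)/n
phat = rng.binomial(n, p, size=reps) / n
print(phat.var(), p * (1 - p) / n)   # both near 0.0021
```

The empirical variance of $\hat{p}$ matches the Cramér-Rao bound, confirming exact (not merely asymptotic) efficiency here.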
1.4 Computing MLEs: Newton-Raphson and the EM Algorithm
When closed-form solutions are unavailable, we resort to iterative optimization. The Newton-Raphson method uses the score and observed information to iterate:
$$\theta^{(t+1)} = \theta^{(t)} + J\big(\theta^{(t)}\big)^{-1} S\big(\theta^{(t)}\big),$$
where $J(\theta) = -\nabla^2 \ell(\theta)$ is the observed information (the negative Hessian of the log-likelihood).
Replacing the observed Hessian with its expectation $-I_n(\theta^{(t)})$ gives the Fisher scoring algorithm, which has the same asymptotic convergence rate but can be more stable.
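As a concrete sketch, consider Newton-Raphson for the gamma shape parameter, whose MLE has no closed form. Profiling out the rate ($\hat{\beta} = \alpha/\bar{x}$) reduces the score equation to $\log\alpha - \psi(\alpha) = s$ with $s = \log\bar{x} - \overline{\log x}$. The data settings and the starting value $\alpha_0 = 1/(2s)$ are illustrative:

```python
import numpy as np
from scipy.special import digamma, polygamma

rng = np.random.default_rng(4)
# simulated sample: true shape alpha = 2.5, rate beta = 1.5 (illustrative)
x = rng.gamma(shape=2.5, scale=1.0 / 1.5, size=2000)

# profile score equation: log(alpha) - digamma(alpha) = s, where
s = np.log(x.mean()) - np.mean(np.log(x))

alpha = 1.0 / (2 * s)                # rough starting value
for _ in range(50):                  # Newton-Raphson on g(a) = log a - psi(a) - s
    g = np.log(alpha) - digamma(alpha) - s
    gprime = 1.0 / alpha - polygamma(1, alpha)
    step = g / gprime
    alpha = max(alpha - step, 1e-8)  # guard against overshooting below zero
    if abs(step) < 1e-10:
        break
beta = alpha / x.mean()
print(alpha, beta)                   # near the true 2.5 and 1.5
```

Convergence is typically reached in a handful of iterations because $g$ is monotone in $\alpha$.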
The EM Algorithm
For models with latent variables $Z$, the Expectation-Maximization (EM) algorithm iterates between two steps:
E-step: Compute the expected complete-data log-likelihood:
$$Q\big(\theta \mid \theta^{(t)}\big) = E_{Z \mid X,\, \theta^{(t)}}\!\big[\log f(X, Z \mid \theta)\big].$$
M-step: Maximize over $\theta$:
$$\theta^{(t+1)} = \operatorname*{arg\,max}_{\theta}\, Q\big(\theta \mid \theta^{(t)}\big).$$
The EM algorithm guarantees $\ell(\theta^{(t+1)}) \geq \ell(\theta^{(t)})$ at every iteration and converges to a local maximum (or saddle point) of the observed-data log-likelihood.
The EM algorithm is particularly useful for mixture models, where the latent variable indicates component membership. For a Gaussian mixture with $K$ components, $f(x \mid \theta) = \sum_{k=1}^{K} \pi_k \phi(x \mid \mu_k, \sigma_k^2)$, the E-step computes posterior responsibilities $\gamma_{ik} = P(Z_i = k \mid x_i, \theta^{(t)})$ and the M-step updates each component’s parameters as weighted averages.
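A minimal EM sketch for a two-component univariate Gaussian mixture; the mixture parameters, sample size, initial guesses, and iteration count are all illustrative:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
# two-component mixture: 40% N(-2, 1) and 60% N(3, 1.5^2) (illustrative)
z = rng.random(3000) < 0.4
x = np.where(z, rng.normal(-2.0, 1.0, 3000), rng.normal(3.0, 1.5, 3000))

# initial guesses for weights, means, and standard deviations
pi_, mu, sd = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])
for _ in range(200):
    # E-step: posterior responsibilities gamma_ik
    dens = pi_ * norm.pdf(x[:, None], mu, sd)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: weighted-average updates of each component's parameters
    nk = gamma.sum(axis=0)
    pi_ = nk / len(x)
    mu = (gamma * x[:, None]).sum(axis=0) / nk
    sd = np.sqrt((gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
print(pi_, mu, sd)                   # near (0.4, 0.6), (-2, 3), (1, 1.5)
```

Each iteration provably does not decrease the observed-data log-likelihood; with well-separated components, the fit recovers the generating parameters.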
1.5 MLE for Common Models
For exponential family distributions $f(x \mid \theta) = h(x)\exp\!\big(\eta(\theta)^\top T(x) - A(\theta)\big)$, the MLE equations reduce to matching sufficient statistics: the MLE satisfies
$$E_{\hat{\theta}}[T(X)] = \frac{1}{n}\sum_{i=1}^{n} T(x_i).$$
Summary of Common MLEs
Normal($\mu, \sigma^2$): $\hat{\mu} = \bar{X}$, $\hat{\sigma}^2 = \frac{1}{n}\sum(X_i - \bar{X})^2$ (biased for $\sigma^2$)
Exponential($\lambda$): $\hat{\lambda} = 1/\bar{X}$
Poisson($\lambda$): $\hat{\lambda} = \bar{X}$
Gamma($\alpha, \beta$): No closed form; requires numerical optimization
Uniform(0, $\theta$): $\hat{\theta} = X_{(n)}$ (maximum order statistic)—note this violates regularity conditions, so standard asymptotics fail. The MLE converges at rate $n$ rather than $\sqrt{n}$.
The normal MLE for $\sigma^2$ illustrates an important subtlety: the MLE divides by $n$, producing a biased estimator, while the unbiased sample variance divides by $n-1$. Although biased, the MLE has smaller mean squared error for $n \geq 2$ because the bias-variance tradeoff favors it. As $n \to \infty$ the difference vanishes.
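The MSE comparison is easy to verify by simulation. A sketch with illustrative settings ($\sigma^2 = 4$, $n = 10$):

```python
import numpy as np

rng = np.random.default_rng(6)
sigma2, n, reps = 4.0, 10, 200_000
x = rng.normal(0.0, 2.0, size=(reps, n))
mle = x.var(axis=1, ddof=0)          # divides by n (the MLE)
s2 = x.var(axis=1, ddof=1)           # divides by n-1 (unbiased)
mse_mle = np.mean((mle - sigma2) ** 2)
mse_s2 = np.mean((s2 - sigma2) ** 2)
print(mse_mle, mse_s2)               # the MLE has the smaller MSE
```

For these settings the theoretical values are $\frac{2n-1}{n^2}\sigma^4 = 3.04$ for the MLE versus $\frac{2\sigma^4}{n-1} \approx 3.56$ for $s^2$.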
Computational Lab: Maximum Likelihood in Practice
We visualize the log-likelihood surface for a normal model, implement Newton-Raphson to find the MLE of a gamma distribution, compute Fisher information numerically, and apply the EM algorithm to fit a Gaussian mixture.
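A sketch of the first lab task: evaluating the normal log-likelihood on a $(\mu, \sigma)$ grid and locating its maximum (a plotting step would draw `ll` as a contour surface). The sample and grid ranges are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(1.0, 2.0, size=400)   # true mu = 1, sigma = 2 (illustrative)

# evaluate the normal log-likelihood (up to a constant) on a (mu, sigma) grid
mus = np.linspace(0.0, 2.0, 101)
sigmas = np.linspace(1.0, 3.0, 101)
M, S = np.meshgrid(mus, sigmas, indexing="ij")
ll = (-len(x) * np.log(S)
      - ((x[:, None, None] - M) ** 2).sum(axis=0) / (2 * S**2))
i, j = np.unravel_index(np.argmax(ll), ll.shape)
print(mus[i], sigmas[j])             # near the sample mean and MLE sd
```

The grid maximum sits at the grid points closest to $\bar{x}$ and the MLE standard deviation, as the closed-form solutions predict.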
Practice Problems
Problem 1: Derive the MLE for both parameters $\mu$ and $\sigma^2$ of a normal distribution from a sample $x_1, \ldots, x_n$.
Solution:
1. Log-likelihood: $\ell(\mu, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2$.
2. Differentiate w.r.t. $\mu$: $\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (x_i - \mu) = 0$.
3. Solve: $\hat{\mu} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}$ (the sample mean).
4. Differentiate w.r.t. $\sigma^2$: $\frac{\partial \ell}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2(\sigma^2)^2}\sum_{i=1}^n (x_i - \mu)^2 = 0$.
5. Solve: $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2$ (the biased sample variance, dividing by $n$ not $n-1$).
6. The MLE $\hat{\sigma}^2$ is biased: $E[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2$. The unbiased estimator $s^2 = \frac{n}{n-1}\hat{\sigma}^2$ divides by $n-1$. Despite being biased, the MLE has smaller MSE for $n \geq 2$.
Problem 2: Derive the MLE for the rate parameter $\lambda$ of a Poisson distribution from observations $x_1, \ldots, x_n$.
Solution:
1. Poisson PMF: $P(X = x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!}$.
2. Log-likelihood: $\ell(\lambda) = \sum_{i=1}^n [x_i \ln\lambda - \lambda - \ln(x_i!)] = \ln\lambda \sum x_i - n\lambda - \sum\ln(x_i!)$.
3. Score: $\frac{d\ell}{d\lambda} = \frac{\sum x_i}{\lambda} - n = 0$.
4. Solve: $\hat{\lambda} = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}$ (the sample mean).
5. Verify it is a maximum: $\frac{d^2\ell}{d\lambda^2} = -\frac{\sum x_i}{\lambda^2} < 0$ for $\lambda > 0$.
6. The MLE is unbiased ($E[\bar{X}] = \lambda$) and equals the sufficient statistic. Its variance is $\text{Var}(\hat{\lambda}) = \lambda/n$, which achieves the Cramér-Rao bound since $I(\lambda) = 1/\lambda$ and $1/(nI) = \lambda/n$.
Problem 3: Derive the MLE for the rate parameter $\lambda$ of an exponential distribution $f(x) = \lambda e^{-\lambda x}$ for $x \geq 0$.
Solution:
1. Log-likelihood: $\ell(\lambda) = \sum_{i=1}^n \ln(\lambda e^{-\lambda x_i}) = n\ln\lambda - \lambda\sum_{i=1}^n x_i$.
2. Score: $\frac{d\ell}{d\lambda} = \frac{n}{\lambda} - \sum x_i = 0$.
3. Solve: $\hat{\lambda} = \frac{n}{\sum x_i} = \frac{1}{\bar{x}}$.
4. Second derivative: $\frac{d^2\ell}{d\lambda^2} = -\frac{n}{\lambda^2} < 0$. Confirmed maximum.
5. The MLE is biased: $E[1/\bar{X}] \neq \lambda$ (Jensen's inequality, since $1/x$ is convex). However, $\hat{\lambda}$ is consistent: $\hat{\lambda} \xrightarrow{P} \lambda$ as $n \to \infty$.
6. Fisher information: $I(\lambda) = 1/\lambda^2$. The Cramér-Rao bound is $\text{Var}(\hat{\lambda}) \geq \lambda^2/n$. The MLE achieves this bound asymptotically.
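The bias in step 5 can be made concrete: since $\bar{X}$ has a gamma distribution, $E[1/\bar{X}] = \frac{n\lambda}{n-1}$ exactly, so the MLE overshoots for small $n$. A simulation sketch with illustrative settings ($\lambda = 2$, $n = 5$):

```python
import numpy as np

rng = np.random.default_rng(8)
lam, n, reps = 2.0, 5, 400_000
x = rng.exponential(scale=1.0 / lam, size=(reps, n))
mle = 1.0 / x.mean(axis=1)
# Jensen's inequality: E[1/xbar] > 1/E[xbar] = lam;
# exactly, E[mle] = n*lam/(n-1) = 2.5 here
print(mle.mean())                    # near 2.5, not the true lam = 2
```

The overshoot shrinks as $n$ grows, consistent with the MLE being consistent but biased in finite samples.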
Problem 4: Compute the Fisher information $I(p)$ for a single observation from a Binomial($n, p$) distribution, and find the Cramér-Rao lower bound on the variance of any unbiased estimator of $p$.
Solution:
1. Binomial log-likelihood for one observation $X$: $\ell(p) = X\ln p + (n-X)\ln(1-p) + \ln\binom{n}{X}$.
2. Score: $S(p) = \frac{X}{p} - \frac{n-X}{1-p}$.
3. Fisher information: $I(p) = -E\!\left[\frac{d^2\ell}{dp^2}\right] = -E\!\left[-\frac{X}{p^2} - \frac{n-X}{(1-p)^2}\right]$.
4. Using $E[X] = np$: $I(p) = \frac{np}{p^2} + \frac{n - np}{(1-p)^2} = \frac{n}{p} + \frac{n}{1-p} = \frac{n}{p(1-p)}$.
5. Cramér-Rao bound: $\text{Var}(\hat{p}) \geq \frac{1}{I(p)} = \frac{p(1-p)}{n}$.
6. The MLE $\hat{p} = X/n$ has $\text{Var}(\hat{p}) = p(1-p)/n$, exactly matching the bound. So $\hat{p}$ is efficient (minimum variance unbiased) for all sample sizes, not just asymptotically. This is a special property of exponential family distributions.
Problem 5: For a Poisson distribution with parameter $\lambda$, find the Fisher information and use the Cramér-Rao bound to show that no unbiased estimator of $e^{-\lambda}$ (the probability of zero counts) can have variance less than $\frac{\lambda e^{-2\lambda}}{n}$.
Solution:
1. Fisher information for Poisson: $I(\lambda) = E\!\left[\left(\frac{X - \lambda}{\lambda}\right)^2\right] = \frac{\text{Var}(X)}{\lambda^2} = \frac{\lambda}{\lambda^2} = \frac{1}{\lambda}$.
2. For $n$ i.i.d. observations: $I_n(\lambda) = n/\lambda$.
3. To apply the Cramér-Rao bound for $g(\lambda) = e^{-\lambda}$, use the generalized form: $\text{Var}(T) \geq \frac{[g'(\lambda)]^2}{I_n(\lambda)}$.
4. $g'(\lambda) = -e^{-\lambda}$, so $[g'(\lambda)]^2 = e^{-2\lambda}$.
5. Cramér-Rao bound: $\text{Var}(T) \geq \frac{e^{-2\lambda}}{n/\lambda} = \frac{\lambda e^{-2\lambda}}{n}$.
6. The MLE of $e^{-\lambda}$ is $e^{-\hat{\lambda}} = e^{-\bar{X}}$ (by the invariance property). This is biased for finite $n$ but is asymptotically efficient, achieving the Cramer-Rao bound as $n \to \infty$. An exactly unbiased estimator is $T = (1 - 1/n)^{\sum X_i}$.
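Both estimators from step 6 can be compared by simulation: the invariance-based MLE $e^{-\bar{X}}$ carries a finite-sample bias, while $(1 - 1/n)^{\sum X_i}$ is exactly unbiased. A sketch with illustrative settings ($\lambda = 2$, $n = 10$):

```python
import numpy as np

rng = np.random.default_rng(9)
lam, n, reps = 2.0, 10, 400_000
x = rng.poisson(lam, size=(reps, n))
target = np.exp(-lam)                       # true P(X = 0), about 0.1353
mle = np.exp(-x.mean(axis=1))               # MLE via invariance; biased for finite n
unbiased = (1 - 1 / n) ** x.sum(axis=1)     # exactly unbiased estimator
print(mle.mean(), unbiased.mean(), target)
```

The MLE's mean sits visibly above the target at this sample size, while the unbiased estimator's mean matches it; as $n \to \infty$ the two estimators coincide.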