Part IV: Advanced Topics | Chapter 3

Nonparametric Methods

Distribution-free inference, resampling methods, and flexible estimation

Historical Context

Nonparametric statistics arose from dissatisfaction with the strong distributional assumptions required by classical parametric methods. Frank Wilcoxon introduced his signed-rank test in 1945, and Henry Mann and Donald Whitney developed the rank-sum test independently in 1947, providing distribution-free alternatives to the t-test. These rank-based methods drew on the insight that the ranks of observations carry substantial information about location differences while being invariant to the underlying distribution.

Kernel density estimation was formalized by Murray Rosenblatt in 1956 and Emanuel Parzen in 1962, providing a smooth nonparametric alternative to histograms. The bootstrap, introduced by Bradley Efron in 1979, revolutionized statistical inference by showing that resampling from the data itself could approximate the sampling distribution of virtually any statistic. Permutation tests, rooted in Fisher's exact test from the 1930s, were made practical by modern computing. Together, these methods provide a flexible toolkit that makes minimal assumptions about the data-generating process.

3.1 Order Statistics and Rank Tests

Given a sample $X_1, \ldots, X_n$, the order statistics$X_{(1)} \leq X_{(2)} \leq \cdots \leq X_{(n)}$ form the sorted sample. The rank of $X_i$ is $R_i = \#\{j : X_j \leq X_i\}$. Under the null hypothesis of exchangeability, all $n!$ permutations of ranks are equally likely, which provides exact null distributions for rank-based tests.

Wilcoxon Signed-Rank Test

For paired differences $D_i = X_i - Y_i$, testing $H_0$: the distribution of $D_i$ is symmetric about zero. Compute:

$$W^+ = \sum_{i=1}^n \text{sign}(D_i) \cdot R_i^*$$

where $R_i^*$ is the rank of $|D_i|$ among$|D_1|, \ldots, |D_n|$. Under $H_0$,$\mathbb{E}[W^+] = n(n+1)/4$ and$\text{Var}(W^+) = n(n+1)(2n+1)/24$.

Mann-Whitney U Test (Wilcoxon Rank-Sum)

For two independent samples of sizes $m$ and $n$, the test statistic is based on the sum of ranks of the first sample in the combined ranking:

$$U = W - \frac{m(m+1)}{2}, \qquad W = \sum_{i=1}^m R_i$$

$U$ counts the number of pairs $(X_i, Y_j)$ where$X_i > Y_j$. Under $H_0$,$\mathbb{E}[U] = mn/2$. The Mann-Whitney test is the most powerful rank test for detecting location shifts and has asymptotic relative efficiency$3/\pi \approx 0.955$ relative to the t-test under normality, and can be much more efficient under heavy-tailed distributions.

3.2 Kernel Density Estimation

The kernel density estimator (KDE) provides a smooth, nonparametric estimate of a probability density function from a random sample.

Definition: Kernel Density Estimator

Given data $X_1, \ldots, X_n$, the KDE with kernel $K$and bandwidth $h > 0$ is:

$$\hat{f}_h(x) = \frac{1}{nh} \sum_{i=1}^n K\!\left(\frac{x - X_i}{h}\right)$$

where $K$ is a non-negative function integrating to 1 (typically the Gaussian kernel $K(u) = (2\pi)^{-1/2}e^{-u^2/2}$).

The bandwidth $h$ controls the bias-variance tradeoff. The mean integrated squared error (MISE) decomposes as:

$$\text{MISE}(\hat{f}_h) = \int \text{Bias}^2[\hat{f}_h(x)] \, dx + \int \text{Var}[\hat{f}_h(x)] \, dx$$

The asymptotically optimal bandwidth minimizing MISE is:

$$h_{\text{opt}} = \left(\frac{R(K)}{n \mu_2(K)^2 R(f'')}\right)^{1/5} = O(n^{-1/5})$$

where $R(g) = \int g^2$ and $\mu_2(K) = \int u^2 K(u) \, du$. Silverman's rule of thumb uses a Gaussian reference:$h = 1.06 \hat{\sigma} n^{-1/5}$. Cross-validation methods (least-squares or likelihood) provide data-driven bandwidth selection without reference distribution assumptions.

3.3 Bootstrap Methods

The bootstrap principle approximates the sampling distribution of a statistic$T_n = t(X_1, \ldots, X_n)$ by resampling from the empirical distribution$\hat{F}_n$.

The Bootstrap Algorithm

1. Draw $X_1^*, \ldots, X_n^*$ iid from$\hat{F}_n$ (sample with replacement).

2. Compute $T_n^* = t(X_1^*, \ldots, X_n^*)$.

3. Repeat $B$ times to obtain$T_n^{*(1)}, \ldots, T_n^{*(B)}$. The empirical distribution of$T_n^{*(b)} - T_n$ approximates the distribution of $T_n - \theta$.

The percentile method uses quantiles of $T_n^{*(b)}$ directly as confidence limits. The basic bootstrap inverts the percentile method using $2T_n - q^*_{1-\alpha/2}$. The most refined approach is the BCa (bias-corrected and accelerated) method:

BCa Confidence Interval

The BCa interval adjusts the percentile boundaries using a bias correction$\hat{z}_0$ and an acceleration constant $\hat{a}$:

$$\alpha_1 = \Phi\!\left(\hat{z}_0 + \frac{\hat{z}_0 + z_{\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{\alpha/2})}\right), \quad \alpha_2 = \Phi\!\left(\hat{z}_0 + \frac{\hat{z}_0 + z_{1-\alpha/2}}{1 - \hat{a}(\hat{z}_0 + z_{1-\alpha/2})}\right)$$

where $\hat{z}_0 = \Phi^{-1}(\#\{T_n^{*(b)} < T_n\}/B)$ corrects for median bias, and $\hat{a} = \sum_i \ell_i^3 / (6(\sum_i \ell_i^2)^{3/2})$uses jackknife influence values $\ell_i$ to correct for skewness. BCa intervals are second-order accurate and transformation-respecting.

3.4 Permutation Tests

Permutation tests derive exact (or approximate via Monte Carlo) p-values by exploiting the exchangeability of observations under the null hypothesis. Unlike rank tests, they can use any test statistic.

Permutation Test Framework

Given two samples $\mathbf{x} = (x_1, \ldots, x_m)$ and$\mathbf{y} = (y_1, \ldots, y_n)$, testing $H_0$: the two samples come from the same distribution.

1. Compute the observed statistic $T_{\text{obs}} = T(\mathbf{x}, \mathbf{y})$(e.g., difference in means).

2. Pool all $N = m + n$ observations and consider all $\binom{N}{m}$ways to partition them into groups of size $m$ and $n$.

3. The p-value is $p = P(T \geq T_{\text{obs}} \mid H_0) = \#\{T_\pi \geq T_{\text{obs}}\} / \binom{N}{m}$.

When $\binom{N}{m}$ is too large for exact enumeration, a Monte Carlo approximation draws $B$ random permutations and estimates$\hat{p} = (\#\{T_\pi^{(b)} \geq T_{\text{obs}}\} + 1) / (B + 1)$. The $+1$ in numerator and denominator ensures the p-value is never zero and accounts for the identity permutation. Permutation tests control the Type I error exactly at level $\alpha$ under exchangeability, without any distributional assumptions. They generalize naturally to multivariate settings, correlation tests, and complex experimental designs.

3.5 Nonparametric Regression

Nonparametric regression estimates the function $m(x) = \mathbb{E}[Y \mid X = x]$without imposing a parametric form. The Nadaraya-Watson kernel regression estimator is:

$$\hat{m}(x) = \frac{\sum_{i=1}^n K_h(x - X_i) Y_i}{\sum_{i=1}^n K_h(x - X_i)}$$

This is a weighted average of $Y_i$ values, with weights proportional to the kernel evaluated at the distance from $x$ to $X_i$. Local polynomial regression generalizes this by fitting a polynomial in a neighborhood of each prediction point, reducing boundary bias.

Regression Splines

A cubic spline with knots $\xi_1 < \cdots < \xi_K$ is a piecewise cubic polynomial that is continuous with continuous first and second derivatives at each knot. The truncated power basis representation is:

$$s(x) = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3 + \sum_{k=1}^K \beta_{3+k}(x - \xi_k)_+^3$$

where $(x - \xi_k)_+ = \max(0, x - \xi_k)$. The smoothing splineminimizes $\sum_{i=1}^n (Y_i - s(X_i))^2 + \lambda \int s''(x)^2 \, dx$, where $\lambda \geq 0$ controls the smoothness. The solution is a natural cubic spline with knots at all data points, and the effective degrees of freedom$\text{df}(\lambda) = \text{tr}(\mathbf{S}_\lambda)$ decreases from$n$ to 2 as $\lambda$ increases from 0 to $\infty$.

3.6 Computational Lab

We implement kernel density estimation with multiple bandwidths, bootstrap confidence intervals, a permutation test for two samples, and kernel regression.

Nonparametric Methods: KDE, Bootstrap, Permutation Tests, and Kernel Regression

Python

script.py170 lines

import numpy as np
np.random.seed(42)

# =============================================
# 1. Kernel Density Estimation
# =============================================
print("=== Kernel Density Estimation ===")

# Generate mixture of two normals
n = 500
mix = np.random.binomial(1, 0.4, n)
data = mix * np.random.normal(2, 0.5, n) + (1 - mix) * np.random.normal(-1, 1.0, n)

# Gaussian kernel
def kde(x, data, h):
    n = len(data)
    return np.mean(np.exp(-0.5 * ((x - data) / h)**2) / (h * np.sqrt(2 * np.pi)))

# True density
def true_density(x):
    return 0.4 * np.exp(-0.5*((x-2)/0.5)**2) / (0.5*np.sqrt(2*np.pi)) + \
           0.6 * np.exp(-0.5*((x+1)/1.0)**2) / (1.0*np.sqrt(2*np.pi))

# Silverman bandwidth
sigma_hat = np.std(data)
h_silv = 1.06 * sigma_hat * n**(-0.2)
print("Silverman bandwidth: %.4f" % h_silv)

# Evaluate at grid points
x_grid = np.linspace(-5, 5, 21)
bandwidths = [0.1, 0.3, h_silv, 1.0]

print("")
print("x       True     h=0.10   h=0.30   h=Silv   h=1.00")
for x in x_grid[::2]:
    vals = [kde(x, data, h) for h in bandwidths]
    print("%5.2f   %.4f   %.4f   %.4f   %.4f   %.4f" % (x, true_density(x), vals[0], vals[1], vals[2], vals[3]))

# MISE approximation
def mise_approx(data, h, x_eval):
    true_vals = np.array([true_density(x) for x in x_eval])
    kde_vals = np.array([kde(x, data, h) for x in x_eval])
    dx = x_eval[1] - x_eval[0]
    return np.sum((kde_vals - true_vals)**2) * dx

x_fine = np.linspace(-5, 5, 200)
print("")
print("Bandwidth   Approx MISE")
for h in [0.1, 0.2, 0.3, h_silv, 0.5, 0.8, 1.0]:
    m = mise_approx(data, h, x_fine)
    print("  %.3f      %.6f" % (h, m))

# =============================================
# 2. Bootstrap Confidence Interval for Median
# =============================================
print("")
print("=== Bootstrap CI for Median ===")
sample = np.random.exponential(2, 80)
theta_hat = np.median(sample)
B = 5000
boot_medians = np.zeros(B)
for b in range(B):
    boot_sample = np.random.choice(sample, size=len(sample), replace=True)
    boot_medians[b] = np.median(boot_sample)

# Percentile interval
alpha = 0.05
ci_perc = (np.percentile(boot_medians, 100*alpha/2),
           np.percentile(boot_medians, 100*(1-alpha/2)))

# Basic bootstrap interval
ci_basic = (2*theta_hat - np.percentile(boot_medians, 100*(1-alpha/2)),
            2*theta_hat - np.percentile(boot_medians, 100*alpha/2))

# BCa interval
from scipy.stats import norm as norm_dist

z0_hat = norm_dist.ppf(np.mean(boot_medians < theta_hat))
# Jackknife influence values
n_jk = len(sample)
jk_vals = np.zeros(n_jk)
for i in range(n_jk):
    jk_sample = np.delete(sample, i)
    jk_vals[i] = np.median(jk_sample)
jk_mean = np.mean(jk_vals)
ell = jk_mean - jk_vals
a_hat = np.sum(ell**3) / (6 * np.sum(ell**2)**1.5)

def bca_quantile(alpha_level):
    z_alpha = norm_dist.ppf(alpha_level)
    numer = z0_hat + z_alpha
    adj = z0_hat + numer / (1 - a_hat * numer)
    return norm_dist.cdf(adj)

alpha1 = bca_quantile(alpha/2)
alpha2 = bca_quantile(1 - alpha/2)
ci_bca = (np.percentile(boot_medians, 100*alpha1),
          np.percentile(boot_medians, 100*alpha2))

print("Sample median: %.4f" % theta_hat)
print("True median (Exp(2)): %.4f" % (2 * np.log(2)))
print("Bootstrap SE:  %.4f" % np.std(boot_medians))
print("")
print("95%% Confidence Intervals:")
print("  Percentile:  (%.4f, %.4f)" % ci_perc)
print("  Basic:       (%.4f, %.4f)" % ci_basic)
print("  BCa:         (%.4f, %.4f)" % ci_bca)
print("  BCa params:  z0=%.4f, a=%.4f" % (z0_hat, a_hat))

# =============================================
# 3. Permutation Test
# =============================================
print("")
print("=== Permutation Test: Two-Sample Means ===")
x_group = np.random.normal(0.5, 1.0, 30)
y_group = np.random.normal(0.0, 1.0, 35)
t_obs = np.mean(x_group) - np.mean(y_group)

pooled = np.concatenate([x_group, y_group])
m_perm = len(x_group)
n_perm = 10000
count_extreme = 0
perm_stats = np.zeros(n_perm)
for b in range(n_perm):
    perm = np.random.permutation(pooled)
    perm_stats[b] = np.mean(perm[:m_perm]) - np.mean(perm[m_perm:])
    if abs(perm_stats[b]) >= abs(t_obs):
        count_extreme += 1

p_perm = (count_extreme + 1) / (n_perm + 1)
print("Observed difference in means: %.4f" % t_obs)
print("Permutation p-value (two-sided): %.4f" % p_perm)
print("Permutation distribution: mean=%.4f, sd=%.4f" % (np.mean(perm_stats), np.std(perm_stats)))

# Compare with t-test
from scipy.stats import ttest_ind
t_stat, p_ttest = ttest_ind(x_group, y_group)
print("t-test p-value: %.4f" % p_ttest)

# =============================================
# 4. Kernel Regression (Nadaraya-Watson)
# =============================================
print("")
print("=== Kernel Regression ===")
n_reg = 200
X_reg = np.sort(np.random.uniform(0, 2*np.pi, n_reg))
Y_reg = np.sin(X_reg) + 0.3 * np.random.normal(size=n_reg)

def nw_kernel_reg(x, X, Y, h):
    weights = np.exp(-0.5 * ((x - X) / h)**2)
    if np.sum(weights) < 1e-10:
        return 0.0
    return np.sum(weights * Y) / np.sum(weights)

x_pred = np.linspace(0.2, 2*np.pi - 0.2, 15)
bandwidths_reg = [0.1, 0.3, 0.5, 1.0]

print("x       True     h=0.10   h=0.30   h=0.50   h=1.00")
for x in x_pred:
    preds = [nw_kernel_reg(x, X_reg, Y_reg, h) for h in bandwidths_reg]
    print("%5.2f   %7.4f  %7.4f  %7.4f  %7.4f  %7.4f" % (x, np.sin(x), preds[0], preds[1], preds[2], preds[3]))

# MSE for each bandwidth
print("")
print("Bandwidth   MSE")
for h in bandwidths_reg:
    preds_all = np.array([nw_kernel_reg(x, X_reg, Y_reg, h) for x in X_reg])
    mse = np.mean((preds_all - np.sin(X_reg))**2)
    print("  %.2f       %.6f" % (h, mse))

Click Run to execute the Python code

Code will be executed with Python 3 on the server

3.7 Summary and Key Takeaways

Rank Tests

The Wilcoxon and Mann-Whitney tests provide distribution-free inference for location differences, with nearly the efficiency of parametric tests under normality and superior efficiency under heavy tails.

Kernel Density Estimation

KDE smoothly estimates densities with the bandwidth controlling the bias-variance tradeoff. The optimal rate is $O(n^{-4/5})$ for MISE, achieved by Silverman's rule or cross-validation.

Bootstrap

Resampling from the empirical distribution approximates the sampling distribution of any statistic. BCa intervals provide second-order accurate, transformation-respecting confidence intervals.

Permutation Tests

Exact p-values under exchangeability, using any test statistic. Monte Carlo approximation makes them practical for any sample size.

Nonparametric Regression

Kernel regression and splines estimate the conditional mean without parametric assumptions, with smoothing parameters controlling complexity.

← Multivariate Analysis Causal Inference →

Share:X Reddit LinkedIn