Part IV: Advanced Topics | Chapter 2

Multivariate Analysis

Joint distributions, multivariate inference, and dimensionality reduction

Historical Context

Multivariate analysis emerged from Francis Galton's and Karl Pearson's work on correlation and regression in the late 19th century. Pearson introduced the correlation matrix, while his student R. A. Fisher developed discriminant analysis in 1936 to classify Iris species from multiple measurements simultaneously. Harold Hotelling generalized Student's t-test to the multivariate setting in 1931 and introduced canonical correlation analysis.

The Wishart distribution, derived by John Wishart in 1928 as the multivariate generalization of the chi-squared distribution, became the cornerstone of multivariate hypothesis testing. S. S. Wilks, Pillai, Lawley, and Hotelling developed the four classical test statistics for MANOVA. Factor analysis originated with Charles Spearman's 1904 work on general intelligence, and was formalized by Thurstone and later placed on rigorous statistical footing by Lawley and Maxwell. Modern computational advances have made these methods practical for high-dimensional data analysis in genomics, neuroimaging, and finance.

2.1 The Multivariate Normal Distribution

The multivariate normal (MVN) distribution is the fundamental building block of multivariate analysis, serving as both a practical model and a theoretical reference distribution.

Definition: Multivariate Normal

A random vector $\mathbf{X} = (X_1, \ldots, X_p)' \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$has density:

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{p/2}|\boldsymbol{\Sigma}|^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})\right)$$

where $\boldsymbol{\mu} \in \mathbb{R}^p$ is the mean vector and$\boldsymbol{\Sigma}$ is the $p \times p$ positive definite covariance matrix.

A key property of the MVN is closure under linear transformations: if$\mathbf{X} \sim N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ and$\mathbf{Y} = A\mathbf{X} + \mathbf{b}$, then$\mathbf{Y} \sim N_q(A\boldsymbol{\mu} + \mathbf{b}, A\boldsymbol{\Sigma}A')$.

Conditional Distributions

Partition $\mathbf{X} = (\mathbf{X}_1', \mathbf{X}_2')'$ with corresponding partitions of $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$. Then:

$$\mathbf{X}_1 \mid \mathbf{X}_2 = \mathbf{x}_2 \sim N\!\left(\boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}(\mathbf{x}_2 - \boldsymbol{\mu}_2),\; \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12}\boldsymbol{\Sigma}_{22}^{-1}\boldsymbol{\Sigma}_{21}\right)$$

The conditional mean is a linear function of $\mathbf{x}_2$, and the conditional covariance does not depend on the conditioning value. This property is the basis for multivariate regression and kriging.

For the MVN, uncorrelatedness implies independence: if$\boldsymbol{\Sigma}_{12} = \mathbf{0}$, then $\mathbf{X}_1$ and$\mathbf{X}_2$ are independent. The Mahalanobis distance$D^2 = (\mathbf{x} - \boldsymbol{\mu})'\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu})$follows a $\chi^2_p$ distribution and provides the natural metric for measuring how far a point is from the center of the distribution.

2.2 The Wishart Distribution

Just as the chi-squared distribution arises from sums of squared normals, the Wishart distribution describes the sampling distribution of the scatter matrix from multivariate normal data.

Definition: Wishart Distribution

If $\mathbf{X}_1, \ldots, \mathbf{X}_n \overset{iid}{\sim} N_p(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then the sample scatter matrix:

$$\mathbf{S} = \sum_{i=1}^{n} (\mathbf{X}_i - \bar{\mathbf{X}})(\mathbf{X}_i - \bar{\mathbf{X}})' \sim W_p(n-1, \boldsymbol{\Sigma})$$

where $W_p(m, \boldsymbol{\Sigma})$ denotes the Wishart distribution with$m$ degrees of freedom and scale matrix $\boldsymbol{\Sigma}$. The Wishart distribution exists when $m \geq p$.

Key properties include: $\mathbb{E}[\mathbf{S}] = m\boldsymbol{\Sigma}$, additivity ($W_p(m_1, \Sigma) + W_p(m_2, \Sigma) \sim W_p(m_1 + m_2, \Sigma)$for independent summands), and the Bartlett decomposition which represents a Wishart matrix as $\mathbf{S} = \mathbf{L}\mathbf{A}\mathbf{A}'\mathbf{L}'$ where$\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}'$ is the Cholesky factor and$\mathbf{A}$ is lower triangular with chi-squared diagonal entries and standard normal off-diagonal entries.

The inverse Wishart distribution, $\mathbf{S}^{-1} \sim W_p^{-1}(m, \boldsymbol{\Sigma}^{-1})$, serves as the conjugate prior for the covariance matrix in Bayesian multivariate analysis, with density:

$$f(\boldsymbol{\Psi}) \propto |\boldsymbol{\Psi}|^{-(m+p+1)/2} \exp\!\left(-\frac{1}{2}\text{tr}(\boldsymbol{\Lambda}\boldsymbol{\Psi}^{-1})\right)$$

2.3 Hotelling's T-Squared and MANOVA

Hotelling's $T^2$ generalizes the univariate t-test to the multivariate setting, testing whether a mean vector equals a hypothesized value.

Hotelling's T-Squared Test

For $H_0: \boldsymbol{\mu} = \boldsymbol{\mu}_0$ based on a sample of size $n$:

$$T^2 = n(\bar{\mathbf{X}} - \boldsymbol{\mu}_0)'\mathbf{S}^{-1}(\bar{\mathbf{X}} - \boldsymbol{\mu}_0)$$

Under $H_0$, $\frac{n-p}{p(n-1)} T^2 \sim F_{p, n-p}$. For two-sample problems, $T^2 = \frac{n_1 n_2}{n_1 + n_2}(\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2)'\mathbf{S}_{\text{pool}}^{-1}(\bar{\mathbf{X}}_1 - \bar{\mathbf{X}}_2)$.

Multivariate Analysis of Variance (MANOVA) extends this to $k$groups. The total sum of squares and cross-products decomposes as$\mathbf{T} = \mathbf{B} + \mathbf{W}$, where $\mathbf{B}$ and$\mathbf{W}$ are the between-group and within-group matrices. Four classical test statistics exist:

Wilks' Lambda: $\Lambda = |\mathbf{W}|/|\mathbf{T}| = \prod_i (1 + \lambda_i)^{-1}$

Pillai's Trace: $V = \text{tr}(\mathbf{B}\mathbf{T}^{-1}) = \sum_i \lambda_i/(1+\lambda_i)$

Hotelling-Lawley Trace: $U = \text{tr}(\mathbf{B}\mathbf{W}^{-1}) = \sum_i \lambda_i$

Roy's Largest Root: $\theta = \lambda_1 / (1 + \lambda_1)$ where $\lambda_i$ are eigenvalues of $\mathbf{W}^{-1}\mathbf{B}$.

Pillai's trace is most robust to departures from assumptions (normality, equal covariance matrices), while Roy's largest root is most powerful when group differences are concentrated in a single dimension.

2.4 Discriminant Analysis

Discriminant analysis constructs classification rules from multivariate data. Under the assumption that each class $k$ has distribution$N_p(\boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$ with prior probability$\pi_k$, the Bayes-optimal classifier assigns observation $\mathbf{x}$to the class maximizing the posterior probability.

Linear Discriminant Analysis (LDA)

When all classes share a common covariance $\boldsymbol{\Sigma}$, the log-posterior is linear in $\mathbf{x}$. The discriminant function for class $k$ is:

$$\delta_k(\mathbf{x}) = \mathbf{x}'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k - \frac{1}{2}\boldsymbol{\mu}_k'\boldsymbol{\Sigma}^{-1}\boldsymbol{\mu}_k + \log \pi_k$$

Assign to $\arg\max_k \delta_k(\mathbf{x})$. The decision boundary between classes $j$ and $k$ is a hyperplane.

Quadratic Discriminant Analysis (QDA)

When covariance matrices differ across classes, the discriminant function becomes quadratic:

$$\delta_k(\mathbf{x}) = -\frac{1}{2}\log|\boldsymbol{\Sigma}_k| - \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_k)'\boldsymbol{\Sigma}_k^{-1}(\mathbf{x} - \boldsymbol{\mu}_k) + \log \pi_k$$

QDA is more flexible but requires estimating $O(p^2)$ parameters per class, making it prone to overfitting in high dimensions.

Fisher's approach to LDA seeks directions $\mathbf{w}$ that maximize the ratio of between-class to within-class variance:$J(\mathbf{w}) = \mathbf{w}'\mathbf{B}\mathbf{w} / \mathbf{w}'\mathbf{W}\mathbf{w}$. The optimal $\mathbf{w}$ is the eigenvector of $\mathbf{W}^{-1}\mathbf{B}$corresponding to the largest eigenvalue. For $K$ classes, at most$\min(K-1, p)$ discriminant directions exist.

2.5 Factor Analysis

Factor analysis models observed variables as linear combinations of a smaller number of unobserved latent factors, explaining correlations among observed variables through shared underlying causes.

The Factor Model

The $m$-factor model for a $p$-dimensional observation is:

$$\mathbf{X} - \boldsymbol{\mu} = \mathbf{L}\mathbf{f} + \boldsymbol{\epsilon}$$

where $\mathbf{L}$ is the $p \times m$ factor loading matrix, $\mathbf{f} \sim N_m(\mathbf{0}, \mathbf{I})$ are the common factors, and $\boldsymbol{\epsilon} \sim N_p(\mathbf{0}, \boldsymbol{\Psi})$are the specific factors with $\boldsymbol{\Psi} = \text{diag}(\psi_1, \ldots, \psi_p)$. This implies:

$$\boldsymbol{\Sigma} = \mathbf{L}\mathbf{L}' + \boldsymbol{\Psi}$$

The communality of variable $j$ is$h_j^2 = \sum_{k=1}^m l_{jk}^2$, representing the proportion of variance explained by the common factors. The uniqueness is $\psi_j = \sigma_{jj} - h_j^2$.

Estimation proceeds by maximum likelihood or principal factor methods. The loading matrix is not uniquely determined: for any orthogonal matrix $\mathbf{T}$,$\mathbf{L}^* = \mathbf{L}\mathbf{T}$ gives the same covariance structure. Factor rotation (varimax, promax) exploits this invariance to obtain interpretable loadings. The number of factors is chosen by scree plots, parallel analysis, or likelihood ratio tests comparing $m$ vs. $m+1$ factors.

2.6 Computational Lab

We sample from multivariate normals, estimate covariance matrices, perform linear discriminant analysis, and demonstrate factor analysis with rotation.

Multivariate Analysis: MVN, LDA, and Factor Analysis

Python

script.py171 lines

import numpy as np
np.random.seed(42)

# =============================================
# 1. Multivariate Normal Sampling and Properties
# =============================================
print("=== Multivariate Normal Distribution ===")
p = 3
mu = np.array([1.0, 2.0, 3.0])
Sigma = np.array([[2.0, 0.8, 0.5],
                  [0.8, 1.5, 0.3],
                  [0.5, 0.3, 1.0]])

# Sample via Cholesky decomposition
L = np.linalg.cholesky(Sigma)
n = 5000
Z = np.random.normal(size=(p, n))
X = (L @ Z).T + mu

print("True mean:    ", mu)
print("Sample mean:  ", np.round(np.mean(X, axis=0), 4))
print("")
print("True covariance:")
for row in Sigma:
    print("  ", ["%.4f" % v for v in row])
print("Sample covariance:")
S = np.cov(X.T)
for row in S:
    print("  ", ["%.4f" % v for v in row])

# Conditional distribution: X1 | X2=x2, X3=x3
print("")
print("=== Conditional Distribution X1 | X2, X3 ===")
mu1, mu2 = mu[:1], mu[1:]
S11 = Sigma[:1, :1]
S12 = Sigma[:1, 1:]
S22 = Sigma[1:, 1:]
S22_inv = np.linalg.inv(S22)

x2_given = np.array([2.5, 3.5])
cond_mean = mu1 + S12 @ S22_inv @ (x2_given - mu2)
cond_var = S11 - S12 @ S22_inv @ S12.T
print("Given X2=2.5, X3=3.5:")
print("  Conditional mean of X1: %.4f" % cond_mean[0])
print("  Conditional variance:   %.4f" % cond_var[0, 0])

# Verify by subsetting samples
mask = (np.abs(X[:, 1] - 2.5) < 0.2) & (np.abs(X[:, 2] - 3.5) < 0.2)
if np.sum(mask) > 10:
    print("  Empirical cond. mean:   %.4f (from %d samples)" % (np.mean(X[mask, 0]), np.sum(mask)))

# Mahalanobis distance
D2 = np.sum((X - mu) @ np.linalg.inv(Sigma) * (X - mu), axis=1)
print("")
print("=== Mahalanobis Distance (should be chi-sq with df=%d) ===" % p)
print("  Empirical mean:      %.4f (expected: %.1f)" % (np.mean(D2), p))
print("  Empirical variance:  %.4f (expected: %.1f)" % (np.var(D2), 2*p))
print("  Fraction < chi2(.95): %.4f (expected: 0.95)" % (np.mean(D2 < 7.815)))

# =============================================
# 2. Linear Discriminant Analysis
# =============================================
print("")
print("=== Linear Discriminant Analysis (2 classes) ===")

# Generate two classes
n1, n2 = 200, 200
mu1_cls = np.array([0.0, 0.0])
mu2_cls = np.array([2.0, 1.0])
Sigma_cls = np.array([[1.5, 0.5], [0.5, 1.0]])
L_cls = np.linalg.cholesky(Sigma_cls)

X1 = (L_cls @ np.random.normal(size=(2, n1))).T + mu1_cls
X2 = (L_cls @ np.random.normal(size=(2, n2))).T + mu2_cls

# Pooled covariance
S_pool = ((n1-1)*np.cov(X1.T) + (n2-1)*np.cov(X2.T)) / (n1+n2-2)
S_pool_inv = np.linalg.inv(S_pool)
xbar1 = np.mean(X1, axis=0)
xbar2 = np.mean(X2, axis=0)

# Fisher discriminant direction
w = S_pool_inv @ (xbar1 - xbar2)
w = w / np.linalg.norm(w)
print("Fisher discriminant direction: [%.4f, %.4f]" % (w[0], w[1]))

# Classification on training data
def lda_classify(x, m1, m2, Sinv, pi1=0.5, pi2=0.5):
    d1 = x @ Sinv @ m1 - 0.5 * m1 @ Sinv @ m1 + np.log(pi1)
    d2 = x @ Sinv @ m2 - 0.5 * m2 @ Sinv @ m2 + np.log(pi2)
    return 1 if d1 > d2 else 2

correct = 0
for x in X1:
    if lda_classify(x, xbar1, xbar2, S_pool_inv) == 1:
        correct += 1
for x in X2:
    if lda_classify(x, xbar1, xbar2, S_pool_inv) == 2:
        correct += 1

accuracy = correct / (n1 + n2)
print("Training accuracy: %.4f" % accuracy)

# Theoretical Bayes error
delta = (xbar1 - xbar2) @ S_pool_inv @ (xbar1 - xbar2)
print("Mahalanobis distance between classes: %.4f" % np.sqrt(delta))

# =============================================
# 3. Factor Analysis (2 factors from 5 variables)
# =============================================
print("")
print("=== Factor Analysis ===")
# True model: X = L*f + epsilon
p_fa, m_fa = 5, 2
L_true = np.array([[0.9, 0.1],
                    [0.8, 0.2],
                    [0.7, 0.3],
                    [0.1, 0.9],
                    [0.2, 0.8]])
Psi_true = np.diag([0.18, 0.32, 0.42, 0.18, 0.32])
Sigma_fa = L_true @ L_true.T + Psi_true

# Generate data
n_fa = 1000
L_chol = np.linalg.cholesky(Sigma_fa)
X_fa = (L_chol @ np.random.normal(size=(p_fa, n_fa))).T

# Estimate via eigendecomposition of correlation matrix
R = np.corrcoef(X_fa.T)
eigvals, eigvecs = np.linalg.eigh(R)
idx = np.argsort(eigvals)[::-1]
eigvals = eigvals[idx]
eigvecs = eigvecs[:, idx]

print("Eigenvalues of correlation matrix:")
for i, ev in enumerate(eigvals):
    print("  Factor %d: %.4f (%.1f%% variance)" % (i+1, ev, 100*ev/sum(eigvals)))

# Extract 2 factors (principal factor method)
L_est = eigvecs[:, :m_fa] * np.sqrt(eigvals[:m_fa])

# Varimax rotation
def varimax(L, max_iter=100, tol=1e-6):
    p, m = L.shape
    T = np.eye(m)
    for _ in range(max_iter):
        B = L @ T
        # Varimax criterion gradient
        u, s, vt = np.linalg.svd(L.T @ (B**3 - B @ np.diag(np.sum(B**2, axis=0)) / p))
        T_new = u @ vt
        if np.max(np.abs(T_new - T)) < tol:
            break
        T = T_new
    return L @ T, T

L_rot, T_rot = varimax(L_est)

print("")
print("Rotated factor loadings:")
print("  Var   Factor1   Factor2")
for i in range(p_fa):
    print("   X%d   %7.4f   %7.4f" % (i+1, L_rot[i, 0], L_rot[i, 1]))

# Communalities
h2 = np.sum(L_rot**2, axis=1)
print("")
print("Communalities:")
for i in range(p_fa):
    print("  X%d: %.4f" % (i+1, h2[i]))
print("Total variance explained: %.4f / %d = %.1f%%" % (np.sum(h2), p_fa, 100*np.sum(h2)/p_fa))

Click Run to execute the Python code

Code will be executed with Python 3 on the server

2.7 Summary and Key Takeaways

Multivariate Normal

The MVN is characterized by its mean vector and covariance matrix. Conditional distributions are normal with linear conditional means, and uncorrelated components are independent.

Wishart Distribution

The Wishart distribution is the sampling distribution of the scatter matrix from MVN data, serving as the foundation of multivariate hypothesis testing.

Hotelling's T-Squared and MANOVA

$T^2$ generalizes the t-test to vector means. MANOVA decomposes multivariate variability into between- and within-group components with four classical test statistics.

Discriminant Analysis

LDA assumes equal covariance matrices yielding linear boundaries; QDA allows unequal covariances yielding quadratic boundaries. Fisher's criterion maximizes class separation.

Factor Analysis

Factor analysis explains correlations through latent factors. The loading matrix is rotationally non-unique; varimax rotation enhances interpretability by seeking simple structure.

← Time Series Nonparametric Methods →

Share:X Reddit LinkedIn