Part II: Inference | Chapter 1

Expectation & Variance

Moments, moment generating functions, covariance, and conditional expectation

Historical Context

The concept of mathematical expectation traces back to Christiaan Huygens, who introduced the notion of “expected value” in his 1657 treatise De Ratiociniis in Ludo Aleae. The idea of variance as a measure of dispersion was formalized much later by Ronald Fisher in 1918, who also introduced the term “variance” itself. Pafnuty Chebyshev established his celebrated inequality in 1867, providing a universal bound on tail probabilities using only the first two moments. The moment generating function technique was developed by Pierre-Simon Laplace, while the tower property of conditional expectation became a cornerstone of modern probability theory through the work of Andrey Kolmogorov and Joseph Doob in the 1930s and 1940s.

1.1 Expectation

The expectation (or expected value) of a random variable provides a single number summarizing the “center” of its distribution, weighted by probability.

Discrete Case

If $X$ takes countably many values $x_1, x_2, \ldots$ with PMF $p(x_i) = P(X = x_i)$, then

$$E[X] = \sum_{i} x_i \, p(x_i)$$

provided the sum converges absolutely. If $\sum_i |x_i| p(x_i) = \infty$, the expectation does not exist; for example, the PMF $p(n) \propto 1/n^2$ on $n = \pm 1, \pm 2, \ldots$ has no expectation, since $\sum_n |n| \, p(n)$ diverges.

Continuous Case

If $X$ has PDF $f(x)$, then

$$E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx$$

Again, existence requires absolute integrability: $\int |x| f(x) \, dx < \infty$. The Cauchy distribution is the classic example where this fails, so it has no mean.
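Both formulas are easy to check numerically. A minimal stdlib-only sketch (the setup is ours, not from the text): the fair die illustrates the discrete sum exactly, and a midpoint Riemann sum approximates the mean of the Exponential(1) density $f(x) = e^{-x}$ on $[0, \infty)$, whose true mean is 1.

```python
import math
from fractions import Fraction

# Discrete: fair six-sided die, p(x) = 1/6 for x in {1, ..., 6}
die_mean = sum(x * Fraction(1, 6) for x in range(1, 7))
print(die_mean)  # 7/2

# Continuous: Exponential(1) has f(x) = e^{-x} on [0, inf) and mean 1.
# Approximate the integral of x * f(x) with a midpoint Riemann sum,
# truncating the (negligible) tail beyond x = 50.
dx = 1e-3
exp_mean = sum((i + 0.5) * dx * math.exp(-(i + 0.5) * dx) * dx
               for i in range(int(50 / dx)))
print(exp_mean)  # ≈ 1
```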

Linearity of Expectation

For any random variables $X, Y$ and constants $a, b$:

$$E[aX + bY] = aE[X] + bE[Y]$$

This holds regardless of whether $X$ and $Y$ are independent. Linearity extends to any finite (or countable, under absolute convergence) linear combination.
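A quick sanity check that independence is not needed (the example is ours): take $Y = 7 - X$, which is completely determined by the die roll $X$, and compute $E[X + Y]$ directly from the joint outcomes.

```python
from fractions import Fraction

# Fair die X and Y = 7 - X: perfectly (negatively) dependent on X
pmf = {x: Fraction(1, 6) for x in range(1, 7)}
E_X = sum(x * p for x, p in pmf.items())
E_Y = sum((7 - x) * p for x, p in pmf.items())
E_sum = sum((x + (7 - x)) * p for x, p in pmf.items())  # E[X + Y] directly
print(E_X, E_Y, E_sum)  # 7/2 7/2 7 — linearity holds despite total dependence
```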

Law of the Unconscious Statistician (LOTUS)

To compute $E[g(X)]$ one does not need the distribution of $g(X)$:

$$E[g(X)] = \begin{cases} \sum_x g(x) p(x) & \text{(discrete)} \\ \int_{-\infty}^{\infty} g(x) f(x) \, dx & \text{(continuous)} \end{cases}$$

LOTUS is essential for computing moments: set $g(x) = x^k$ to get $E[X^k]$, the $k$-th moment about the origin.
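For instance, the second moment of a fair die follows from LOTUS with $g(x) = x^2$, without ever finding the distribution of $X^2$ (a minimal sketch):

```python
from fractions import Fraction

# LOTUS with g(x) = x^2: weight each g(x) by p(x); the distribution
# of X^2 itself is never needed
pmf = {x: Fraction(1, 6) for x in range(1, 7)}
second_moment = sum(x**2 * p for x, p in pmf.items())
print(second_moment)  # 91/6
```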

1.2 Variance and Chebyshev's Inequality

Definition and Properties

The variance measures the average squared deviation from the mean:

$$\text{Var}(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2$$

Key properties:

  • $\text{Var}(X) \ge 0$ with equality iff $X$ is a.s. constant.
  • $\text{Var}(aX + b) = a^2 \text{Var}(X)$ for constants $a, b$.
  • If $X, Y$ are independent: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$.
  • The standard deviation is $\sigma = \sqrt{\text{Var}(X)}$, in the same units as $X$.

Chebyshev's Inequality

For any random variable $X$ with finite mean $\mu$ and variance $\sigma^2$, and any $k > 0$:

$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}$$

Equivalently, $P(|X - \mu| \ge t) \le \sigma^2 / t^2$. This bound is distribution-free and remarkably general, though often conservative. For $k \ge 1$ it is tight: the three-point distribution placing mass $1/(2k^2)$ at each of $\mu \pm k\sigma$ and the remaining mass $1 - 1/k^2$ at $\mu$ attains equality.
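The bound is easy to probe by simulation. The sketch below (our own setup, stdlib only) uses Uniform(0,1) draws, where $\mu = 1/2$ and $\sigma^2 = 1/12$; note that for $k \ge \sqrt{3}$ the uniform's true tail probability is exactly 0, which shows how loose the bound can be.

```python
import random

random.seed(0)
n = 100_000
samples = [random.random() for _ in range(n)]  # Uniform(0,1)
mu, sigma = 0.5, (1 / 12) ** 0.5

for k in (1.5, 2, 3):
    empirical = sum(abs(x - mu) >= k * sigma for x in samples) / n
    print(f"k={k}: empirical tail {empirical:.4f} <= bound {1 / k**2:.4f}")
```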

1.3 Higher Moments and Moment Generating Functions

Moment Generating Function (MGF)

The MGF of $X$ is defined as

$$M_X(t) = E[e^{tX}] = \sum_{k=0}^{\infty} \frac{E[X^k]}{k!} t^k$$

provided the expectation exists in a neighborhood of $t = 0$. Moments are extracted via $E[X^k] = M_X^{(k)}(0)$. The MGF uniquely determines the distribution (when it exists). If $X \perp Y$, then $M_{X+Y}(t) = M_X(t) M_Y(t)$.
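Moments can also be recovered from an MGF numerically, by finite differences at $t = 0$. A sketch using the Bernoulli($p$) MGF $M(t) = 1 - p + pe^t$; since $X^2 = X$ for a Bernoulli variable, both the first and second moments equal $p$:

```python
import math

p = 0.3

def M(t):
    return 1 - p + p * math.exp(t)  # MGF of Bernoulli(p)

h = 1e-4
first = (M(h) - M(-h)) / (2 * h)           # central difference ≈ M'(0) = E[X]
second = (M(h) - 2 * M(0) + M(-h)) / h**2  # ≈ M''(0) = E[X^2]
print(first, second)  # both ≈ 0.3
```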

Skewness and Kurtosis

The skewness measures asymmetry and the kurtosis measures tail heaviness:

$$\text{Skew}(X) = E\!\left[\left(\frac{X - \mu}{\sigma}\right)^3\right] = \frac{\mu_3}{\sigma^3}$$
$$\text{Kurt}(X) = E\!\left[\left(\frac{X - \mu}{\sigma}\right)^4\right] = \frac{\mu_4}{\sigma^4}$$

The normal distribution has skewness 0 and kurtosis 3. The excess kurtosis is $\text{Kurt}(X) - 3$, so that the normal has excess kurtosis 0. Distributions with positive excess kurtosis (“leptokurtic”) have heavier tails than the normal.
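Sample versions of these quantities are straightforward to compute. In the sketch below (our own, using seeded Gaussian draws) the estimates should land near the normal's theoretical values of 0 and 3:

```python
import random

random.seed(42)
xs = [random.gauss(0, 1) for _ in range(200_000)]

n = len(xs)
mu = sum(xs) / n
sigma = (sum((x - mu) ** 2 for x in xs) / n) ** 0.5
skew = sum(((x - mu) / sigma) ** 3 for x in xs) / n
kurt = sum(((x - mu) / sigma) ** 4 for x in xs) / n
print(f"skewness ≈ {skew:.3f}, kurtosis ≈ {kurt:.3f}, excess ≈ {kurt - 3:.3f}")
```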

1.4 Covariance and Correlation

Covariance

The covariance of $X$ and $Y$ is

$$\text{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - E[X]E[Y]$$

Key properties:

  • $\text{Cov}(X, X) = \text{Var}(X)$
  • $\text{Cov}(X, Y) = \text{Cov}(Y, X)$ (symmetry)
  • $\text{Cov}(aX + b, cY + d) = ac \, \text{Cov}(X, Y)$ (bilinearity)
  • $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)$
  • Independence implies $\text{Cov}(X, Y) = 0$, but the converse is false.
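The variance-of-a-sum identity can be verified exactly on a small finite population (equal weights, population formulas; the toy data are our own):

```python
X = [1, 2, 3, 4]
Y = [2, 1, 4, 3]

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return mean([(x - m) ** 2 for x in v])

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return mean([(a - mu) * (b - mv) for a, b in zip(u, v)])

S = [x + y for x, y in zip(X, Y)]
lhs = var(S)
rhs = var(X) + var(Y) + 2 * cov(X, Y)
print(lhs, rhs)  # 4.0 4.0
```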

Pearson Correlation Coefficient

The correlation normalizes covariance to the range $[-1, 1]$:

$$\rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

$|\rho| = 1$ iff $Y = aX + b$ a.s. for some constants with $a \ne 0$. Correlation measures linear dependence only; uncorrelated random variables can still be strongly dependent (e.g., $X \sim N(0,1)$ and $Y = X^2$ are uncorrelated but not independent).
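A fully discrete version of the same phenomenon can be computed exactly (the example is ours): take $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$.

```python
from fractions import Fraction

# X uniform on {-1, 0, 1}; Y = X^2 is a deterministic function of X
support = [-1, 0, 1]
p = Fraction(1, 3)
E_X = sum(x * p for x in support)          # 0
E_Y = sum(x**2 * p for x in support)       # 2/3
E_XY = sum(x * x**2 * p for x in support)  # E[X^3] = 0 by symmetry
cov = E_XY - E_X * E_Y
print(cov)  # 0: uncorrelated, yet Y is completely determined by X
```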

1.5 Conditional Expectation

Definition

The conditional expectation of $X$ given $Y = y$ is

$$E[X \mid Y = y] = \begin{cases} \sum_x x \, p_{X|Y}(x \mid y) & \text{(discrete)} \\ \int x \, f_{X|Y}(x \mid y) \, dx & \text{(continuous)} \end{cases}$$

As a function of $Y$, $E[X \mid Y]$ is itself a random variable, representing the best prediction of $X$ given $Y$ in the mean-squared-error sense.

Law of Total Expectation (Tower Property)

The tower property (also called the law of iterated expectations) states:

$$E[X] = E[E[X \mid Y]]$$

More generally, for $\sigma$-algebras $\mathcal{G} \subseteq \mathcal{H}$: $E[E[X \mid \mathcal{H}] \mid \mathcal{G}] = E[X \mid \mathcal{G}]$.

There is also a law of total variance:

$$\text{Var}(X) = E[\text{Var}(X \mid Y)] + \text{Var}(E[X \mid Y])$$

This decomposes total variability into the average “within-group” variance and the “between-group” variance.
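Both identities can be checked exactly on a small discrete joint model (a sketch; the model is our own): let $Y$ be a fair coin with $X \mid Y{=}0$ uniform on $\{0, 2\}$ and $X \mid Y{=}1$ uniform on $\{10, 14\}$.

```python
from fractions import Fraction

# Y is a fair coin; X | Y=0 ~ uniform{0, 2}, X | Y=1 ~ uniform{10, 14}
cond = {0: [0, 2], 1: [10, 14]}
pY = {0: Fraction(1, 2), 1: Fraction(1, 2)}

def E(vals):
    return sum(Fraction(v) for v in vals) / len(vals)

def V(vals):
    m = E(vals)
    return sum((Fraction(v) - m) ** 2 for v in vals) / len(vals)

# Tower property: E[X] = E[E[X | Y]]
cond_means = {y: E(vals) for y, vals in cond.items()}
E_X_tower = sum(pY[y] * cond_means[y] for y in pY)

# Law of total variance: Var(X) = E[Var(X|Y)] + Var(E[X|Y])
within = sum(pY[y] * V(vals) for y, vals in cond.items())
between = sum(pY[y] * (cond_means[y] - E_X_tower) ** 2 for y in pY)

# Direct computation over the full joint distribution
outcomes = [(v, pY[y] / len(vals)) for y, vals in cond.items() for v in vals]
E_X = sum(v * p for v, p in outcomes)
Var_X = sum((v - E_X) ** 2 * p for v, p in outcomes)

print(E_X, E_X_tower)           # 13/2 13/2
print(Var_X, within + between)  # 131/4 131/4
```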

Computational Laboratory

We compute moments, verify Chebyshev's inequality empirically, examine covariance structure, and illustrate conditional expectation via simulation.

Moments, Chebyshev, Covariance & Conditional Expectation
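The lab's script is not reproduced in this extract; a compact stdlib-only sketch covering the four items (the distributions and parameters are our choices, not necessarily the original script's) might look like:

```python
import random

random.seed(1)
N = 100_000

# 1. Moments of an Exponential(1) sample: mean and variance should both be ~1
xs = [random.expovariate(1.0) for _ in range(N)]
mean = sum(xs) / N
var = sum((x - mean) ** 2 for x in xs) / N
print(f"Exponential(1): mean ≈ {mean:.3f}, var ≈ {var:.3f}")

# 2. Chebyshev: empirical tail probability vs. the 1/k^2 bound
sd = var ** 0.5
for k in (2, 3):
    tail = sum(abs(x - mean) >= k * sd for x in xs) / N
    print(f"  k={k}: tail ≈ {tail:.4f} <= {1 / k**2:.4f}")

# 3. Covariance structure: X ~ N(0,1), Y = X + noise, so Cov(X, Y) = Var(X) = 1
pairs = [(z, z + random.gauss(0, 0.5))
         for z in (random.gauss(0, 1) for _ in range(N))]
mx = sum(a for a, _ in pairs) / N
my = sum(b for _, b in pairs) / N
cov = sum((a - mx) * (b - my) for a, b in pairs) / N
print(f"Cov(X, X + noise) ≈ {cov:.3f} (theory: 1)")

# 4. Conditional expectation: E[X] = E[E[X | Y]] with Y a fair coin,
#    X | Y=0 ~ Uniform(0,1) and X | Y=1 ~ Uniform(2,4)
draws = []
for _ in range(N):
    y = random.randint(0, 1)
    draws.append(random.uniform(0, 1) if y == 0 else random.uniform(2, 4))
print(f"E[X] ≈ {sum(draws) / N:.3f} (tower: 0.5*0.5 + 0.5*3 = 1.75)")
```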

