Expectation & Variance
Moments, moment generating functions, covariance, and conditional expectation
Historical Context
The concept of mathematical expectation traces back to Christiaan Huygens, who introduced the notion of “expected value” in his 1657 treatise De Ratiociniis in Ludo Aleae. The idea of variance as a measure of dispersion was formalized much later by Ronald Fisher in 1918, who also introduced the term “variance” itself. Pafnuty Chebyshev established his celebrated inequality in 1867, providing a universal bound on tail probabilities using only the first two moments. The moment generating function technique was developed by Pierre-Simon Laplace, while the tower property of conditional expectation became a cornerstone of modern probability theory through the work of Andrey Kolmogorov and Joseph Doob in the 1930s and 1940s.
1.1 Expectation
The expectation (or expected value) of a random variable provides a single number summarizing the “center” of its distribution, weighted by probability.
Discrete Case
If $X$ takes countably many values $x_1, x_2, \ldots$ with PMF $p(x_i) = P(X = x_i)$, then
$$E[X] = \sum_i x_i \, p(x_i),$$
provided the sum converges absolutely. If $\sum_i |x_i| p(x_i) = \infty$, the expectation does not exist (e.g., the payoff in the St. Petersburg paradox).
Continuous Case
If $X$ has PDF $f(x)$, then
$$E[X] = \int_{-\infty}^{\infty} x \, f(x) \, dx.$$
Again, existence requires absolute integrability: $\int |x| f(x) \, dx < \infty$. The Cauchy distribution fails this condition, so its mean is undefined.
Linearity of Expectation
For any random variables $X, Y$ and constants $a, b$:
$$E[aX + bY] = aE[X] + bE[Y].$$
This holds regardless of whether $X$ and $Y$ are independent. Linearity extends to any finite (or countable, under absolute convergence) linear combination.
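A quick numerical sketch of this point (the construction below, with $Y = \sin(X)$, is purely illustrative): linearity holds even when the samples are strongly dependent.

```python
import numpy as np

# Sketch: E[2X + 3Y] = 2E[X] + 3E[Y] holds even though Y is a
# deterministic function of X (illustrative construction).
rng = np.random.default_rng(6)
x = rng.uniform(0, 1, size=1_000_000)
y = np.sin(x)  # dependent on X by construction

lhs = np.mean(2 * x + 3 * y)
rhs = 2 * np.mean(x) + 3 * np.mean(y)
print(lhs, rhs)  # equal up to floating-point error
```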
Law of the Unconscious Statistician (LOTUS)
To compute $E[g(X)]$ one does not need the distribution of $g(X)$:
$$E[g(X)] = \sum_i g(x_i) \, p(x_i) \quad \text{(discrete)}, \qquad E[g(X)] = \int_{-\infty}^{\infty} g(x) \, f(x) \, dx \quad \text{(continuous)}.$$
LOTUS is essential for computing moments: set $g(x) = x^k$ to get $E[X^k]$, the $k$-th moment about the origin.
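As a concrete sketch, here is LOTUS applied to a fair six-sided die (an illustrative choice) to extract the first two moments directly from the PMF:

```python
import numpy as np

# Sketch: LOTUS for a fair six-sided die.
# E[g(X)] = sum_i g(x_i) p(x_i) -- no need for the distribution of g(X).
values = np.arange(1, 7)   # faces 1..6
pmf = np.full(6, 1 / 6)    # uniform PMF

mean = np.sum(values * pmf)               # E[X] with g(x) = x
second_moment = np.sum(values**2 * pmf)   # E[X^2] with g(x) = x^2

print(mean)           # 3.5
print(second_moment)  # 91/6 ≈ 15.1667
```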
1.2 Variance and Chebyshev's Inequality
Definition and Properties
The variance measures the average squared deviation from the mean $\mu = E[X]$:
$$\text{Var}(X) = E[(X - \mu)^2] = E[X^2] - (E[X])^2.$$
Key properties:
- $\text{Var}(X) \ge 0$ with equality iff $X$ is a.s. constant.
- $\text{Var}(aX + b) = a^2 \text{Var}(X)$ for constants $a, b$.
- If $X, Y$ are independent: $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$.
- The standard deviation is $\sigma = \sqrt{\text{Var}(X)}$, in the same units as $X$.
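A short simulation check of the scaling property (the values of $a$ and $b$ below are arbitrary): the shift $b$ drops out and the scale enters squared.

```python
import numpy as np

# Sketch: Var(aX + b) = a^2 Var(X) -- the shift b has no effect.
rng = np.random.default_rng(7)
x = rng.standard_normal(200_000)
a, b = -3.0, 10.0

print(np.var(a * x + b))   # ≈ 9 * Var(X)
print(a**2 * np.var(x))    # same value up to floating-point error
```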
Chebyshev's Inequality
For any random variable $X$ with finite mean $\mu$ and variance $\sigma^2$, and any $k > 0$:
$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}.$$
Equivalently, $P(|X - \mu| \ge t) \le \sigma^2 / t^2$. This bound is distribution-free and remarkably general, though often conservative. It is tight for a three-point distribution that places mass $1/(2k^2)$ at each of $\mu \pm k\sigma$ and the remaining mass at $\mu$.
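An empirical illustration of how conservative the bound can be (using a standard normal, an illustrative choice): the exact tail is far below $1/k^2$.

```python
import numpy as np

# Sketch: compare Chebyshev's bound 1/k^2 with the empirical tail
# probability for a standard normal (mu = 0, sigma = 1).
rng = np.random.default_rng(0)
samples = rng.standard_normal(1_000_000)

for k in (1.5, 2.0, 3.0):
    empirical = np.mean(np.abs(samples) >= k)  # P(|X - mu| >= k sigma)
    bound = 1 / k**2                           # Chebyshev's bound
    print(f"k={k}: empirical tail {empirical:.4f} <= bound {bound:.4f}")
```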
1.3 Higher Moments and Moment Generating Functions
Moment Generating Function (MGF)
The MGF of $X$ is defined as
$$M_X(t) = E[e^{tX}],$$
provided the expectation exists in a neighborhood of $t = 0$. Moments are extracted via $E[X^k] = M_X^{(k)}(0)$. The MGF uniquely determines the distribution (when it exists). If $X \perp Y$, then $M_{X+Y}(t) = M_X(t) M_Y(t)$.
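As a sketch of moment extraction, we can differentiate a known MGF numerically at $t = 0$. The Exponential(1) distribution, with $M_X(t) = 1/(1-t)$ for $t < 1$, is an illustrative choice; its $k$-th moment is $k!$.

```python
# Sketch: moments from the MGF by numerical differentiation at t = 0.
# For X ~ Exponential(1), M_X(t) = 1/(1 - t), so E[X] = 1 and E[X^2] = 2.
def M(t):
    return 1 / (1 - t)

h = 1e-4
first = (M(h) - M(-h)) / (2 * h)            # central difference ≈ M'(0)
second = (M(h) - 2 * M(0) + M(-h)) / h**2   # second difference ≈ M''(0)
print(f"E[X]   ≈ {first:.4f}")   # ≈ 1
print(f"E[X^2] ≈ {second:.4f}")  # ≈ 2
```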
Skewness and Kurtosis
The skewness measures asymmetry and the kurtosis measures tail heaviness:
$$\text{Skew}(X) = E\!\left[\left(\frac{X - \mu}{\sigma}\right)^3\right], \qquad \text{Kurt}(X) = E\!\left[\left(\frac{X - \mu}{\sigma}\right)^4\right].$$
The normal distribution has skewness 0 and kurtosis 3. The excess kurtosis is $\text{Kurt}(X) - 3$, so that the normal has excess kurtosis 0. Distributions with positive excess kurtosis (“leptokurtic”) have heavier tails than the normal.
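A simulation sketch using the Exponential(1) distribution (an illustrative choice whose theoretical skewness is 2 and kurtosis is 9, i.e., excess kurtosis 6):

```python
import numpy as np

# Sketch: sample skewness and kurtosis as standardized third and fourth
# central moments. Exponential(1) is right-skewed and leptokurtic.
rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=1_000_000)

mu, sigma = x.mean(), x.std()
skewness = np.mean(((x - mu) / sigma) ** 3)   # theoretical value: 2
kurtosis = np.mean(((x - mu) / sigma) ** 4)   # theoretical value: 9
print(f"skewness ≈ {skewness:.2f}, kurtosis ≈ {kurtosis:.2f}")
```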
1.4 Covariance and Correlation
Covariance
The covariance of $X$ and $Y$ is
$$\text{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY] - E[X]E[Y].$$
Key properties:
- $\text{Cov}(X, X) = \text{Var}(X)$
- $\text{Cov}(X, Y) = \text{Cov}(Y, X)$ (symmetry)
- $\text{Cov}(aX + b, cY + d) = ac \, \text{Cov}(X, Y)$ (bilinearity)
- $\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\text{Cov}(X, Y)$
- Independence implies $\text{Cov}(X, Y) = 0$, but the converse is false.
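The variance-of-a-sum identity can be checked directly on correlated samples (the construction $Y = 0.5X + \text{noise}$ below is illustrative):

```python
import numpy as np

# Sketch: check Var(X + Y) = Var(X) + Var(Y) + 2 Cov(X, Y)
# on deliberately correlated samples.
rng = np.random.default_rng(4)
x = rng.standard_normal(500_000)
y = 0.5 * x + rng.standard_normal(500_000)

lhs = np.var(x + y)
# bias=True makes np.cov use the same normalization (ddof=0) as np.var
rhs = np.var(x) + np.var(y) + 2 * np.cov(x, y, bias=True)[0, 1]
print(lhs, rhs)  # the two sides agree up to floating-point error
```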
Pearson Correlation Coefficient
The correlation normalizes covariance to the range $[-1, 1]$:
$$\rho(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}.$$
$|\rho| = 1$ iff $Y = aX + b$ a.s. for some constants with $a \ne 0$. Correlation measures linear dependence only; uncorrelated random variables can still be strongly dependent (e.g., $X \sim N(0,1)$ and $Y = X^2$ are uncorrelated but not independent).
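The uncorrelated-but-dependent example above is easy to confirm by simulation:

```python
import numpy as np

# Sketch: X ~ N(0, 1) and Y = X^2 are uncorrelated (Cov = E[X^3] = 0)
# yet completely dependent -- correlation sees only linear association.
rng = np.random.default_rng(2)
x = rng.standard_normal(1_000_000)
y = x**2

rho = np.corrcoef(x, y)[0, 1]
print(f"correlation(X, X^2) ≈ {rho:.4f}")  # near 0
```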
1.5 Conditional Expectation
Definition
The conditional expectation of $X$ given $Y = y$ is
$$E[X \mid Y = y] = \sum_x x \, P(X = x \mid Y = y) \quad \text{(discrete)}, \qquad E[X \mid Y = y] = \int x \, f_{X \mid Y}(x \mid y) \, dx \quad \text{(continuous)}.$$
As a function of $Y$, $E[X \mid Y]$ is itself a random variable, representing the best prediction of $X$ given $Y$ in the mean-squared-error sense.
Law of Total Expectation (Tower Property)
The tower property (also called the law of iterated expectations) states:
$$E\big[E[X \mid Y]\big] = E[X].$$
More generally, for $\sigma$-algebras $\mathcal{G} \subseteq \mathcal{H}$: $E[E[X \mid \mathcal{H}] \mid \mathcal{G}] = E[X \mid \mathcal{G}]$.
There is also a law of total variance:
$$\text{Var}(X) = E[\text{Var}(X \mid Y)] + \text{Var}(E[X \mid Y]).$$
This decomposes total variability into the average “within-group” variance and the “between-group” variance.
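A simulation sketch of this decomposition, using an illustrative two-stage model ($Y$ uniform on $\{0, 1\}$; given $Y = y$, $X \sim N(3y, (1+y)^2)$), so that $E[\text{Var}(X \mid Y)] = (1 + 4)/2 = 2.5$ and $\text{Var}(E[X \mid Y]) = \text{Var}(3Y) = 2.25$:

```python
import numpy as np

# Sketch: law of total variance for a two-stage (hierarchical) model.
# Predicted total variance: 2.5 (within-group) + 2.25 (between-group) = 4.75.
rng = np.random.default_rng(3)
n = 1_000_000
y = rng.integers(0, 2, size=n)           # Y in {0, 1} equally likely
x = rng.normal(loc=3 * y, scale=1 + y)   # X | Y = y ~ N(3y, (1+y)^2)

total_var = x.var()
print(f"Var(X) ≈ {total_var:.3f}")  # ≈ 4.75
```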
Computational Laboratory
We compute moments, verify Chebyshev's inequality empirically, examine covariance structure, and illustrate conditional expectation via simulation.
Moments, Chebyshev, Covariance & Conditional Expectation
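As one self-contained lab exercise, the sketch below verifies the tower property $E[E[X \mid Y]] = E[X]$ by simulation. The hierarchical model is illustrative: $Y \sim \text{Poisson}(4)$ and, given $Y$, $X \sim \text{Binomial}(Y, 0.3)$, so $E[X \mid Y] = 0.3Y$ and $E[X] = 0.3 \cdot 4 = 1.2$.

```python
import numpy as np

# Sketch: tower property E[E[X|Y]] = E[X] via simulation.
# Model (illustrative): Y ~ Poisson(4); X | Y ~ Binomial(Y, 0.3).
rng = np.random.default_rng(5)
y = rng.poisson(lam=4, size=1_000_000)
x = rng.binomial(n=y, p=0.3)

print(np.mean(0.3 * y))  # E[E[X|Y]], ≈ 1.2
print(np.mean(x))        # E[X],      ≈ 1.2
```

The two printed averages agree (up to Monte Carlo error), illustrating that averaging the conditional mean over the distribution of $Y$ recovers the unconditional mean.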