Part I: Probability | Chapter 1

Probability Axioms

Kolmogorov's axiomatic foundations of probability theory

Historical Context

Before Andrey Nikolaevich Kolmogorov published his landmark monograph Grundbegriffe der Wahrscheinlichkeitsrechnung in 1933, probability was a collection of ad hoc techniques lacking a unified mathematical foundation. Laplace had formulated classical probability as the ratio of favorable to total outcomes, but this required equally likely events and could not handle continuous sample spaces. Richard von Mises proposed a frequentist axiomatization based on limiting relative frequencies, but it ran into logical difficulties with the definition of randomness. Kolmogorov's insight was to identify probability with measure theory, building on the work of Émile Borel and Henri Lebesgue. By grounding probability in sigma-algebras and countably additive set functions, Kolmogorov unified discrete and continuous probability, resolved longstanding paradoxes, and made probability a rigorous branch of mathematics.

This axiomatic approach had far-reaching consequences. It enabled the rigorous treatment of stochastic processes, martingale theory, ergodic theory, and ultimately the modern theory of mathematical finance, statistical mechanics, and information theory. Today, Kolmogorov's axioms remain the universally accepted foundation of probability.

1.1 Sample Spaces and Events

Every probabilistic model begins with a sample space, which we denote by $\Omega$. This is the set of all possible outcomes of a random experiment. For a single coin toss, $\Omega = \{H, T\}$. For the lifetime of a light bulb, $\Omega = [0, \infty)$.

Definition: Sample Space

A sample space $\Omega$ is a non-empty set whose elements $\omega \in \Omega$ are called outcomes or sample points. An event is a subset $A \subseteq \Omega$ to which we wish to assign a probability.

For finite or countably infinite sample spaces, we can assign probabilities to every subset. However, for uncountable sample spaces (such as $\Omega = \mathbb{R}$), we cannot consistently assign probabilities to all subsets. This was demonstrated by Vitali's construction (1905), which showed the existence of non-measurable sets under the axiom of choice. The solution is to restrict our attention to a suitable collection of “measurable” subsets.

Examples of Sample Spaces

Example 1: Rolling Two Dice

$$\Omega = \{(i,j) : i,j \in \{1,2,3,4,5,6\}\}$$

This sample space has $|\Omega| = 36$ equally likely outcomes under the assumption of fair dice. The event “the sum is 7” corresponds to the subset $A = \{(1,6),(2,5),(3,4),(4,3),(5,2),(6,1)\}$ with $|A| = 6$.
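These counts are easy to verify by brute-force enumeration; the following sketch does so in Python:

```python
from itertools import product

# Sample space for two fair dice: all 36 ordered pairs (i, j).
omega = list(product(range(1, 7), repeat=2))

# Event "the sum is 7".
A = [(i, j) for (i, j) in omega if i + j == 7]

print(len(omega), len(A), len(A) / len(omega))  # 36 outcomes, 6 favorable, probability 1/6
```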

Example 2: Continuous Lifetime

$$\Omega = [0, \infty)$$

For the lifetime of a component, each outcome is a non-negative real number. The event “the component lasts more than 100 hours” is $A = (100, \infty)$.

Example 3: Infinite Coin Tosses

$$\Omega = \{H, T\}^{\mathbb{N}} = \{(\omega_1, \omega_2, \omega_3, \ldots) : \omega_i \in \{H, T\}\}$$

This is an uncountable set (it has the cardinality of the continuum). Events like “heads appears infinitely often” require careful measure-theoretic treatment.

1.2 Sigma-Algebras

A sigma-algebra (or $\sigma$-algebra) is a collection of subsets of $\Omega$ that is closed under the operations we need to do probability. It captures precisely which events we can measure.

Definition: Sigma-Algebra

A collection $\mathcal{F} \subseteq 2^{\Omega}$ is a $\sigma$-algebra if:

  • (i) $\Omega \in \mathcal{F}$ (the entire sample space is an event)
  • (ii) If $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$ (closure under complementation)
  • (iii) If $A_1, A_2, A_3, \ldots \in \mathcal{F}$, then $\bigcup_{n=1}^{\infty} A_n \in \mathcal{F}$ (closure under countable unions)
Figure. Venn diagram showing events A and B within sample space $\Omega$, their intersection $A \cap B$, union $A \cup B$, and complement $A^c$.

Derivation 1: Closure Properties

From the three axioms above, we can derive several important closure properties:

The empty set is in $\mathcal{F}$: Since $\Omega \in \mathcal{F}$ and $\mathcal{F}$ is closed under complements, we have $\emptyset = \Omega^c \in \mathcal{F}$.

Closure under countable intersections: By De Morgan's law,

$$\bigcap_{n=1}^{\infty} A_n = \left(\bigcup_{n=1}^{\infty} A_n^c\right)^c$$

Since each $A_n^c \in \mathcal{F}$ by (ii), their countable union is in $\mathcal{F}$ by (iii), and the complement of that union is in $\mathcal{F}$ by (ii) again.

Closure under set difference: $A \setminus B = A \cap B^c \in \mathcal{F}$ whenever $A, B \in \mathcal{F}$.
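For a finite $\Omega$, the closure axioms can be checked mechanically, since countable unions reduce to finite ones. A minimal sketch, with an example collection chosen for illustration:

```python
def is_sigma_algebra(omega, F):
    """Check the sigma-algebra axioms for a collection F of subsets of a finite omega."""
    F = {frozenset(s) for s in F}
    # (i) The whole space is an event.
    if frozenset(omega) not in F:
        return False
    # (ii) Closure under complementation.
    if any(frozenset(omega) - A not in F for A in F):
        return False
    # (iii) For finite omega, closure under pairwise unions suffices.
    return all(A | B in F for A in F for B in F)

omega = {1, 2, 3, 4}
F = [set(), {1, 2}, {3, 4}, {1, 2, 3, 4}]
print(is_sigma_algebra(omega, F))                    # True
print(is_sigma_algebra(omega, [set(), {1}, omega]))  # False: {1}^c = {2,3,4} is missing
```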

The Borel Sigma-Algebra

For $\Omega = \mathbb{R}$, the most important $\sigma$-algebra is the Borel $\sigma$-algebra $\mathcal{B}(\mathbb{R})$, defined as the smallest $\sigma$-algebra containing all open intervals $(a, b)$. Equivalently, it is generated by the collection of all open sets, all closed sets, or all half-open intervals $(-\infty, a]$. The Borel $\sigma$-algebra contains all sets you would ever encounter in practice: open sets, closed sets, countable unions and intersections of these, and so on. Non-Borel sets exist but require the axiom of choice to construct.

Formal construction: Given a collection $\mathcal{C}$ of subsets of $\Omega$, we define $\sigma(\mathcal{C})$ as the intersection of all $\sigma$-algebras containing $\mathcal{C}$:

$$\sigma(\mathcal{C}) = \bigcap \{\mathcal{F} : \mathcal{F} \text{ is a } \sigma\text{-algebra and } \mathcal{C} \subseteq \mathcal{F}\}$$

This intersection is itself a $\sigma$-algebra (the axioms are preserved under arbitrary intersections) and is the smallest $\sigma$-algebra containing $\mathcal{C}$. For $\mathcal{B}(\mathbb{R})$, we take $\mathcal{C}$ to be the open sets of $\mathbb{R}$.

1.3 Kolmogorov's Axioms

With the notions of sample space and $\sigma$-algebra in hand, we can state Kolmogorov's axioms. These three simple axioms provide the complete foundation for probability theory.

Kolmogorov's Axioms (1933)

A probability measure on a measurable space $(\Omega, \mathcal{F})$ is a function $P: \mathcal{F} \to [0,1]$ satisfying:

  • Axiom 1 (Non-negativity): For every event $A \in \mathcal{F}$,
    $$P(A) \geq 0$$
  • Axiom 2 (Normalization): The probability of the entire sample space is one:
    $$P(\Omega) = 1$$
  • Axiom 3 (Countable additivity): For any countable sequence of pairwise disjoint events $A_1, A_2, A_3, \ldots \in \mathcal{F}$ (meaning $A_i \cap A_j = \emptyset$ for $i \neq j$):
    $$P\!\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} P(A_n)$$

The triple $(\Omega, \mathcal{F}, P)$ is called a probability space. Everything in probability theory ultimately refers back to some underlying probability space.

Derivation 2: Basic Properties from the Axioms

From Kolmogorov's three axioms, we derive the essential properties of probability:

Property 1: $P(\emptyset) = 0$

Set $A_1 = \Omega$ and $A_n = \emptyset$ for $n \geq 2$. These are pairwise disjoint and $\bigcup_{n=1}^{\infty} A_n = \Omega$. By Axiom 3:

$$1 = P(\Omega) = P(\Omega) + \sum_{n=2}^{\infty} P(\emptyset)$$

Since $P(\Omega) = 1$ is finite, we can subtract it from both sides to get $\sum_{n=2}^{\infty} P(\emptyset) = 0$. Since $P(\emptyset) \geq 0$ by Axiom 1, we must have $P(\emptyset) = 0$.

Property 2: Complement Rule

Since $A$ and $A^c$ are disjoint and $A \cup A^c = \Omega$:

$$P(A) + P(A^c) = P(\Omega) = 1 \quad \Longrightarrow \quad P(A^c) = 1 - P(A)$$

Property 3: Monotonicity

If $A \subseteq B$, write $B = A \cup (B \setminus A)$ where the union is disjoint. Then:

$$P(B) = P(A) + P(B \setminus A) \geq P(A)$$

since $P(B \setminus A) \geq 0$.

Property 4: Subadditivity (Union Bound / Boole's Inequality)

$$P\!\left(\bigcup_{n=1}^{\infty} A_n\right) \leq \sum_{n=1}^{\infty} P(A_n)$$

This holds even when the events are not disjoint. The proof follows by writing the union as a disjoint union and applying monotonicity to each piece.
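A concrete check of the union bound, using exact rational arithmetic on a small sample space (the divisibility events are chosen for illustration):

```python
from fractions import Fraction

# Omega = {1, ..., 100} with the uniform measure; events: multiples of 2, 3, 5.
omega = range(1, 101)
events = [{w for w in omega if w % d == 0} for d in (2, 3, 5)]

def p(A):
    return Fraction(len(A), 100)

lhs = p(set().union(*events))       # exact probability of the union
rhs = sum(p(A) for A in events)     # the union bound
print(lhs, rhs, lhs <= rhs)         # 37/50  103/100  True
```

Note that the bound can exceed 1, as here; it is still a valid (if vacuous) upper bound.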

1.4 Probability Spaces

Let us examine several important probability spaces to build intuition.

Discrete Probability Spaces

When $\Omega$ is finite or countably infinite, we take $\mathcal{F} = 2^{\Omega}$ (the power set) and define $P$ by specifying $P(\{\omega\})$ for each $\omega \in \Omega$. Then for any event $A$:

$$P(A) = \sum_{\omega \in A} P(\{\omega\})$$

The values $p(\omega) = P(\{\omega\})$ must satisfy $p(\omega) \geq 0$ for all $\omega$ and $\sum_{\omega \in \Omega} p(\omega) = 1$.

Derivation 3: The Uniform Distribution on a Finite Set

If $|\Omega| = N$ and each outcome is equally likely, then $p(\omega) = 1/N$ for all $\omega$. For any event $A$:

$$P(A) = \sum_{\omega \in A} \frac{1}{N} = \frac{|A|}{N}$$

This is the classical definition of probability due to Laplace. It is a special case of Kolmogorov's axioms, not a separate definition.

Verification of axioms: (1) $P(A) = |A|/N \geq 0$. (2) $P(\Omega) = N/N = 1$. (3) For disjoint $A, B$: $P(A \cup B) = |A \cup B|/N = (|A| + |B|)/N = P(A) + P(B)$.

Continuous Probability Spaces

When $\Omega$ is uncountable (e.g., $\Omega = \mathbb{R}$), we typically define probabilities via a probability density function (pdf) $f$:

$$P(A) = \int_A f(x) \, dx$$

where $f(x) \geq 0$ and $\int_{-\infty}^{\infty} f(x) \, dx = 1$. The $\sigma$-algebra is $\mathcal{B}(\mathbb{R})$. Note that for continuous distributions, $P(\{x\}) = 0$ for every individual point $x$, yet $P(\mathbb{R}) = 1$. This is perfectly consistent because the axioms only require countable additivity, not uncountable additivity.

Example: Uniform Distribution on [0, 1]

Let $\Omega = [0, 1]$, $\mathcal{F} = \mathcal{B}([0,1])$, and $f(x) = 1$ for $x \in [0,1]$. Then:

$$P([a, b]) = \int_a^b 1 \, dx = b - a \quad \text{for } 0 \leq a \leq b \leq 1$$

This is Lebesgue measure restricted to $[0,1]$, the simplest continuous probability space.
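A Monte Carlo sketch of this fact: sampling uniformly from $[0,1]$, the empirical frequency of landing in $[a, b]$ should approach $b - a$. The interval endpoints below are arbitrary:

```python
import random

random.seed(0)

n = 100_000
a, b = 0.25, 0.60   # arbitrary subinterval of [0, 1]
hits = sum(a <= random.random() <= b for _ in range(n))
print(hits / n)  # should be close to b - a = 0.35
```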

1.5 The Inclusion-Exclusion Principle

One of the most fundamental combinatorial tools in probability is the inclusion-exclusion principle, which gives the exact probability of a union of events.

Derivation 4: Two Events

Write $A \cup B = A \cup (B \setminus A)$ as a disjoint union. By Axiom 3:

$$P(A \cup B) = P(A) + P(B \setminus A)$$

Now write $B = (A \cap B) \cup (B \setminus A)$ as a disjoint union:

$$P(B) = P(A \cap B) + P(B \setminus A) \quad \Longrightarrow \quad P(B \setminus A) = P(B) - P(A \cap B)$$

Substituting:

$$\boxed{P(A \cup B) = P(A) + P(B) - P(A \cap B)}$$
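The boxed formula can be verified by direct counting; here is a sketch using divisibility events on $\{1, \ldots, 60\}$ (chosen for illustration), with exact rational arithmetic:

```python
from fractions import Fraction

omega = set(range(1, 61))
A = {w for w in omega if w % 4 == 0}   # multiples of 4: 15 outcomes
B = {w for w in omega if w % 6 == 0}   # multiples of 6: 10 outcomes

def p(S):
    return Fraction(len(S), len(omega))

lhs = p(A | B)                      # direct probability of the union
rhs = p(A) + p(B) - p(A & B)        # inclusion-exclusion (A & B: multiples of 12)
print(lhs, rhs)                     # both equal 1/3
```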

Derivation 5: General Inclusion-Exclusion

For $n$ events $A_1, A_2, \ldots, A_n$:

$$P\!\left(\bigcup_{i=1}^n A_i\right) = \sum_{i=1}^n P(A_i) - \sum_{i<j} P(A_i \cap A_j) + \sum_{i<j<k} P(A_i \cap A_j \cap A_k) - \cdots + (-1)^{n+1} P(A_1 \cap \cdots \cap A_n)$$

In compact notation:

$$P\!\left(\bigcup_{i=1}^n A_i\right) = \sum_{k=1}^n (-1)^{k+1} \sum_{1 \leq i_1 < \cdots < i_k \leq n} P(A_{i_1} \cap \cdots \cap A_{i_k})$$

Proof by induction: The base case $n = 2$ was proven above. Assuming the formula holds for $n-1$ events, write:

$$P\!\left(\bigcup_{i=1}^n A_i\right) = P\!\left(\bigcup_{i=1}^{n-1} A_i\right) + P(A_n) - P\!\left(\bigcup_{i=1}^{n-1} (A_i \cap A_n)\right)$$

Applying the induction hypothesis to both the first and third terms and collecting like terms yields the formula for $n$ events. The key insight is that $(A_i \cap A_n) \cap (A_j \cap A_n) = A_i \cap A_j \cap A_n$, so the induction hypothesis applies to the collection $\{A_i \cap A_n\}_{i=1}^{n-1}$.

Application: The Matching Problem (Derangements)

Consider $n$ letters randomly placed in $n$ envelopes. What is the probability that at least one letter ends up in the correct envelope? Let $A_i$ be the event that letter $i$ is in envelope $i$. Then:

$$P(A_{i_1} \cap \cdots \cap A_{i_k}) = \frac{(n-k)!}{n!}$$

since there are $(n-k)!$ ways to arrange the remaining letters, out of $n!$ total. There are $\binom{n}{k}$ terms in the $k$-th sum, so:

$$P\!\left(\bigcup_{i=1}^n A_i\right) = \sum_{k=1}^n (-1)^{k+1} \binom{n}{k} \frac{(n-k)!}{n!} = \sum_{k=1}^n \frac{(-1)^{k+1}}{k!}$$

The probability of a derangement (no matches) is:

$$P(\text{derangement}) = 1 - \sum_{k=1}^n \frac{(-1)^{k+1}}{k!} = \sum_{k=0}^n \frac{(-1)^k}{k!} \approx e^{-1} \approx 0.3679$$

Remarkably, this probability converges extremely rapidly to $1/e$ as $n \to \infty$. Even for $n = 5$, the probability is $1 - 1 + 1/2 - 1/6 + 1/24 - 1/120 \approx 0.3667$, very close to $1/e$.
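The alternating sum is easy to evaluate exactly; the sketch below computes the derangement probability from the inclusion-exclusion count $D_n = \sum_{k=0}^n (-1)^k \binom{n}{k} (n-k)!$ and compares it with $1/e$:

```python
from math import comb, factorial, exp

def derangement_prob(n):
    """P(no fixed point) via inclusion-exclusion: D_n / n!."""
    d_n = sum((-1) ** k * comb(n, k) * factorial(n - k) for k in range(n + 1))
    return d_n / factorial(n)

for n in (3, 5, 10):
    print(n, derangement_prob(n))
print(exp(-1))  # limiting value 1/e ≈ 0.367879
```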

1.6 Continuity of Probability

Countable additivity has a powerful consequence: probability measures are continuous with respect to monotone sequences of events.

Theorem: Continuity from Below

If $A_1 \subseteq A_2 \subseteq A_3 \subseteq \cdots$ (an increasing sequence), then:

$$P\!\left(\bigcup_{n=1}^{\infty} A_n\right) = \lim_{n \to \infty} P(A_n)$$

Proof: Define $B_1 = A_1$ and $B_n = A_n \setminus A_{n-1}$ for $n \geq 2$. The $B_n$ are pairwise disjoint and $\bigcup_{n=1}^{\infty} B_n = \bigcup_{n=1}^{\infty} A_n$. By countable additivity:

$$P\!\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} P(B_n) = \lim_{N \to \infty} \sum_{n=1}^{N} P(B_n) = \lim_{N \to \infty} P(A_N)$$

where the last equality holds because $A_N = \bigcup_{n=1}^{N} B_n$ is a finite disjoint union, so $P(A_N) = \sum_{n=1}^{N} P(B_n)$ by finite additivity.
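A concrete illustration: take $\Omega = \{1, 2, 3, \ldots\}$ with $P(\{k\}) = 2^{-k}$ and $A_N = \{1, \ldots, N\}$, an increasing sequence with union $\Omega$. Then $P(A_N) = 1 - 2^{-N} \to 1 = P(\Omega)$, as continuity from below predicts:

```python
# Discrete example: Omega = {1, 2, 3, ...} with P({k}) = 2**-k, A_N = {1, ..., N}.
def P_A(N):
    return sum(2.0 ** -k for k in range(1, N + 1))  # equals 1 - 2**-N

for N in (1, 5, 10, 20):
    print(N, P_A(N))  # increases toward P(Omega) = 1
```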

Theorem: Continuity from Above

If $A_1 \supseteq A_2 \supseteq A_3 \supseteq \cdots$ (a decreasing sequence), then:

$$P\!\left(\bigcap_{n=1}^{\infty} A_n\right) = \lim_{n \to \infty} P(A_n)$$

This follows by applying continuity from below to $A_n^c$.

These continuity properties are essential for proving many results in probability theory, including the Borel-Cantelli lemmas and the existence of certain limiting objects.

1.7 The Borel-Cantelli Lemmas

The Borel-Cantelli lemmas are among the most useful tools in probability theory. They describe the conditions under which events occur infinitely often.

First Borel-Cantelli Lemma

If $\sum_{n=1}^{\infty} P(A_n) < \infty$, then $P(\limsup_{n \to \infty} A_n) = 0$, where

$$\limsup_{n \to \infty} A_n = \bigcap_{n=1}^{\infty} \bigcup_{k=n}^{\infty} A_k = \{A_n \text{ i.o.}\}$$

In words: if the total probability is finite, then with probability 1 only finitely many of the events occur.

Proof: Let $B_n = \bigcup_{k=n}^{\infty} A_k$. Then $B_1 \supseteq B_2 \supseteq \cdots$ and $\limsup A_n = \bigcap_{n=1}^{\infty} B_n$. By continuity from above:

$$P(\limsup A_n) = \lim_{n \to \infty} P(B_n) \leq \lim_{n \to \infty} \sum_{k=n}^{\infty} P(A_k) = 0$$

where the inequality uses the union bound and the limit is zero because the tail of a convergent series tends to zero.

Second Borel-Cantelli Lemma

If $\sum_{n=1}^{\infty} P(A_n) = \infty$ and the $A_n$ are independent, then $P(\limsup_{n \to \infty} A_n) = 1$. That is, with probability 1 infinitely many of the events occur. The independence condition is essential; without it, the conclusion can fail.
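A simulation sketch contrasting the two regimes (the event probabilities $1/n^2$ versus $1/n$ are chosen for illustration; a single finite run can only hint at the "infinitely often" dichotomy):

```python
import random

random.seed(1)

def occurrences(p_fn, N):
    """One run: count how many of the independent events A_1, ..., A_N occur,
    where A_n has probability p_fn(n)."""
    return sum(random.random() < p_fn(n) for n in range(1, N + 1))

N = 100_000
occ_finite = occurrences(lambda n: 1 / n ** 2, N)   # sum P(A_n) converges
occ_infinite = occurrences(lambda n: 1 / n, N)      # sum P(A_n) diverges

# First lemma: the convergent case yields only a handful of occurrences
# (expected count is about pi^2/6).  In the divergent case the count keeps
# growing with N, roughly like log N, consistent with the second lemma.
print(occ_finite, occ_infinite)
```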

1.8 Applications

Application 1: Reliability Theory

In reliability engineering, a system consists of components with known failure probabilities. Using the axioms, we can compute system reliability. For a series system (all components must work), the reliability is $R = \prod_{i=1}^n (1 - p_i)$, where $p_i$ is the failure probability of component $i$ (assuming independence). A parallel system (at least one must work) fails only if every component fails, so by independence:

$$P(\text{system fails}) = P\!\left(\bigcap_{i=1}^n F_i\right) = \prod_{i=1}^n p_i$$
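A quick numeric sketch with illustrative failure probabilities (assumed independent):

```python
from math import prod

p = [0.05, 0.10, 0.02]  # illustrative component failure probabilities

series_reliability = prod(1 - pi for pi in p)   # all components must work
parallel_failure = prod(p)                      # system fails only if all fail
parallel_reliability = 1 - parallel_failure

print(series_reliability)    # 0.95 * 0.90 * 0.98 = 0.8379
print(parallel_reliability)  # 1 - 0.0001 = 0.9999
```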

Application 2: Genetics and Hardy-Weinberg Equilibrium

The Hardy-Weinberg principle uses the probability axioms to predict genotype frequencies. If allele $A$ has frequency $p$ and allele $a$ has frequency $q = 1 - p$, then under random mating the genotype frequencies form a probability space with $P(AA) = p^2$, $P(Aa) = 2pq$, $P(aa) = q^2$. The normalization axiom is satisfied: $p^2 + 2pq + q^2 = (p + q)^2 = 1$.

Application 3: The Birthday Problem

With $n$ people and 365 days, the probability that all birthdays are distinct is:

$$P(\text{all distinct}) = \prod_{k=0}^{n-1} \frac{365 - k}{365} = \frac{365!}{365^n (365-n)!}$$

For $n = 23$, $P(\text{at least one match}) \approx 0.507$, exceeding 50%. This counter-intuitive result follows directly from the multiplicative structure of the probability space.
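The product is straightforward to evaluate numerically:

```python
def p_all_distinct(n, days=365):
    """Probability that n birthdays are all distinct."""
    p = 1.0
    for k in range(n):
        p *= (days - k) / days
    return p

print(1 - p_all_distinct(23))  # ≈ 0.507, the first n exceeding 50%
```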

Application 4: Cryptography and Random Number Testing

In cryptographic applications, the axioms of probability provide the framework for measuring the randomness of key generation. A cryptographically secure pseudorandom number generator (CSPRNG) must satisfy statistical tests derived from these axioms: uniform distribution (each bit has $P(1) = P(0) = 1/2$), independence between bits, and chi-squared goodness-of-fit tests based on the probability space structure.

1.9 Python Simulation

The following simulation demonstrates the inclusion-exclusion principle and the derangement probability, verifying our theoretical results computationally.

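The original interactive script is not reproduced here; the following is a sketch of such a simulation, estimating the derangement probability by Monte Carlo and checking two-event inclusion-exclusion on a pair of dice (events chosen for illustration):

```python
import random
from math import exp

random.seed(42)

# --- Derangements: Monte Carlo estimate of P(no letter in its own envelope) ---
def simulate_derangement_prob(n, trials=200_000):
    count = 0
    for _ in range(trials):
        perm = list(range(n))
        random.shuffle(perm)
        if all(perm[i] != i for i in range(n)):
            count += 1
    return count / trials

est = simulate_derangement_prob(7)
print(est, exp(-1))  # estimate should be close to 1/e ≈ 0.3679

# --- Inclusion-exclusion for two dice events ---
# A: the sum is 7;  B: the first die shows 6.
# Exact answer: P(A u B) = 6/36 + 6/36 - 1/36 = 11/36.
def simulate_union(trials=200_000):
    nA = nB = nAB = nU = 0
    for _ in range(trials):
        i, j = random.randint(1, 6), random.randint(1, 6)
        a, b = (i + j == 7), (i == 6)
        nA += a; nB += b; nAB += a and b; nU += a or b
    return nU / trials, nA / trials + nB / trials - nAB / trials

u_direct, u_ie = simulate_union()
print(u_direct, u_ie)  # both close to 11/36 ≈ 0.3056
```

Note that the two estimates in the second experiment agree sample by sample: counting $A \cup B$ directly and counting via $|A| + |B| - |A \cap B|$ are the same computation, which is exactly the counting identity behind inclusion-exclusion.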

1.10 Summary and Key Takeaways

Kolmogorov's Axioms

Three axioms (non-negativity, normalization, countable additivity) on a measurable space $(\Omega, \mathcal{F})$ define a probability measure $P$. The triple $(\Omega, \mathcal{F}, P)$ is a probability space.

Sigma-Algebras

A $\sigma$-algebra specifies which events are measurable. For continuous sample spaces, we use the Borel $\sigma$-algebra, generated by open sets.

Inclusion-Exclusion

The probability of a union of events is computed by alternating sums of intersection probabilities. This generalizes $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.

Continuity and Borel-Cantelli

Probability measures are continuous with respect to monotone sequences. The Borel-Cantelli lemmas characterize when events occur infinitely often.

Applications

The axiomatic framework applies to reliability theory, genetics, the birthday problem, cryptography, and every other domain where uncertainty must be quantified rigorously.
