Conditional Probability
Bayes' theorem, independence, and the law of total probability
Historical Context
The Reverend Thomas Bayes (1701–1761) was an English Presbyterian minister and mathematician who formulated the first version of what we now call Bayes' theorem. His work, published posthumously in 1763 as “An Essay towards solving a Problem in the Doctrine of Chances,” addressed the inverse probability problem: given observed outcomes, what can we infer about the underlying probability? Pierre-Simon Laplace independently discovered and significantly generalized Bayes' result in 1774, applying it to problems in astronomy, population statistics, and jurisprudence. Laplace's formulation is closer to the modern statement of Bayes' theorem.
The concept of conditional probability was formalized within Kolmogorov's axiomatization in 1933. Kolmogorov defined conditional probability as a ratio of probabilities when the conditioning event has positive probability, and later extended the concept to conditioning on events of probability zero using the theory of conditional expectations and the Radon-Nikodym theorem. This extension is essential for continuous random variables and modern probability theory.
Today, Bayes' theorem is the cornerstone of Bayesian statistics, machine learning, medical diagnostics, spam filtering, and countless other applications. The “Bayesian revolution” in statistics, enabled by computational advances in MCMC methods since the 1990s, has made Bayesian inference one of the dominant paradigms in data science.
2.1 Definition of Conditional Probability
If we learn that event $B$ has occurred, how should we update the probability of event $A$? The answer is given by the conditional probability.
Definition: Conditional Probability
For events $A, B$ with $P(B) > 0$, the conditional probability of $A$ given $B$ is:
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$
Intuitively, given that $B$ has occurred, the new sample space is restricted to $B$. We rescale by $1/P(B)$ to maintain normalization. One can verify that for fixed $B$ with $P(B) > 0$, the function $A \mapsto P(A \mid B)$ is itself a probability measure on $(\Omega, \mathcal{F})$.
Derivation 1: Conditional Probability is a Probability Measure
We verify Kolmogorov's three axioms for $Q(A) := P(A \mid B)$:
(1) Non-negativity: $Q(A) = P(A \cap B)/P(B) \geq 0$ since the numerator is non-negative and the denominator is positive.
(2) Normalization: $Q(\Omega) = P(\Omega \cap B)/P(B) = P(B)/P(B) = 1$.
(3) Countable additivity: If $A_1, A_2, \ldots$ are pairwise disjoint:
$$Q\left(\bigcup_{n=1}^{\infty} A_n\right) = \frac{P\left(\left(\bigcup_{n} A_n\right) \cap B\right)}{P(B)} = \frac{P\left(\bigcup_{n} (A_n \cap B)\right)}{P(B)} = \sum_{n=1}^{\infty} \frac{P(A_n \cap B)}{P(B)} = \sum_{n=1}^{\infty} Q(A_n),$$
where we used the fact that the $A_n \cap B$ are also pairwise disjoint. Therefore $P(\cdot \mid B)$ satisfies all axioms and is a valid probability measure.
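As a quick numerical check (a minimal sketch using a fair die; the specific events are illustrative, not from the text), the three axioms can be verified with exact rational arithmetic:

```python
from fractions import Fraction

# Sample space: one roll of a fair die.
omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Probability of an event under the uniform measure on omega."""
    return Fraction(len(event & omega), len(omega))

def cond_prob(a, b):
    """P(A | B) = P(A ∩ B) / P(B), requiring P(B) > 0."""
    return prob(a & b) / prob(b)

B = {2, 4, 6}                         # conditioning event: "roll is even"
partition = [{1, 2}, {3, 4}, {5, 6}]  # a partition of omega

# Axiom checks for Q(A) = P(A | B):
assert cond_prob(omega, B) == 1                        # normalization
assert all(cond_prob(A, B) >= 0 for A in partition)    # non-negativity
assert sum(cond_prob(A, B) for A in partition) == 1    # additivity over a partition
```

Using `Fraction` avoids floating-point round-off, so the axioms hold exactly rather than approximately.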
The Multiplication Rule
Rearranging the definition gives the multiplication rule:
$$P(A \cap B) = P(A \mid B)\,P(B) = P(B \mid A)\,P(A).$$
This extends to multiple events by the chain rule:
$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1 \cap A_2) \cdots P(A_n \mid A_1 \cap \cdots \cap A_{n-1}).$$
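The chain rule can be illustrated with a small worked computation (a hypothetical example, not from the text): the probability of drawing three aces in a row from a standard 52-card deck without replacement.

```python
from fractions import Fraction

# Chain rule: P(A1 ∩ A2 ∩ A3) = P(A1) P(A2 | A1) P(A3 | A1 ∩ A2).
# After each ace is drawn, both the aces remaining and the deck shrink by one.
p = Fraction(4, 52) * Fraction(3, 51) * Fraction(2, 50)
assert p == Fraction(1, 5525)
```

Each factor is a conditional probability given the previous draws, which is exactly what the chain rule prescribes.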
2.2 The Law of Total Probability
The law of total probability allows us to compute the probability of an event by conditioning on a partition of the sample space.
Theorem: Law of Total Probability
If $B_1, B_2, \ldots, B_n$ form a partition of $\Omega$ (i.e., they are pairwise disjoint and $\bigcup_{i=1}^n B_i = \Omega$) with $P(B_i) > 0$ for all $i$, then for any event $A$:
$$P(A) = \sum_{i=1}^{n} P(A \mid B_i)\,P(B_i).$$
Derivation 2: Proof of the Law of Total Probability
Since $\{B_i\}$ is a partition:
$$A = A \cap \Omega = A \cap \left(\bigcup_{i} B_i\right) = \bigcup_{i} (A \cap B_i).$$
The sets $A \cap B_i$ are pairwise disjoint (since the $B_i$ are), so by countable additivity:
$$P(A) = \sum_{i} P(A \cap B_i) = \sum_{i} P(A \mid B_i)\,P(B_i).$$
The last step uses the multiplication rule $P(A \cap B_i) = P(A \mid B_i) P(B_i)$.
Example: Medical Testing
A disease affects 1% of the population. A test has sensitivity 95% (true positive rate) and specificity 98% (true negative rate). What is the probability that a randomly selected person tests positive?
Let $D$ = has disease, $T$ = tests positive. The partition is $\{D, D^c\}$.
$$P(T) = P(T \mid D)\,P(D) + P(T \mid D^c)\,P(D^c) = 0.95 \times 0.01 + 0.02 \times 0.99 = 0.0095 + 0.0198 = 0.0293.$$
About 2.93% of people test positive. This is crucial for the Bayesian calculation that follows.
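The total-probability computation above can be reproduced in a few lines (variable names are our own):

```python
p_d = 0.01    # prevalence P(D)
sens = 0.95   # sensitivity P(T | D)
spec = 0.98   # specificity P(T^c | D^c); false-positive rate is 1 - spec

# Law of total probability over the partition {D, D^c}:
p_pos = sens * p_d + (1 - spec) * (1 - p_d)
assert abs(p_pos - 0.0293) < 1e-9
```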
2.3 Bayes' Theorem
Derivation 3: Full Derivation of Bayes' Theorem
Step 1: Start from the definition of conditional probability:
$$P(B \mid A) = \frac{P(A \cap B)}{P(A)}.$$
Step 2: Apply the multiplication rule to the numerator:
$$P(A \cap B) = P(A \mid B)\,P(B).$$
Step 3: Apply the law of total probability to the denominator:
$$P(A) = P(A \mid B)\,P(B) + P(A \mid B^c)\,P(B^c).$$
Step 4: Combine to obtain Bayes' theorem:
$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A \mid B)\,P(B) + P(A \mid B^c)\,P(B^c)}.$$
Bayes' Theorem (Simple Form)
For two events with $P(A) > 0$ and $P(B) > 0$:
$$P(B \mid A) = \frac{P(A \mid B)\,P(B)}{P(A)}.$$
Here $P(B)$ is the prior, $P(A \mid B)$ is the likelihood, $P(B \mid A)$ is the posterior, and $P(A)$ is the evidence (marginal likelihood).
Example: Medical Test (Continued)
Given a positive test result, what is the probability the person actually has the disease?
$$P(D \mid T) = \frac{P(T \mid D)\,P(D)}{P(T)} = \frac{0.95 \times 0.01}{0.0293} = \frac{0.0095}{0.0293} \approx 0.324.$$
Despite a 95% sensitive test, only about 32.4% of positive results are true positives. This counter-intuitive result—the base rate fallacy—arises because the disease is rare. The 2% false positive rate applied to 99% of the healthy population generates more false positives than the 95% detection rate applied to the 1% diseased population.
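Continuing the snippet from the previous example, the posterior follows directly from Bayes' theorem:

```python
p_d, sens, fpr = 0.01, 0.95, 0.02   # prevalence, sensitivity, false-positive rate

# Evidence P(T) by the law of total probability, then Bayes' theorem:
p_pos = sens * p_d + fpr * (1 - p_d)
posterior = sens * p_d / p_pos
assert 0.32 < posterior < 0.33       # about 32.4%, matching the text
```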
2.4 Independence
Definition: Independence
Events $A$ and $B$ are independent if:
$$P(A \cap B) = P(A)\,P(B).$$
Equivalently, $P(A \mid B) = P(A)$ when $P(B) > 0$. Knowing $B$ has occurred does not change the probability of $A$.
Mutual vs. Pairwise Independence
Events $A_1, \ldots, A_n$ are mutually independent if for every subset $S \subseteq \{1, \ldots, n\}$ with $|S| \geq 2$:
$$P\left(\bigcap_{i \in S} A_i\right) = \prod_{i \in S} P(A_i).$$
This requires $2^n - n - 1$ conditions to hold, not just the $\binom{n}{2}$ pairwise conditions. Pairwise independence does not imply mutual independence.
Counterexample: Pairwise but not Mutually Independent
Toss two fair coins. Let $A$ = first coin heads, $B$ = second coin heads, $C$ = both coins show the same face. Then:
$P(A) = P(B) = P(C) = 1/2$
$P(A \cap B) = 1/4 = P(A)P(B)$ (independent)
$P(A \cap C) = 1/4 = P(A)P(C)$ (independent)
$P(B \cap C) = 1/4 = P(B)P(C)$ (independent)
But $P(A \cap B \cap C) = 1/4 \neq 1/8 = P(A)P(B)P(C)$. All three pairwise conditions hold but the triple condition fails.
Derivation 4: Independence Implies Complement Independence
If $A$ and $B$ are independent, then so are $A$ and $B^c$:
$$P(A \cap B^c) = P(A) - P(A \cap B) = P(A) - P(A)\,P(B) = P(A)\,(1 - P(B)) = P(A)\,P(B^c).$$
By the same argument, $A^c$ and $B$ are independent, and $A^c$ and $B^c$ are independent.
2.5 Bayesian Updating
Bayesian updating is the process of sequentially applying Bayes' theorem as new data arrives. After observing data $D_1$, the posterior becomes the new prior for updating with $D_2$, and so on.
Derivation 5: Sequential Bayesian Updating
Let $H$ be a hypothesis and $D_1, D_2, \ldots, D_n$ be sequentially observed data points. After observing $D_1$:
$$P(H \mid D_1) = \frac{P(D_1 \mid H)\,P(H)}{P(D_1)}.$$
After observing $D_2$, using $P(H \mid D_1)$ as the new prior:
$$P(H \mid D_1, D_2) = \frac{P(D_2 \mid H, D_1)\,P(H \mid D_1)}{P(D_2 \mid D_1)}.$$
If the data are conditionally independent given $H$ (i.e., $P(D_2 \mid H, D_1) = P(D_2 \mid H)$), then after $n$ observations:
$$P(H \mid D_1, \ldots, D_n) \propto P(H) \prod_{i=1}^{n} P(D_i \mid H).$$
The posterior is proportional to the prior times the product of likelihoods. The normalizing constant is determined by the requirement that probabilities sum (or integrate) to one over all hypotheses.
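The update loop can be sketched in a few lines (the hypothesis names and likelihood values are illustrative, not from the text): multiply each hypothesis's prior by its likelihood, then renormalize.

```python
def update(prior, likelihoods):
    """One Bayes step over a discrete set of hypotheses.
    prior: dict hypothesis -> probability; likelihoods: dict hypothesis -> P(D | H)."""
    unnorm = {h: prior[h] * likelihoods[h] for h in prior}
    z = sum(unnorm.values())                  # normalizing constant P(D)
    return {h: p / z for h, p in unnorm.items()}

# Illustrative run: two hypotheses about a coin's heads probability,
# updated flip by flip; each posterior becomes the next prior.
belief = {"fair": 0.5, "biased": 0.5}
for flip in ["H", "H", "T", "H"]:
    lik = {"fair": 0.5, "biased": 0.7 if flip == "H" else 0.3}
    belief = update(belief, lik)

# Three heads out of four flips should favor the biased hypothesis.
assert belief["biased"] > 0.5
```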
The Odds Form of Bayes' Theorem
An elegant alternative formulation uses odds. The posterior odds equal the prior odds times the likelihood ratio (Bayes factor):
$$\frac{P(H_1 \mid D)}{P(H_0 \mid D)} = \frac{P(H_1)}{P(H_0)} \times \frac{P(D \mid H_1)}{P(D \mid H_0)}.$$
This form is particularly useful because the normalizing constant $P(D)$ cancels out, and it clearly separates the contribution of the data (likelihood ratio) from our prior beliefs.
Example: Updating Coin Bias
Suppose we suspect a coin may be biased. We consider two hypotheses: $H_0$: the coin is fair ($p = 0.5$) with prior probability 0.8, and $H_1$: the coin is biased ($p = 0.7$) with prior probability 0.2. We flip the coin 10 times and observe 8 heads.
The likelihood ratio is (the binomial coefficients cancel):
$$\frac{P(D \mid H_1)}{P(D \mid H_0)} = \frac{\binom{10}{8}(0.7)^8(0.3)^2}{\binom{10}{8}(0.5)^{10}} = \frac{0.00519}{0.000977} \approx 5.31.$$
The posterior odds are:
$$\frac{P(H_1 \mid D)}{P(H_0 \mid D)} = \frac{0.2}{0.8} \times 5.31 \approx 1.33.$$
Converting back: $P(H_1 \mid D) = 1.33 / (1 + 1.33) \approx 0.570$. The data have shifted our belief from 20% to about 57% that the coin is biased.
2.6 Applications
Application 1: Spam Filtering (Naive Bayes)
The Naive Bayes classifier uses Bayes' theorem with a conditional independence assumption to classify emails. Given words $w_1, \ldots, w_n$ in an email:
$$P(\text{spam} \mid w_1, \ldots, w_n) \propto P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam}).$$
Despite the strong (and often violated) independence assumption, Naive Bayes classifiers perform remarkably well in practice, a phenomenon studied by Domingos and Pazzani (1997).
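A toy version of the classifier can be sketched as follows (the prior and word probabilities are made-up illustrative numbers, not a trained model); working in log space avoids numerical underflow when many words are multiplied:

```python
from math import exp, log

p_spam = 0.3   # hypothetical prior P(spam)
# Hypothetical per-class word likelihoods P(w | class):
p_word_given_spam = {"free": 0.30, "winner": 0.20, "meeting": 0.01}
p_word_given_ham = {"free": 0.02, "winner": 0.01, "meeting": 0.20}

def spam_posterior(words):
    """Posterior P(spam | words) under the naive independence assumption."""
    log_spam = log(p_spam) + sum(log(p_word_given_spam[w]) for w in words)
    log_ham = log(1 - p_spam) + sum(log(p_word_given_ham[w]) for w in words)
    odds = exp(log_spam - log_ham)           # posterior odds spam : ham
    return odds / (1 + odds)

assert spam_posterior(["free", "winner"]) > 0.9   # spammy words dominate
assert spam_posterior(["meeting"]) < 0.1          # ham-like word dominates
```

A real implementation would estimate the word probabilities from labeled data and smooth them to avoid zero likelihoods.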
Application 2: The Monty Hall Problem
In this famous problem, a prize is behind one of three doors. You pick door 1. The host, who knows where the prize is, opens door 3 (which has no prize). Should you switch to door 2?
Let $C_i$ = prize behind door $i$, $H_3$ = host opens door 3. By Bayes' theorem:
$$P(C_2 \mid H_3) = \frac{P(H_3 \mid C_2)\,P(C_2)}{P(H_3)} = \frac{1 \times (1/3)}{1/2} = \frac{2}{3}.$$
Switching doubles your chance of winning from $1/3$ to $2/3$. The key is that the host's action provides information: $P(H_3 \mid C_2) = 1$ because the host must open door 3 if the prize is behind door 2.
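The full posterior over all three doors can be computed exactly with the law of total probability and Bayes' theorem:

```python
from fractions import Fraction

third = Fraction(1, 3)                             # uniform prior P(C_i)
# Likelihoods P(H3 | C_i), given that you picked door 1:
lik = {1: Fraction(1, 2), 2: Fraction(1), 3: Fraction(0)}

p_h3 = sum(lik[c] * third for c in lik)            # law of total probability
posterior = {c: lik[c] * third / p_h3 for c in lik}

assert p_h3 == Fraction(1, 2)
assert posterior[1] == Fraction(1, 3)
assert posterior[2] == Fraction(2, 3)              # switching wins 2/3 of the time
```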
Application 3: Prosecutor's Fallacy
In forensic science, confusing $P(\text{evidence} \mid \text{innocent})$ with $P(\text{innocent} \mid \text{evidence})$ is called the prosecutor's fallacy. If a DNA match occurs with probability $10^{-6}$ for an innocent person, this does not mean there is a one-in-a-million chance the defendant is innocent. By Bayes' theorem, the posterior depends critically on the prior probability of guilt, which depends on other evidence.
Application 4: Machine Learning and Classification
In Bayesian classification, we assign a new observation $\mathbf{x}$ to the class $c^*$ that maximizes the posterior:
$$c^* = \arg\max_{c} P(C = c \mid \mathbf{x}) = \arg\max_{c} P(\mathbf{x} \mid C = c)\,P(C = c).$$
This framework encompasses Naive Bayes, linear discriminant analysis, quadratic discriminant analysis, and Gaussian process classifiers. The choice of likelihood model $P(\mathbf{x} \mid C = c)$ determines the classifier's decision boundaries.
2.7 Python Simulation
This simulation demonstrates Bayes' theorem through the medical testing example, the Monty Hall problem, and sequential Bayesian updating.
Conditional Probability: Bayes Theorem and Bayesian Updating
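A sketch of such a simulation is below (our own code, written to match the examples in this chapter): Monte Carlo estimates of the medical-test posterior and the Monty Hall switching strategy, to be compared against the analytic answers 0.324 and 2/3.

```python
import random

random.seed(42)
N = 100_000

# --- Medical test: estimate P(D | T+) by simulation ---
true_pos = false_pos = 0
for _ in range(N):
    diseased = random.random() < 0.01        # 1% prevalence
    if diseased:
        if random.random() < 0.95:           # sensitivity
            true_pos += 1
    elif random.random() < 0.02:             # false-positive rate (1 - specificity)
        false_pos += 1
p_d_given_pos = true_pos / (true_pos + false_pos)

# --- Monty Hall: estimate the win rate when always switching ---
switch_wins = 0
for _ in range(N):
    car = random.randrange(3)
    pick = random.randrange(3)
    # Host opens a door that is neither the pick nor the car:
    opened = random.choice([d for d in range(3) if d != pick and d != car])
    switched = next(d for d in range(3) if d != pick and d != opened)
    switch_wins += (switched == car)
p_switch = switch_wins / N

print(f"P(D | T+)       ~ {p_d_given_pos:.3f}  (theory: 0.324)")
print(f"P(win | switch) ~ {p_switch:.3f}  (theory: 0.667)")
```

Sequential Bayesian updating can be layered on top by re-running the medical-test posterior after each simulated test result, with the previous posterior as the new prior.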
2.8 Summary and Key Takeaways
Conditional Probability
$P(A \mid B) = P(A \cap B) / P(B)$ defines a new probability measure on $(\Omega, \mathcal{F})$. It represents our updated belief about $A$ after learning $B$.
Bayes' Theorem
The posterior is proportional to the likelihood times the prior. This is the foundation of all Bayesian reasoning and statistical inference.
Independence
Events are independent when $P(A \cap B) = P(A)P(B)$. Mutual independence is stronger than pairwise independence and requires $2^n - n - 1$ conditions.
Bayesian Updating
Sequential application of Bayes' theorem allows beliefs to be continuously updated as new data arrives, with the posterior becoming the prior for the next update.
Common Pitfalls
The base rate fallacy and prosecutor's fallacy arise from confusing $P(A \mid B)$ with $P(B \mid A)$. Always apply Bayes' theorem explicitly.
Practice Problems
Problem 1: A medical test for a disease has sensitivity 95% (true positive rate) and specificity 98% (true negative rate). If the disease prevalence is 0.1%, what is the probability a person who tests positive actually has the disease?
Solution:
1. Let $D$ = disease, $T^+$ = positive test. Given: $P(T^+ \mid D) = 0.95$, $P(T^- \mid D^c) = 0.98$, $P(D) = 0.001$.
2. Apply Bayes' theorem: $P(D \mid T^+) = \frac{P(T^+ \mid D)P(D)}{P(T^+)}$.
3. Total probability of positive test: $P(T^+) = P(T^+ \mid D)P(D) + P(T^+ \mid D^c)P(D^c)$.
4. $P(T^+) = 0.95 \times 0.001 + 0.02 \times 0.999 = 0.00095 + 0.01998 = 0.02093$.
5. $P(D \mid T^+) = \frac{0.00095}{0.02093} = 0.0454$ or about 4.5%.
6. Despite the excellent test (95%/98%), only 4.5% of positives are true positives! This is the base rate fallacy: the low prevalence means false positives vastly outnumber true positives. This is why confirmatory testing is essential.
Problem 2: In the Monty Hall problem, you choose door 1, the host opens door 3 (showing a goat). Should you switch to door 2? Prove the answer using Bayes' theorem.
Solution:
1. Prior probabilities: $P(C_i) = 1/3$ for each door $i = 1, 2, 3$ (car behind door $i$).
2. Let $H_3$ = host opens door 3. Likelihoods: $P(H_3 \mid C_1) = 1/2$ (host chooses randomly between doors 2 and 3); $P(H_3 \mid C_2) = 1$ (host must open door 3); $P(H_3 \mid C_3) = 0$ (host never reveals car).
3. $P(H_3) = P(H_3 \mid C_1)(1/3) + P(H_3 \mid C_2)(1/3) + P(H_3 \mid C_3)(1/3) = 1/6 + 1/3 + 0 = 1/2$.
4. Posterior: $P(C_1 \mid H_3) = \frac{(1/2)(1/3)}{1/2} = 1/3$.
5. $P(C_2 \mid H_3) = \frac{(1)(1/3)}{1/2} = 2/3$.
6. Switching to door 2 gives probability $2/3$ of winning vs. $1/3$ for staying. You should always switch. The key insight: the host's action provides information that concentrates the probability of the other two doors onto the remaining one.
Problem 3: In a room of $n$ people, what is the probability that at least two share a birthday? Find the smallest $n$ for which this exceeds 50%.
Solution:
1. It is easier to compute the complement: $P(\text{all different}) = \frac{365}{365} \cdot \frac{364}{365} \cdot \frac{363}{365} \cdots \frac{365-n+1}{365}$.
2. $P(\text{all different}) = \prod_{k=0}^{n-1}\left(1 - \frac{k}{365}\right)$.
3. $P(\text{at least one match}) = 1 - P(\text{all different})$.
4. Using the approximation $\ln(1-x) \approx -x$: $\ln P(\text{all diff}) \approx -\sum_{k=0}^{n-1}\frac{k}{365} = -\frac{n(n-1)}{730}$.
5. Setting $P(\text{match}) = 0.5$: $\frac{n(n-1)}{730} = \ln 2$, so $n(n-1) \approx 730 \times 0.693 \approx 505.9$, giving $n \approx 23$.
6. The exact answer is $n = 23$: $P(\text{match}) = 50.7\%$. This counterintuitive result occurs because there are $\binom{23}{2} = 253$ pairs to check, and even small pairwise collision probabilities accumulate rapidly.
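The exact computation confirms the approximation, and the threshold at $n = 23$:

```python
def p_shared_birthday(n):
    """P(at least two of n people share a birthday), 365 equally likely days."""
    p_all_diff = 1.0
    for k in range(n):                       # multiply the complement's factors
        p_all_diff *= (365 - k) / 365
    return 1 - p_all_diff

assert p_shared_birthday(22) < 0.5 < p_shared_birthday(23)   # threshold is n = 23
assert abs(p_shared_birthday(23) - 0.507) < 0.001            # about 50.7%
```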
Problem 4: Events $A$ and $B$ satisfy $P(A) = 0.4$, $P(B) = 0.5$, $P(A \cap B) = 0.2$. Are $A$ and $B$ independent? Find $P(A \mid B)$ and $P(A \cup B)$.
Solution:
1. Test independence: $P(A)P(B) = 0.4 \times 0.5 = 0.2 = P(A \cap B)$.
2. Since $P(A \cap B) = P(A)P(B)$, $A$ and $B$ are independent.
3. $P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{0.2}{0.5} = 0.4 = P(A)$. Consistent with independence: knowing $B$ occurred doesn't change the probability of $A$.
4. $P(A \cup B) = P(A) + P(B) - P(A \cap B) = 0.4 + 0.5 - 0.2 = 0.7$.
5. Also verify: $P(B \mid A) = P(A \cap B)/P(A) = 0.2/0.4 = 0.5 = P(B)$. Symmetric consistency check.
6. Note: independence ($P(A \cap B) = P(A)P(B)$) is different from mutual exclusivity ($P(A \cap B) = 0$). Independent events with positive probabilities are never mutually exclusive, and vice versa.
Problem 5: A factory has three machines producing 30%, 45%, and 25% of output with defect rates 2%, 3%, and 5% respectively. A randomly chosen item is defective. What is the probability it came from machine 3? (Law of total probability + Bayes' theorem)
Solution:
1. Let $M_i$ = item from machine $i$, $D$ = defective. Given: $P(M_1) = 0.30$, $P(M_2) = 0.45$, $P(M_3) = 0.25$.
2. Defect rates: $P(D \mid M_1) = 0.02$, $P(D \mid M_2) = 0.03$, $P(D \mid M_3) = 0.05$.
3. Total probability: $P(D) = \sum_i P(D \mid M_i)P(M_i) = 0.02(0.30) + 0.03(0.45) + 0.05(0.25)$.
4. $P(D) = 0.006 + 0.0135 + 0.0125 = 0.032$ (3.2% overall defect rate).
5. By Bayes' theorem: $P(M_3 \mid D) = \frac{P(D \mid M_3)P(M_3)}{P(D)} = \frac{0.05 \times 0.25}{0.032} = \frac{0.0125}{0.032} = 0.391$.
6. Machine 3 produces only 25% of items but is responsible for 39.1% of defects due to its higher defect rate. Similarly, $P(M_1 \mid D) = 0.188$ and $P(M_2 \mid D) = 0.422$. Verify: $0.188 + 0.422 + 0.391 = 1.001 \approx 1$.
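The whole solution, including the consistency check that the posteriors sum to 1, fits in a few lines:

```python
priors = [0.30, 0.45, 0.25]          # P(M_i): each machine's share of output
defect_rates = [0.02, 0.03, 0.05]    # P(D | M_i)

# Law of total probability, then Bayes' theorem for each machine:
p_d = sum(p * r for p, r in zip(priors, defect_rates))
posteriors = [p * r / p_d for p, r in zip(priors, defect_rates)]

assert abs(p_d - 0.032) < 1e-12              # 3.2% overall defect rate
assert abs(posteriors[2] - 0.391) < 0.001    # machine 3's share of defects
assert abs(sum(posteriors) - 1) < 1e-12      # posteriors form a distribution
```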