Probability Theory Refresher
Note
This chapter was authored by Todd M. Gureckis. See the license information for this book.
Introduction
Probability is at the heart of computational cognitive science. It provides the mathematical language for reasoning about uncertainty—and uncertainty is everywhere in cognition. When you hear a word in a noisy room, your brain must infer what was said from ambiguous acoustic signals. When you plan a route through a city, you reason about which paths are likely to be fastest. When you learn a new concept, you generalize from a few examples to make predictions about new cases.
The goal of these notes is to build up the key ideas of probability theory with an emphasis on intuition rather than formal rigor. We want you to understand why these concepts matter for cognitive science, not to prove measure-theoretic foundations.
By the end of this chapter you should be comfortable with the following ideas:
- Random variables: What they are and how they assign probabilities to outcomes
- Joint distributions: Probabilities over combinations of multiple variables
- Conditional distributions: How probabilities change when we have partial information
- Marginal distributions: Summing out variables we don't care about
- Bayes' rule: The fundamental equation for updating beliefs with evidence
- Continuous probability: How to handle real-valued quantities with probability density functions
- The Gaussian distribution: The most important continuous distribution in science
Let's get started.
Random Variables and Probability Distributions
A random variable is a variable that takes on one of a set of possible values (or outcomes), with a probability assigned to each outcome. All the probabilities must sum to 1.
Consider a random variable $W$ representing the weather, which can take one of three values:
| Weather | $P(W)$ |
|---|---|
| sunny | 0.7 |
| cloudy | 0.2 |
| stormy | 0.1 |
This is called a probability distribution: it tells us how probability is distributed across the possible outcomes. Notice that the probabilities sum to 1: $0.7 + 0.2 + 0.1 = 1$.
Notation
The expression $P(W = \text{sunny}) = 0.7$ reads as "the probability that the random variable $W$ takes the value sunny is 0.7." When the variable is clear from context, we will sometimes abbreviate this to $P(\text{sunny})$.
In Python, we can represent this distribution as a simple function:
def P_W(weather):
    if weather == "sunny":
        return 0.7
    if weather == "cloudy":
        return 0.2
    if weather == "stormy":
        return 0.1

We can verify our probabilities sum to 1:
total = sum(P_W(w) for w in ["sunny", "cloudy", "stormy"])
print(total)  # 1.0 (up to floating-point rounding)

Joint Distributions
A joint distribution assigns probabilities to combinations of two or more random variables. Instead of asking "what's the probability of sunny weather?" we can ask "what's the probability of sunny weather and heavy traffic?"
Let's introduce a second random variable $T$ representing traffic, which can be yes or no. The joint distribution $P(W, T)$ assigns a probability to each combination of weather and traffic:
| Weather | Traffic | $P(W, T)$ |
|---|---|---|
| sunny | yes | 0.1 |
| cloudy | yes | 0.1 |
| stormy | yes | 0.1 |
| sunny | no | 0.6 |
| cloudy | no | 0.1 |
| stormy | no | 0.0 |
Notice a few things:
- All probabilities still sum to 1: $0.1 + 0.1 + 0.1 + 0.6 + 0.1 + 0.0 = 1$
- Some combinations are more likely than others—sunny with no traffic (0.6) is the most common
- Some combinations may have probability 0—here, stormy weather with no traffic never occurs
Why joint distributions matter
Joint distributions capture the relationships between variables. In this example, weather and traffic are related: stormy weather always comes with traffic (probability 1.0 given stormy), while sunny weather usually means no traffic. Understanding these relationships is crucial for reasoning about the world.
We can represent this in Python using a dictionary:
def P_WT(weather, traffic):
states = {
("sunny", "yes"): 0.1,
("cloudy", "yes"): 0.1,
("stormy", "yes"): 0.1,
("sunny", "no"): 0.6,
("cloudy", "no"): 0.1,
("stormy", "no"): 0.0,
}
return states[(weather, traffic)]Conditional Distributions
What if we know something about one variable and want to reason about another? For example, if I look outside and see that it's stormy, what's the probability there's traffic?
This is answered by a conditional distribution: the distribution of one variable given (or conditioned on) another. We write this as $P(T \mid W)$, read as "the probability of traffic given weather."
The key insight is the product rule:

$$P(W, T) = P(W) \, P(T \mid W)$$
This says: the probability of a specific weather-traffic combination equals the probability of that weather times the probability of that traffic given that weather.
Rearranging, we get a formula for conditional probability:

$$P(T \mid W) = \frac{P(W, T)}{P(W)}$$
Intuition for the formula
Think of it this way: of all the probability assigned to a particular weather condition, what fraction of it also involves the traffic outcome we care about? Dividing the joint probability $P(W, T)$ by the marginal probability $P(W)$ gives exactly that fraction.
Let's compute the conditional distribution $P(T \mid W)$ in Python:
def P_T_given_W(traffic, weather):
    return P_WT(weather, traffic) / P_W(weather)

# Check the probabilities
for weather in ["sunny", "cloudy", "stormy"]:
    for traffic in ["yes", "no"]:
        prob = P_T_given_W(traffic, weather)
        print(f"P(T={traffic} | W={weather}) = {prob:.3f}")

This gives us:
| Weather | Traffic | $P(T \mid W)$ |
|---|---|---|
| sunny | yes | 0.143 |
| cloudy | yes | 0.500 |
| stormy | yes | 1.000 |
| sunny | no | 0.857 |
| cloudy | no | 0.500 |
| stormy | no | 0.000 |
Now we can answer our original question: given stormy weather, the probability of traffic is 1.0 (100%). Conversely, given sunny weather, traffic is unlikely (only 14.3%).
Notice that for each weather condition, the probabilities of traffic=yes and traffic=no sum to 1. This is because $P(T \mid W)$ is itself a proper probability distribution over $T$ for each fixed value of $W$.
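As a quick sanity check, the sketch below (re-declaring the chapter's joint table as a plain dict so it runs standalone) verifies that each conditional distribution is properly normalized:

```python
# Joint distribution from the chapter's table, stored as a dict.
P_WT = {
    ("sunny", "yes"): 0.1, ("cloudy", "yes"): 0.1, ("stormy", "yes"): 0.1,
    ("sunny", "no"):  0.6, ("cloudy", "no"):  0.1, ("stormy", "no"):  0.0,
}

def P_W(weather):
    # Marginal P(W): sum the joint over both traffic values.
    return sum(p for (w, t), p in P_WT.items() if w == weather)

# For each weather value, the conditional P(T | W) should sum to 1 over T.
totals = {
    weather: sum(P_WT[(weather, t)] / P_W(weather) for t in ["yes", "no"])
    for weather in ["sunny", "cloudy", "stormy"]
}
print(totals)  # every value is 1.0 (up to floating-point rounding)
```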
Marginal Distributions
Sometimes we have a joint distribution but only care about one of the variables. For example, suppose we want to know the overall probability of traffic on any given day, regardless of weather. This is called a marginal distribution.
We obtain a marginal distribution by marginalizing out (or summing out) the variable we don't care about:

$$P(T) = \sum_{w} P(W = w, T)$$
Expanding this sum:

For traffic=yes: $P(T = \text{yes}) = 0.1 + 0.1 + 0.1 = 0.3$

For traffic=no: $P(T = \text{no}) = 0.6 + 0.1 + 0.0 = 0.7$
So the marginal distribution of traffic is:
| Traffic | $P(T)$ |
|---|---|
| yes | 0.3 |
| no | 0.7 |
In Python:
def P_T(traffic):
    return sum(P_WT(weather, traffic)
               for weather in ["sunny", "cloudy", "stormy"])

Why "marginal"?
The name comes from a historical convention of writing joint distributions as tables and computing the single-variable distributions by summing rows or columns—these sums were written in the margins of the table.
When to Add vs. Multiply Probabilities
A common source of confusion is knowing when to add probabilities and when to multiply them. Here's the rule:
Add probabilities when you want "this OR that" for mutually exclusive (non-overlapping) events:

$$P(A \text{ or } B) = P(A) + P(B)$$
Multiply probabilities when you want "this AND that" for independent events:

$$P(A \text{ and } B) = P(A) \, P(B)$$
Independence
Two events are independent if knowing one tells you nothing about the other. Formally, $A$ and $B$ are independent when $P(A, B) = P(A) \, P(B)$, or equivalently when $P(A \mid B) = P(A)$.
For events that are not independent, we use the product rule instead:

$$P(A, B) = P(A) \, P(B \mid A)$$
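In our running example, weather and traffic fail the independence test. A minimal standalone sketch (re-declaring the chapter's distributions as dicts) checks every combination:

```python
# Distributions from the chapter's tables.
P_WT = {
    ("sunny", "yes"): 0.1, ("cloudy", "yes"): 0.1, ("stormy", "yes"): 0.1,
    ("sunny", "no"):  0.6, ("cloudy", "no"):  0.1, ("stormy", "no"):  0.0,
}
P_W = {"sunny": 0.7, "cloudy": 0.2, "stormy": 0.1}
P_T = {"yes": 0.3, "no": 0.7}

# W and T are independent only if P(W, T) == P(W) * P(T) for every combination.
independent = all(
    abs(P_WT[(w, t)] - P_W[w] * P_T[t]) < 1e-9
    for w in P_W for t in P_T
)
print(independent)  # False: e.g. P(sunny, no) = 0.6 but P(sunny) * P(no) = 0.49
```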
Bayes' Rule
Bayes' rule is perhaps the most important equation in probabilistic reasoning. It tells us how to update our beliefs when we observe new evidence.
Suppose we have a set of hypotheses $h$ and we observe some data $d$. Bayes' rule says:

$$P(h \mid d) = \frac{P(d \mid h) \, P(h)}{P(d)}$$
Let's break down each term:
- $P(h)$ — the prior: our belief in hypothesis $h$ before seeing the data
- $P(d \mid h)$ — the likelihood: how likely we'd see this data if hypothesis $h$ were true
- $P(d)$ — the marginal likelihood or evidence: the overall probability of the data, obtained by summing over all hypotheses: $P(d) = \sum_h P(d \mid h) \, P(h)$
- $P(h \mid d)$ — the posterior: our updated belief in hypothesis $h$ after seeing the data
The core insight
Bayes' rule says that our posterior belief is proportional to prior belief times likelihood. Hypotheses that were already plausible (high prior) and that predict the observed data well (high likelihood) become more probable after seeing the data.
Example: Medical Diagnosis
Let's work through a classic example. A patient coughs. What's the probability they have lung cancer?
Hypotheses:
- $h_1$: The patient is healthy
- $h_2$: The patient has a cold
- $h_3$: The patient has lung cancer
Prior probabilities (from general population statistics):
- $P(h_1) = 0.90$ — most people are healthy
- $P(h_2) = 0.09$ — colds are somewhat common
- $P(h_3) = 0.01$ — lung cancer is rare
Likelihoods (probability of coughing given each condition):
- $P(\text{cough} \mid h_1) = 0.01$ — healthy people rarely cough
- $P(\text{cough} \mid h_2) = 0.50$ — people with colds often cough
- $P(\text{cough} \mid h_3) = 0.99$ — lung cancer patients almost always cough
Now we apply Bayes' rule. First, compute the numerator (likelihood × prior) for each hypothesis:

$$P(\text{cough} \mid h_1) \, P(h_1) = 0.01 \times 0.90 = 0.009$$
$$P(\text{cough} \mid h_2) \, P(h_2) = 0.50 \times 0.09 = 0.045$$
$$P(\text{cough} \mid h_3) \, P(h_3) = 0.99 \times 0.01 = 0.0099$$
Next, compute the marginal probability of the data by summing these:

$$P(\text{cough}) = 0.009 + 0.045 + 0.0099 = 0.0639$$
Finally, divide to get the posteriors:

- $P(h_1 \mid \text{cough}) = 0.009 / 0.0639 \approx 0.14$ (14% chance healthy)
- $P(h_2 \mid \text{cough}) = 0.045 / 0.0639 \approx 0.70$ (70% chance cold)
- $P(h_3 \mid \text{cough}) = 0.0099 / 0.0639 \approx 0.15$ (15% chance lung cancer)
The most likely explanation is a cold (70%), not lung cancer (15%)—even though lung cancer patients almost always cough. Why? Because colds are much more common than lung cancer. The prior matters.
Base rate neglect
Many people intuitively think that if lung cancer causes coughing 99% of the time, then coughing must strongly indicate lung cancer. This is called base rate neglect—ignoring how rare lung cancer is in the first place. Bayes' rule correctly accounts for both the likelihood and the prior.
Implementing Bayes' Rule in Python
We can implement Bayes' rule with just a few lines of code:
import numpy as np

def normalize(x):
    """Normalize an array to sum to 1."""
    return x / np.sum(x)

def posterior(prior, likelihood):
    """Compute the posterior using Bayes' rule."""
    unnormalized = likelihood * prior
    return normalize(unnormalized)

Using our medical example:
prior = np.array([0.90, 0.09, 0.01])
likelihood = np.array([0.01, 0.50, 0.99])
post = posterior(prior, likelihood)
print(f"P(Healthy | Cough) = {post[0]:.3f}")     # 0.141
print(f"P(Cold | Cough) = {post[1]:.3f}")        # 0.704
print(f"P(LungCancer | Cough) = {post[2]:.3f}")  # 0.155

Inverting Conditional Probabilities
Bayes' rule is particularly useful when we know $P(T \mid W)$ but want to reason in the other direction and find $P(W \mid T)$. For example, suppose we observe that there is no traffic: what does that tell us about the weather?

Bayes' rule lets us "invert" the conditional:

$$P(W \mid T) = \frac{P(T \mid W) \, P(W)}{P(T)}$$
hypotheses = ["sunny", "cloudy", "stormy"]

# We observe: no traffic
traffic = "no"

# Prior: P(W)
prior = np.array([P_W(w) for w in hypotheses])

# Likelihood: P(T=no | W)
likelihood = np.array([P_T_given_W(traffic, w) for w in hypotheses])

# Posterior: P(W | T=no)
post = posterior(prior, likelihood)

for i, weather in enumerate(hypotheses):
    print(f"P(W={weather} | no traffic) = {post[i]:.3f}")

This gives us:

P(W=sunny | no traffic) = 0.857
P(W=cloudy | no traffic) = 0.143
P(W=stormy | no traffic) = 0.000
Given no traffic, it's almost certainly sunny.
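As a cross-check, the Bayes' rule posterior should agree with computing the conditional directly from the joint table, since $P(W \mid T) = P(W, T) / P(T)$. A standalone sketch (joint table re-declared as a dict):

```python
# Joint distribution from the chapter's table.
P_WT = {
    ("sunny", "yes"): 0.1, ("cloudy", "yes"): 0.1, ("stormy", "yes"): 0.1,
    ("sunny", "no"):  0.6, ("cloudy", "no"):  0.1, ("stormy", "no"):  0.0,
}
weathers = ["sunny", "cloudy", "stormy"]

# Direct route: P(W | T=no) = P(W, T=no) / P(T=no).
P_T_no = sum(P_WT[(w, "no")] for w in weathers)
direct = [P_WT[(w, "no")] / P_T_no for w in weathers]
print(direct)  # [0.857..., 0.142..., 0.0] -- same answer as Bayes' rule
```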
Continuous Probability
So far we've dealt with discrete random variables—variables that take values from a finite set (like weather = {sunny, cloudy, stormy}). But many quantities we care about are continuous—they can take any real value. Examples include:
- The exact time someone takes to respond in an experiment
- The angle of a robot's arm
- The intensity of a pixel in an image
From Discrete to Continuous
Consider a discrete random variable $X$, say the position of an object, which can be in one of six states:
| State | $P(X)$ |
|---|---|
| 1 | 1/6 |
| 2 | 1/12 |
| 3 | 1/6 |
| 4 | 1/3 |
| 5 | 1/12 |
| 6 | 1/6 |
Now imagine we measure position more precisely, distinguishing states 1a and 1b within what was state 1. The probability mass that was assigned to state 1 must now be split between 1a and 1b. Each individual state has less probability.
As we make our measurements more and more precise—dividing states into finer and finer subdivisions—each individual state's probability shrinks. In the limit of infinitely precise measurement (continuous position), the probability of any exact value becomes zero.
Key insight
As the number of possible outcomes increases toward infinity, the probability of any single outcome shrinks toward zero.
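We can watch this shrinkage numerically. The sketch below starts from the six-state table above and repeatedly splits every state into two equally likely sub-states; the total probability stays at 1 while the largest single-state probability keeps halving:

```python
# Probabilities from the six-state table above.
probs = [1/6, 1/12, 1/6, 1/3, 1/12, 1/6]

# Each refinement splits every state into two equally likely sub-states,
# so each new state gets half the probability of its parent.
for level in range(5):
    print(f"{len(probs)} states, largest single-state probability = {max(probs):.5f}")
    probs = [p / 2 for p in probs for _ in range(2)]
```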
This seems like a problem: if every exact value has probability zero, how can we say anything useful about continuous quantities?
Probability Density Functions
The solution is to talk about the probability that $X$ falls within an interval $[a, b]$, rather than the probability of any single exact value.
The PDF $p(x)$ is a function whose integral gives probabilities:

$$P(a \le X \le b) = \int_a^b p(x) \, dx$$

This integral computes the area under the PDF curve between $a$ and $b$.
Density vs. probability
The height of a PDF at a point is not a probability—it's a density. You need to integrate (compute area) to get an actual probability. This is why PDF values can be greater than 1, as long as the total area under the curve equals 1.
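A simple way to see a density exceed 1 is a uniform density on a narrow interval (a made-up example, not from the chapter): the height is constant inside the interval, and the narrower the interval, the taller the density must be for the area to stay at 1.

```python
# A uniform density on the interval [0, 0.5]: constant height inside it.
def uniform_pdf(x, low=0.0, high=0.5):
    return 1.0 / (high - low) if low <= x <= high else 0.0

height = uniform_pdf(0.25)     # a density value, not a probability
area = height * (0.5 - 0.0)    # width * height = total probability

print(height)  # 2.0 -- a density greater than 1
print(area)    # 1.0 -- but the total area is still 1
```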
The Gaussian (Normal) Distribution
The most important continuous distribution in science is the Gaussian or normal distribution. It's the famous bell curve:

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
The distribution has two parameters:

- $\mu$ (mu): the mean, the center of the bell curve
- $\sigma$ (sigma): the standard deviation, which controls the width
When $\mu = 0$ and $\sigma = 1$, the distribution is called the standard normal distribution.
In Python:
import numpy as np

def gaussian(x, mu, sigma):
    """Evaluate the Gaussian PDF at x."""
    return (np.exp(-((x - mu)**2) / (2 * sigma**2)) /
            (np.sqrt(2 * np.pi) * sigma))

Computing Probabilities from PDFs
To find the probability that a normally distributed variable falls between 0 and 1, we integrate the Gaussian PDF over that range:

$$P(0 \le X \le 1) = \int_0^1 p(x) \, dx$$

This integral doesn't have a simple closed form, but we can compute it numerically using the cumulative distribution function (CDF). The CDF $\Phi(x)$ gives the total probability below a point: $\Phi(x) = P(X \le x)$.
To find the probability of an interval, subtract one CDF value from another:

$$P(a \le X \le b) = \Phi(b) - \Phi(a)$$
In Python:
from scipy.stats import norm

# P(0 <= X <= 1) for a standard normal
prob = norm.cdf(1) - norm.cdf(0)
print(f"P(0 <= X <= 1) = {prob:.3f}")  # ≈ 0.341

The 68-95-99.7 rule
For a normal distribution:
- About 68% of values fall within 1 standard deviation of the mean
- About 95% fall within 2 standard deviations
- About 99.7% fall within 3 standard deviations
This is useful for quick mental estimates.
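These percentages follow directly from the CDF, and we can verify them with scipy (already used above):

```python
from scipy.stats import norm

# Probability mass within k standard deviations of the mean of a normal.
for k in [1, 2, 3]:
    mass = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} standard deviation(s): {mass:.4f}")
# within 1: 0.6827, within 2: 0.9545, within 3: 0.9973
```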
Why Gaussians Are Everywhere
The Gaussian distribution shows up constantly in science for a deep reason: the Central Limit Theorem. This theorem says that when you add up many independent random variables, each with finite variance, the sum tends toward a Gaussian regardless of the individual variables' distributions.
Many natural quantities result from the combination of many small, independent factors:
- Your height is influenced by thousands of genes, nutrition, environment, etc.
- Measurement error results from many small sources of noise
- Neural firing rates result from inputs from many neurons
When many independent factors combine additively, the result is approximately Gaussian.
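A quick simulation makes this concrete (a sketch; the choice of 30 uniform summands and 100,000 samples is arbitrary): each summand is flat, yet the sums behave like the Gaussian the CLT predicts.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

# Each sample is the sum of 30 independent uniform(0, 1) draws.
n_terms, n_samples = 30, 100_000
sums = rng.uniform(0.0, 1.0, size=(n_samples, n_terms)).sum(axis=1)

# CLT prediction: roughly Gaussian with mean n/2 = 15 and std sqrt(n/12) ~ 1.58.
print(f"mean = {sums.mean():.2f}")  # ~15.00
print(f"std  = {sums.std():.2f}")   # ~1.58

# A Gaussian puts about 68% of its mass within one std of the mean:
frac = np.mean(np.abs(sums - sums.mean()) < sums.std())
print(f"within 1 std: {frac:.3f}")  # ~0.68
```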
Putting It All Together
Let's summarize the key concepts:
Random variables assign probabilities to outcomes. Joint distributions give probabilities for combinations of variables. Conditional distributions tell us how one variable depends on another. Marginal distributions are obtained by summing out variables we don't need.
Bayes' rule is the fundamental equation for updating beliefs:

$$P(h \mid d) = \frac{P(d \mid h) \, P(h)}{P(d)}$$
It tells us how to combine prior beliefs with evidence to form posterior beliefs.
For continuous variables, we use probability density functions. The Gaussian distribution is particularly important due to the Central Limit Theorem.
These concepts form the foundation for probabilistic models of cognition. Whether we're modeling how people learn categories, make decisions under uncertainty, or perceive ambiguous stimuli, probability theory provides the mathematical framework.
Learn More
If you'd like to go deeper into probability theory, here are some excellent resources:
- 3Blue1Brown: Probability — Visual explanations of probability concepts including Bayes' theorem.
- Seeing Theory — A beautiful interactive introduction to probability and statistics.
- Khan Academy: Probability — Comprehensive video lessons on probability fundamentals.
- StatQuest: Probability — Clear, simple explanations of probability and statistics concepts.
- MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press. Available free at inference.org.uk/mackay/itila/. Chapters 2–3 cover probability fundamentals beautifully.