
Probability Theory Refresher DRAFT

Note

This chapter was authored by Todd M. Gureckis. See the license information for this book.

Introduction

Probability is at the heart of computational cognitive science. It provides the mathematical language for reasoning about uncertainty—and uncertainty is everywhere in cognition. When you hear a word in a noisy room, your brain must infer what was said from ambiguous acoustic signals. When you plan a route through a city, you reason about which paths are likely to be fastest. When you learn a new concept, you generalize from a few examples to make predictions about new cases.

The goal of these notes is to build up the key ideas of probability theory with an emphasis on intuition rather than formal rigor. We want you to understand why these concepts matter for cognitive science, not to prove measure-theoretic foundations.

By the end of this chapter you should be comfortable with the following ideas:

  • Random variables: What they are and how they assign probabilities to outcomes
  • Joint distributions: Probabilities over combinations of multiple variables
  • Conditional distributions: How probabilities change when we have partial information
  • Marginal distributions: Summing out variables we don't care about
  • Bayes' rule: The fundamental equation for updating beliefs with evidence
  • Continuous probability: How to handle real-valued quantities with probability density functions
  • The Gaussian distribution: The most important continuous distribution in science

Let's get started.

Random Variables and Probability Distributions

A random variable is a variable whose value is drawn from a set of possible outcomes. Each outcome has an associated probability, and all the probabilities must sum to 1.

Consider a random variable W that describes the weather on a given day. There are three possible outcomes: sunny, cloudy, or stormy. We assign probabilities to each:

Weather   P(W=weather)
sunny     0.7
cloudy    0.2
stormy    0.1

This is called a probability distribution—it tells us how probability is distributed across the possible outcomes. Notice that the probabilities sum to 1: 0.7+0.2+0.1=1.0.

Notation

The expression P(W=sunny) means "the probability that the random variable W takes the value sunny." We often abbreviate this to just P(sunny) when the random variable is clear from context.

In Python, we can represent this distribution as a simple function:

```python
def P_W(weather):
    if weather == "sunny":
        return 0.7
    if weather == "cloudy":
        return 0.2
    if weather == "stormy":
        return 0.1
    return 0.0  # any other outcome has probability zero
```

We can verify our probabilities sum to 1:

```python
total = sum(P_W(w) for w in ["sunny", "cloudy", "stormy"])
print(total)  # 0.9999999999999999 (1.0 up to floating-point rounding)
```

Joint Distributions

A joint distribution assigns probabilities to combinations of two or more random variables. Instead of asking "what's the probability of sunny weather?" we can ask "what's the probability of sunny weather and heavy traffic?"

Let's introduce a second random variable T for whether there is traffic (yes or no). The joint distribution P(W,T) assigns probabilities to every combination of weather and traffic:

Weather   Traffic   P(W,T)
sunny     yes       0.1
cloudy    yes       0.1
stormy    yes       0.1
sunny     no        0.6
cloudy    no        0.1
stormy    no        0.0

Notice a few things:

  1. All probabilities still sum to 1: 0.1+0.1+0.1+0.6+0.1+0.0=1.0
  2. Some combinations are more likely than others—sunny with no traffic (0.6) is the most common
  3. Some combinations may have probability 0—here, stormy weather with no traffic never occurs

Why joint distributions matter

Joint distributions capture the relationships between variables. In this example, weather and traffic are related: stormy weather always comes with traffic (probability 1.0 given stormy), while sunny weather usually means no traffic. Understanding these relationships is crucial for reasoning about the world.

We can represent this in Python using a dictionary:

```python
def P_WT(weather, traffic):
    states = {
        ("sunny", "yes"):  0.1,
        ("cloudy", "yes"): 0.1,
        ("stormy", "yes"): 0.1,
        ("sunny", "no"):   0.6,
        ("cloudy", "no"):  0.1,
        ("stormy", "no"):  0.0,
    }
    return states[(weather, traffic)]
```

Conditional Distributions

What if we know something about one variable and want to reason about another? For example, if I look outside and see that it's stormy, what's the probability there's traffic?

This is answered by a conditional distribution: the distribution of one variable given (or conditioned on) another. We write this as P(T|W)—the probability of traffic given the weather.

The key insight is the product rule:

P(W,T) = P(T|W) P(W)

This says: the probability of a specific weather-traffic combination equals the probability of that weather times the probability of that traffic given that weather.

Rearranging, we get a formula for conditional probability:

P(T|W) = P(W,T) / P(W)

Intuition for the formula

Think of it this way: P(T|W) is the probability of T in the "world" where W is fixed. We're zooming in on just those outcomes where W occurs, and asking what fraction of them also have T. The denominator P(W) renormalizes so probabilities sum to 1 in this restricted world.

Let's compute the conditional distribution P(T|W) for our weather/traffic example:

```python
def P_T_given_W(traffic, weather):
    return P_WT(weather, traffic) / P_W(weather)

# Check the probabilities
for weather in ["sunny", "cloudy", "stormy"]:
    for traffic in ["yes", "no"]:
        prob = P_T_given_W(traffic, weather)
        print(f"P(T={traffic} | W={weather}) = {prob:.3f}")
```

This gives us:

Weather   Traffic   P(T|W)
sunny     yes       0.143
cloudy    yes       0.500
stormy    yes       1.000
sunny     no        0.857
cloudy    no        0.500
stormy    no        0.000

Now we can answer our original question: given stormy weather, the probability of traffic is 1.0 (100%). Conversely, given sunny weather, traffic is unlikely (only 14.3%).

Notice that for each weather condition, the probabilities of traffic=yes and traffic=no sum to 1. This is because P(T|W) is a proper probability distribution over T for each fixed value of W.
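We can verify this normalization directly. The snippet below recomputes each conditional row from the joint and marginal tables (values inlined here so it stands on its own) and confirms that, for every fixed weather value, the two traffic probabilities sum to 1:

```python
# P(T|W) recomputed from the tables above via P(T|W) = P(W,T) / P(W)
joint = {
    ("sunny", "yes"): 0.1, ("sunny", "no"): 0.6,
    ("cloudy", "yes"): 0.1, ("cloudy", "no"): 0.1,
    ("stormy", "yes"): 0.1, ("stormy", "no"): 0.0,
}
marginal_W = {"sunny": 0.7, "cloudy": 0.2, "stormy": 0.1}

for w in ["sunny", "cloudy", "stormy"]:
    total = sum(joint[(w, t)] / marginal_W[w] for t in ["yes", "no"])
    print(f"W={w}: conditional probabilities sum to {total:.3f}")
```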

Marginal Distributions

Sometimes we have a joint distribution but only care about one of the variables. For example, suppose we want to know the overall probability of traffic on any given day, regardless of weather. This is called a marginal distribution.

We obtain a marginal distribution by marginalizing out (or summing out) the variable we don't care about:

P(T) = Σ_weather P(W=weather, T)

Expanding this sum:

P(T) = P(W=sunny, T) + P(W=cloudy, T) + P(W=stormy, T)

For traffic=yes: P(T=yes)=0.1+0.1+0.1=0.3

For traffic=no: P(T=no)=0.6+0.1+0.0=0.7

So the marginal distribution of traffic is:

Traffic   P(T)
yes       0.3
no        0.7

In Python:

```python
def P_T(traffic):
    return sum(P_WT(weather, traffic)
               for weather in ["sunny", "cloudy", "stormy"])
```
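As a quick check that this matches the hand computation above, we can assemble the marginals directly from the table values (rounding to absorb floating-point error):

```python
# Marginals assembled directly from the joint table values
p_yes = 0.1 + 0.1 + 0.1  # sunny/yes + cloudy/yes + stormy/yes
p_no = 0.6 + 0.1 + 0.0   # sunny/no + cloudy/no + stormy/no

print(round(p_yes, 10))         # 0.3
print(round(p_no, 10))          # 0.7
print(round(p_yes + p_no, 10))  # 1.0: the marginal is a proper distribution
```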

Why "marginal"?

The name comes from a historical convention of writing joint distributions as tables and computing the single-variable distributions by summing rows or columns—these sums were written in the margins of the table.

When to Add vs. Multiply Probabilities

A common source of confusion is knowing when to add probabilities and when to multiply them. Here's the rule:

Add probabilities when you want "this OR that" for mutually exclusive (non-overlapping) events:

P(cloudy or stormy)=P(cloudy)+P(stormy)=0.2+0.1=0.3

Multiply probabilities when you want "this AND that" for independent events:

P(A and B) = P(A) × P(B)    (only if A and B are independent)

Independence

Two events are independent if knowing one tells you nothing about the other. Formally, A and B are independent if P(A|B)=P(A). In our weather/traffic example, weather and traffic are not independent—knowing the weather changes our beliefs about traffic.

For events that are not independent, we use the product rule instead:

P(A and B)=P(A)×P(B|A)
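We can watch the independence test fail for the weather/traffic example directly: if W and T were independent, P(T=yes|W=sunny) would equal P(T=yes), but the numbers from the earlier tables (inlined here) disagree:

```python
# Numbers taken from the weather/traffic tables above
p_traffic = 0.1 + 0.1 + 0.1        # marginal P(T=yes)
p_traffic_given_sunny = 0.1 / 0.7  # P(T=yes | W=sunny) = P(sunny,yes) / P(sunny)

print(f"P(T=yes)         = {p_traffic:.3f}")              # 0.300
print(f"P(T=yes | sunny) = {p_traffic_given_sunny:.3f}")  # 0.143
# The two differ, so weather and traffic are not independent.
```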

Bayes' Rule

Bayes' rule is perhaps the most important equation in probabilistic reasoning. It tells us how to update our beliefs when we observe new evidence.

Suppose we have a set of hypotheses h (possible explanations for the world) and we observe some data d (evidence). Bayes' rule tells us how to compute the probability of each hypothesis given the data:

P(h|d) = P(d|h) P(h) / P(d)

Let's break down each term:

  • P(h) — the prior: our belief in hypothesis h before seeing the data
  • P(d|h) — the likelihood: how likely we'd see this data if hypothesis h were true
  • P(d) — the marginal likelihood or evidence: the overall probability of the data (summing over all hypotheses)
  • P(h|d) — the posterior: our updated belief in hypothesis h after seeing the data

The core insight

Bayes' rule says that our posterior belief is proportional to prior belief times likelihood. Hypotheses that were already plausible (high prior) and that predict the observed data well (high likelihood) become more probable after seeing the data.

Example: Medical Diagnosis

Let's work through a classic example. A patient coughs. What's the probability they have lung cancer?

Hypotheses:

  1. h1: The patient is healthy
  2. h2: The patient has a cold
  3. h3: The patient has lung cancer

Prior probabilities (from general population statistics):

  • P(h1)=0.90 — most people are healthy
  • P(h2)=0.09 — colds are somewhat common
  • P(h3)=0.01 — lung cancer is rare

Likelihoods (probability of coughing given each condition):

  • P(cough|h1)=0.01 — healthy people rarely cough
  • P(cough|h2)=0.50 — people with colds often cough
  • P(cough|h3)=0.99 — lung cancer patients almost always cough

Now we apply Bayes' rule. First, compute the numerator (likelihood × prior) for each hypothesis:

  • P(cough|h1)P(h1)=0.01×0.90=0.009
  • P(cough|h2)P(h2)=0.50×0.09=0.045
  • P(cough|h3)P(h3)=0.99×0.01=0.0099

Next, compute the marginal probability of the data by summing these:

P(cough)=0.009+0.045+0.0099=0.0639

Finally, divide to get the posteriors:

  • P(h1|cough) = 0.009/0.0639 ≈ 0.14 (14% chance healthy)
  • P(h2|cough) = 0.045/0.0639 ≈ 0.70 (70% chance cold)
  • P(h3|cough) = 0.0099/0.0639 ≈ 0.15 (15% chance lung cancer)

The most likely explanation is a cold (70%), not lung cancer (15%)—even though lung cancer patients almost always cough. Why? Because colds are much more common than lung cancer. The prior matters.

Base rate neglect

Many people intuitively think that if lung cancer causes coughing 99% of the time, then coughing must strongly indicate lung cancer. This is called base rate neglect—ignoring how rare lung cancer is in the first place. Bayes' rule correctly accounts for both the likelihood and the prior.
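One way to feel the pull of the base rate is to rerun the example with different priors for lung cancer while keeping the likelihoods fixed. In the sketch below (my own illustration, not part of the original example), the leftover prior mass is split 10:1 between healthy and cold, matching the original numbers:

```python
# P(cough | healthy), P(cough | cold), P(cough | cancer)
likelihood = [0.01, 0.50, 0.99]

for cancer_prior in [0.01, 0.10, 0.50]:
    rest = 1.0 - cancer_prior
    prior = [rest * 10 / 11, rest * 1 / 11, cancer_prior]  # healthy : cold = 10 : 1
    unnorm = [lik * p for lik, p in zip(likelihood, prior)]
    p_cancer = unnorm[2] / sum(unnorm)
    print(f"prior P(cancer) = {cancer_prior:.2f}  ->  "
          f"posterior P(cancer | cough) = {p_cancer:.2f}")
```

With the original 1% base rate the posterior stays modest (about 0.15), but raise the prior and the very same cough becomes strong evidence for cancer.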

Implementing Bayes' Rule in Python

We can implement Bayes' rule with just a few lines of code:

```python
import numpy as np

def normalize(x):
    """Normalize an array to sum to 1."""
    return x / np.sum(x)

def posterior(prior, likelihood):
    """Compute the posterior using Bayes' rule."""
    unnormalized = likelihood * prior
    return normalize(unnormalized)
```

Using our medical example:

```python
prior = np.array([0.90, 0.09, 0.01])
likelihood = np.array([0.01, 0.50, 0.99])

post = posterior(prior, likelihood)
print(f"P(Healthy | Cough) = {post[0]:.3f}")
print(f"P(Cold | Cough) = {post[1]:.3f}")
print(f"P(LungCancer | Cough) = {post[2]:.3f}")
```

Inverting Conditional Probabilities

Bayes' rule is particularly useful when we know P(d|h) but want P(h|d). In our weather example, we computed P(T|W)—the probability of traffic given weather. But suppose you're in a windowless office and a friend texts you that there's no traffic. Now you want to infer the weather: P(W|T).

Bayes' rule lets us "invert" the conditional:

```python
hypotheses = ["sunny", "cloudy", "stormy"]

# We observe: no traffic
traffic = "no"

# Prior: P(W)
prior = np.array([P_W(w) for w in hypotheses])

# Likelihood: P(T=no | W)
likelihood = np.array([P_T_given_W(traffic, w) for w in hypotheses])

# Posterior: P(W | T=no)
post = posterior(prior, likelihood)

for i, weather in enumerate(hypotheses):
    print(f"P(W={weather} | no traffic) = {post[i]:.3f}")
```

This gives us:

  • P(sunny|no traffic) ≈ 0.857
  • P(cloudy|no traffic) ≈ 0.143
  • P(stormy|no traffic) = 0.0

Given no traffic, it's almost certainly sunny.

Continuous Probability

So far we've dealt with discrete random variables—variables that take values from a finite set (like weather = {sunny, cloudy, stormy}). But many quantities we care about are continuous—they can take any real value. Examples include:

  • The exact time someone takes to respond in an experiment
  • The angle of a robot's arm
  • The intensity of a pixel in an image

From Discrete to Continuous

Consider a discrete random variable X representing a robot's position. If the robot can only be in 6 states, each state carries a sizable chunk of probability, on the order of 1/6:

State   P(X)
1       1/6
2       1/12
3       1/6
4       1/3
5       1/12
6       1/6

Now imagine we measure position more precisely, distinguishing states 1a and 1b within what was state 1. The probability mass that was assigned to state 1 must now be split between 1a and 1b. Each individual state has less probability.

As we make our measurements more and more precise—dividing states into finer and finer subdivisions—each individual state's probability shrinks. In the limit of infinitely precise measurement (continuous position), the probability of any exact value becomes zero.

Key insight

As the number of possible outcomes increases toward infinity, the probability of any single outcome shrinks toward zero.

This seems like a problem: if P(X=3.14159...)=0 for any specific value, how can we do probability at all?
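To see the shrinkage concretely, here is a small sketch assuming a position spread uniformly over the interval [0, 1]: as we slice the interval into more bins, each bin's probability heads toward 0, yet the probability per unit length stays fixed. That stable ratio is what the next section introduces as a density.

```python
# Position uniformly distributed over the interval [0, 1]
length = 1.0
for n_bins in [6, 60, 600, 6000]:
    p_bin = 1.0 / n_bins         # probability of landing in any one bin
    bin_width = length / n_bins
    density = p_bin / bin_width  # probability per unit length
    print(f"{n_bins:5d} bins: P(one bin) = {p_bin:.6f}, density = {density:.1f}")
```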

Probability Density Functions

The solution is to talk about the probability that X falls within a range of values rather than at an exact value. We use a probability density function (PDF) instead of a probability mass function.

The PDF f(x) doesn't give probabilities directly. Instead, the probability that X falls between values a and b is given by the area under the curve:

P(a ≤ X ≤ b) = ∫_a^b f(x) dx

This integral computes the area under the PDF curve between a and b.

Density vs. probability

The height of a PDF at a point is not a probability—it's a density. You need to integrate (compute area) to get an actual probability. This is why PDF values can be greater than 1, as long as the total area under the curve equals 1. For example, a uniform density on the interval [0, 0.5] has height 2 everywhere on that interval, yet its total area is 2 × 0.5 = 1.

The Gaussian (Normal) Distribution

The most important continuous distribution in science is the Gaussian or normal distribution. It's the famous bell curve:

f(x) = (1 / (√(2π) σ)) exp{ -(x - μ)² / (2σ²) }

The distribution has two parameters:

  • μ (mu): the mean, the center of the bell curve
  • σ (sigma): the standard deviation, which controls the width

When μ=0 and σ=1, we have the standard normal distribution.

In Python:

```python
import numpy as np

def gaussian(x, mu, sigma):
    """Evaluate the Gaussian PDF at x."""
    return (np.exp(-((x - mu)**2) / (2 * sigma**2)) /
            (np.sqrt(2 * np.pi) * sigma))
```
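A quick sanity check on this formula: at x = μ the exponential term is exp(0) = 1, so the density at the peak is simply 1/(σ√(2π)). For the standard normal that is about 0.399; shrink σ to 0.1 and the peak rises to about 3.989, above 1, which is fine because, as noted earlier, densities are not probabilities:

```python
import numpy as np

# Peak height of the Gaussian PDF: 1 / (sqrt(2 * pi) * sigma)
peak_std = 1 / (np.sqrt(2 * np.pi) * 1.0)     # standard normal, sigma = 1
peak_narrow = 1 / (np.sqrt(2 * np.pi) * 0.1)  # narrow Gaussian, sigma = 0.1

print(f"{peak_std:.3f}")     # 0.399
print(f"{peak_narrow:.3f}")  # 3.989
```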

Computing Probabilities from PDFs

To find the probability that a normally distributed variable falls between 0 and 1, we integrate the Gaussian PDF over that range:

P(0 ≤ X ≤ 1) = ∫_0^1 (1/√(2π)) e^(-x²/2) dx

This integral doesn't have a simple closed form, but we can compute it numerically using the cumulative distribution function (CDF). The CDF F(x) gives the probability that Xx:

F(x) = P(X ≤ x) = ∫_{-∞}^{x} f(t) dt

To find P(a ≤ X ≤ b), we compute F(b) - F(a).

In Python:

```python
from scipy.stats import norm

# P(0 <= X <= 1) for a standard normal
prob = norm.cdf(1) - norm.cdf(0)
print(f"P(0 <= X <= 1) = {prob:.3f}")  # ≈ 0.341
```

The 68-95-99.7 rule

For a normal distribution:

  • About 68% of values fall within 1 standard deviation of the mean
  • About 95% fall within 2 standard deviations
  • About 99.7% fall within 3 standard deviations

This is useful for quick mental estimates.
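The rule is easy to check numerically with the same CDF used above:

```python
from scipy.stats import norm

# P(mu - k*sigma <= X <= mu + k*sigma) for a standard normal
for k in [1, 2, 3]:
    p = norm.cdf(k) - norm.cdf(-k)
    print(f"within {k} standard deviation(s): {p:.4f}")  # 0.6827, 0.9545, 0.9973
```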

Why Gaussians Are Everywhere

The Gaussian distribution shows up constantly in science for a deep reason: the Central Limit Theorem. This theorem says that when you add up many independent random variables—regardless of their individual distributions—the sum tends toward a Gaussian.

Many natural quantities result from the combination of many small, independent factors:

  • Your height is influenced by thousands of genes, nutrition, environment, etc.
  • Measurement error results from many small sources of noise
  • Neural firing rates result from inputs from many neurons

When many independent factors combine additively, the result is approximately Gaussian.
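A small simulation makes this vivid. Assume each "factor" is a uniform(0, 1) draw (flat, nothing bell-shaped about it): sum 100 of them, and the totals behave like a Gaussian with exactly the mean and standard deviation the theorem predicts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Each row: 100 independent uniform(0, 1) factors; sum each row
sums = rng.uniform(0, 1, size=(50_000, 100)).sum(axis=1)

# Theory: mean = 100 * 0.5 = 50, sd = sqrt(100 / 12) ≈ 2.89
print(f"mean ≈ {sums.mean():.2f}, sd ≈ {sums.std():.2f}")

# A Gaussian puts about 68% of its mass within 1 sd of the mean
frac = np.mean(np.abs(sums - sums.mean()) < sums.std())
print(f"fraction within 1 sd: {frac:.3f}")
```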

Putting It All Together

Let's summarize the key concepts:

Random variables assign probabilities to outcomes. Joint distributions give probabilities for combinations of variables. Conditional distributions tell us how one variable depends on another. Marginal distributions are obtained by summing out variables we don't need.

Bayes' rule is the fundamental equation for updating beliefs:

P(h|d) = P(d|h) P(h) / P(d)

It tells us how to combine prior beliefs with evidence to form posterior beliefs.

For continuous variables, we use probability density functions. The Gaussian distribution is particularly important due to the Central Limit Theorem.

These concepts form the foundation for probabilistic models of cognition. Whether we're modeling how people learn categories, make decisions under uncertainty, or perceive ambiguous stimuli, probability theory provides the mathematical framework.

Learn More

If you'd like to go deeper into probability theory, here are some excellent resources:

  • 3Blue1Brown: Probability — Visual explanations of probability concepts including Bayes' theorem.
  • Seeing Theory — A beautiful interactive introduction to probability and statistics.
  • Khan Academy: Probability — Comprehensive video lessons on probability fundamentals.
  • StatQuest: Probability — Clear, simple explanations of probability and statistics concepts.
  • MacKay, D. J. C. (2003). Information Theory, Inference, and Learning Algorithms. Cambridge University Press. Available free at inference.org.uk/mackay/itila/. Chapters 2–3 cover probability fundamentals beautifully.