Math for Neural Networks DRAFT
Note
This chapter was authored by Todd M. Gureckis. See the license information for this book. Because this is a draft chapter, it should not be considered complete or fully accurate!
Introduction
Neural networks are one of the most exciting and impactful ideas in cognitive science and artificial intelligence. However, understanding how they work requires some mathematical background that might not be shared by all students.
The goal of these notes is to build up the key ideas you need with an emphasis on intuition rather than formal rigor. We want you to understand why these math concepts matter for neural networks.
The chapter follows a natural progression: we start with calculus fundamentals—derivatives—that let us measure how changing inputs affects outputs. We then introduce the perceptron (the simplest neural network) and show how to measure its errors with a loss function. With a concrete model in hand, we extend derivatives to gradients (for functions with many inputs) and introduce gradient descent, the core optimization strategy for training. Along the way, we'll see why single-layer networks have fundamental limits and how multi-layer networks overcome them. The chain rule and backpropagation show how gradients flow backward through layers of computation. Finally, linear algebra (vectors and matrices) reveals how all of this scales efficiently to large networks.
Let's get started.
Functions and Derivatives
A function is simply a rule that takes an input and produces an output. For example, the function $f(x) = x^2$ takes a number and returns its square: an input of 3 produces an output of 9.
Explore different functions by selecting them from the dropdown. Notice how each function maps inputs to outputs in a different way.
In the context of neural networks, functions are everywhere. Each layer of a neural network applies a function to its inputs to produce outputs. The whole network is one big function that maps inputs (like pixel values of an image) to outputs (like the probability that the image contains a cat).
A derivative tells you how much the output of a function changes when you make a tiny change to the input. More precisely, the derivative of $f$ at a point $x$, written $f'(x)$, is the slope of the function at that point: the ratio of the change in output to a vanishingly small change in input.
Derivative intuition
Imagine you are hiking on a hill and your altitude is given by the function $f(x)$, where $x$ is your position along the trail. The derivative $f'(x)$ tells you how steep the hill is where you are standing: positive means you are climbing, negative means you are descending, and zero means the ground is flat.
For the function $f(x) = x^2$, the derivative is $f'(x) = 2x$:

- At $x = 2$, the slope is $4$—the function is increasing steeply.
- At $x = 0$, the slope is $0$—the function is flat at its minimum.
- At $x = -1$, the slope is $-2$—the function is decreasing.
You can explore this intuition below. Drag your mouse over the curve to see the slope at that point. The formula for the derivative is also displayed, showing how it is computed.
Interactive derivative explorer. Drag your mouse over the curve to see the slope (derivative) at each point. The tangent line shows the direction the function is heading.
Why do derivatives matter for neural networks? Because training a neural network means finding the settings (called weights) that make the network perform well. To find good weights, we need to know how changing a weight affects the network's performance. That's exactly what a derivative tells us.
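To make this concrete, here is a minimal Python sketch (not from the chapter) that estimates a derivative numerically by nudging the input and measuring how the output changes, then compares it to the exact slope $2x$ for $f(x) = x^2$:

```python
def f(x):
    return x ** 2

def numerical_derivative(f, x, eps=1e-6):
    # Nudge the input a tiny amount in both directions and
    # measure how much the output changes (central difference).
    return (f(x + eps) - f(x - eps)) / (2 * eps)

# The exact derivative of x**2 is 2x, so at x = 3 the slope should be 6.
slope = numerical_derivative(f, 3.0)
print(round(slope, 4))  # 6.0
```

This "nudge and measure" idea is exactly what a derivative formalizes, and it is a handy way to sanity-check any gradient computation.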
Let's make this concrete by introducing the simplest neural network.
The Perceptron
The perceptron is the simplest possible neural network. It has some inputs, a set of weights, a bias, and produces a single output.
Here's what it looks like with two inputs:
A perceptron with two inputs ($x_1$ and $x_2$), a weight for each input, a bias, and a single output $\hat{y}$.
The perceptron computes its output in two steps:
1. Weighted sum: Multiply each input by its corresponding weight, add them up, and add the bias:

$$z = w_1 x_1 + w_2 x_2 + b$$

2. Activation function: Pass this sum through an activation function $\sigma$ to produce the output:

$$\hat{y} = \sigma(z)$$

The activation function we'll use is the sigmoid (also called the logistic function):

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
Why use an activation function?
The sigmoid squashes any input into a value between 0 and 1. This is useful for several reasons:
- It lets us interpret the output as a probability
- It introduces nonlinearity, which is essential for learning complex patterns
- It's differentiable everywhere, which we need for gradient descent
The sigmoid isn't the only activation function. The original perceptron used a simple threshold (or step) function: output 1 if the weighted sum exceeds a threshold, otherwise output 0. Other common choices include ReLU (rectified linear unit) and tanh:
Common activation functions used in neural networks. Each has different properties: sigmoid outputs values between 0 and 1, tanh outputs between -1 and 1, and ReLU outputs zero for negative inputs and the input itself for positive inputs.
You can explore how a perceptron with a threshold activation function works in the interactive Neuron Sandbox.
Here's a summary of the notation:
| Symbol | Meaning |
|---|---|
| $x_1, x_2$ | Input values |
| $w_1, w_2$ | Weights (what we learn) |
| $b$ | Bias (also learned) |
| $\sigma$ | Activation function (sigmoid) |
| $\hat{y}$ | Predicted output |
| $y$ | Target output (what we want) |
So the full computation is:

$$\hat{y} = \sigma(w_1 x_1 + w_2 x_2 + b)$$

This single equation describes everything the perceptron does. Given inputs $x_1$ and $x_2$, it produces a prediction $\hat{y}$.

The goal of training is to find values for $w_1$, $w_2$, and $b$ that make the perceptron's predictions match the targets.
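The two-step computation can be sketched in plain Python. The weight values here are arbitrary placeholders, not trained values:

```python
import math

def sigmoid(z):
    # Squashes any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(x1, x2, w1, w2, b):
    z = w1 * x1 + w2 * x2 + b   # step 1: weighted sum plus bias
    return sigmoid(z)           # step 2: activation function

# Arbitrary example weights -- untrained, just to show the computation.
y_hat = perceptron(1.0, 0.0, w1=0.5, w2=0.5, b=-0.2)
print(round(y_hat, 3))  # 0.574
```

Whatever the inputs, the output always lands strictly between 0 and 1 because of the sigmoid.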
Computing the Error (or Loss) of a Perceptron Network
Now that we have a perceptron, we need a way to measure how wrong its predictions are. This measurement is called the error or loss.
Let's train our perceptron to compute the logical OR function—the same example from the lecture. OR should output 1 if either (or both) inputs are 1, and 0 only when both inputs are 0:
| $x_1$ | $x_2$ | Target output ($y$) |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 1 |
When we feed each input pattern through the perceptron, it computes a prediction $\hat{y}$ for that pattern.
Early in training, with random weights, the predictions will be wrong:
| $x_1$ | $x_2$ | Target ($y$) | Prediction ($\hat{y}$) | Error ($y - \hat{y}$) | Squared error |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0.3 | -0.3 | 0.09 |
| 0 | 1 | 1 | 0.4 | 0.6 | 0.36 |
| 1 | 0 | 1 | 0.5 | 0.5 | 0.25 |
| 1 | 1 | 1 | 0.6 | 0.4 | 0.16 |
Notice that we square the errors. This is common practice for two reasons: (1) it makes all errors positive, so they don't cancel out, and (2) it penalizes large errors more heavily than small ones.
The total error is the sum of the individual squared errors across all patterns:

$$E = \sum_{\text{patterns}} (y - \hat{y})^2$$

For our example:

$$E = 0.09 + 0.36 + 0.25 + 0.16 = 0.86$$
This single number summarizes how well (or poorly) the network is doing across all four training patterns. When we train the network, we're trying to adjust the weights to make this total error as small as possible.
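Here is a quick check of that arithmetic in Python, using the targets and early-training predictions from the table above:

```python
targets     = [0, 1, 1, 1]            # OR truth table outputs
predictions = [0.3, 0.4, 0.5, 0.6]    # early-training predictions from the table

# Sum of squared errors across all four patterns.
total_error = sum((y - y_hat) ** 2 for y, y_hat in zip(targets, predictions))
print(round(total_error, 2))  # 0.86
```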
Why sum all the errors?
You might wonder why we add up errors across all patterns rather than handling them one at a time. The sum gives us a single objective to optimize—one number that captures overall performance for all the patterns.
Partial Derivatives and Gradients
So far, we've talked about functions and derivatives of a single variable. But neural networks have many parameters. The perceptron example has 3 ($w_1$, $w_2$, and $b$), and modern networks have millions or even billions. We need a way to describe how a function changes with respect to each of its many inputs.
Consider a function of two variables, for example $f(x, y) = x^2 + y^2$. A partial derivative measures how the output changes when we vary one input while holding the others fixed.

We write partial derivatives using the symbol $\partial$ (a "curly d") instead of $d$.

For our example:

- $\frac{\partial f}{\partial x} = 2x$ — how $f$ changes as $x$ varies (holding $y$ fixed)
- $\frac{\partial f}{\partial y} = 2y$ — how $f$ changes as $y$ varies (holding $x$ fixed)

The gradient is the vector of all partial derivatives:

$$\nabla f = \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right)$$

The symbol $\nabla$ is called "nabla" (or "del"). The gradient packages all the partial derivatives into a single vector.

For our example, $\nabla f = (2x, 2y)$.
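Assuming the bowl-shaped example function $f(x, y) = x^2 + y^2$, each partial derivative can be estimated numerically by nudging one variable while holding the other fixed, just as in the single-variable case:

```python
def f(x, y):
    return x ** 2 + y ** 2   # a bowl-shaped surface

def gradient(f, x, y, eps=1e-6):
    # Each partial derivative nudges ONE variable, holding the other fixed.
    df_dx = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
    df_dy = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)
    return (df_dx, df_dy)

# The exact gradient is (2x, 2y), so at (1, 3) we expect (2, 6).
df_dx, df_dy = gradient(f, 1.0, 3.0)
print(round(df_dx, 4), round(df_dy, 4))  # 2.0 6.0
```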
Geometric intuition
Think of $f(x, y)$ as the height of a landscape at each point $(x, y)$. The gradient at a point is an arrow that points in the direction of steepest ascent (straight uphill), and its length tells you how steep the climb is. To descend as quickly as possible, walk in the direction opposite the gradient.
Try exploring the gradient demo below. Drag your mouse over the contour map to see the gradient vector (red arrow) pointing uphill. The panel on the left shows the function in 3D; the little arrows at the bottom allow you to rotate the 3D view. In the panel on the right, you can move your mouse to see the gradient at each $(x, y)$ point.
Try dragging around the circular contours in the bowl shape—notice how the gradient always points outward, up the sides of the bowl. Then try the other shapes and check your intuition about which direction the gradient should point.
Interactive gradient explorer. The left panel shows a 3D surface; the right panel shows contour lines (like a topographic map). Drag your mouse to see the gradient vector (red arrow) at each point—it always points uphill.
In a neural network, the "landscape" is the loss function—a measure of how badly the network is performing (like the squared error we just defined). The inputs to this landscape are all the weights of the network. The gradient tells us, for each weight, which direction to adjust it to reduce the loss.
Gradient Descent
Now we can put the pieces together. Gradient descent is an optimization algorithm used to train many kinds of models, including neural networks. The idea is simple:
- Compute the gradient of the loss function with respect to all the weights (parameters) of the model.
- Update each weight by taking a small step in the direction that reduces the loss.
Since the gradient points in the direction of steepest ascent, we move in the opposite direction—steepest descent—to reduce the loss. The update rule is:

$$w \leftarrow w - \alpha \frac{\partial E}{\partial w}$$

Here:

- $w$ represents one of the weights of the network
- $E$ is the error (or loss) function
- $\alpha$ (the Greek letter "alpha") is the learning rate—a small positive number that controls the step size
- $\frac{\partial E}{\partial w}$ is the partial derivative of the error with respect to weight $w$
So this is the "learning rule" for updating the weights. It says that we update each weight by taking a small step in the direction that reduces the error most steeply. Conceptually, this is not so different from a naive approach where you "wiggle" each weight a little bit and see if the error goes up or down. The gradient just tells us the answer mathematically—which direction to wiggle and by how much—without having to actually try both directions.
Since we update every weight in the network using this same rule, we can write all the updates at once using vector notation. If $\mathbf{w}$ is the vector of all the weights in the network, then:

$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla E$$

Here $\nabla E$ is the gradient of the error with respect to all the weights at once.
The learning rate
The learning rate $\alpha$ controls how big each step is. If it's too small, training is very slow; if it's too large, the steps can overshoot the minimum, making the loss bounce around or even diverge. Choosing a good learning rate is an important practical skill in training neural networks.
You can think of gradient descent as placing a ball on a hilly surface and letting it roll downhill. At each step, the ball moves in the steepest downhill direction. Eventually, it settles in a valley—a point where the loss is low.
Gradient descent is iterative: you repeat the process many times, each time nudging the weights a little bit to reduce the loss. Over many iterations, the network's performance improves and the loss (usually) decreases.
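The iterative loop can be sketched in a few lines of Python. This sketch assumes the bowl-shaped function $f(x, y) = x^2 + y^2$ with gradient $(2x, 2y)$; the starting point and learning rate are arbitrary choices:

```python
x, y = 4.0, -3.0          # arbitrary starting point on the surface
alpha = 0.1               # learning rate (arbitrary choice)

for step in range(100):
    grad_x, grad_y = 2 * x, 2 * y   # gradient of x**2 + y**2 at (x, y)
    x -= alpha * grad_x             # step opposite the gradient (downhill)
    y -= alpha * grad_y

print(round(x, 6), round(y, 6))  # both end up very close to 0, the minimum
```

Each pass through the loop shrinks the coordinates toward the bottom of the bowl; after 100 steps they are essentially at the minimum $(0, 0)$.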
Try the interactive demo below to see gradient descent in action. Click Step to advance through each phase: computing the gradient, computing the update direction, applying the step, and checking the new loss.
Interactive gradient descent explorer. Click Step to advance through each phase: computing the gradient (red arrow), computing the update direction (green arrow), applying the step, and checking the new loss. The learning rate controls how far each step moves.
Gradient Descent in Action: Learning OR
Let's see gradient descent applied to our perceptron learning the OR function. Try the interactive demo below to watch a perceptron learn the OR function. Press Play to run gradient descent automatically, or use Step to advance one epoch at a time. Notice how the predictions start far from the targets, but after many epochs the predictions get closer and closer to the correct values as the weights adjust and the loss decreases.
As training progresses, the weights $w_1$, $w_2$ and the bias $b$ are adjusted and the predictions gradually move toward the targets. Early in training, the table might look like this:
| x₁ | x₂ | Target | Prediction | Error² |
|---|---|---|---|---|
| 0 | 0 | 0 | 0.37 | 0.139 |
| 0 | 1 | 1 | 0.31 | 0.475 |
| 1 | 0 | 1 | 0.21 | 0.619 |
| 1 | 1 | 1 | 0.17 | 0.688 |
A perceptron learning the OR function. Press Play to run gradient descent automatically, or Step to advance one epoch at a time. Watch how the predictions converge toward the target values as the weights adjust.
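A from-scratch sketch of what the demo is doing is shown below. The zero initialization, the learning rate of 0.5, and the 5000 epochs are arbitrary choices; the gradient formula is the sigmoid/squared-error rule derived in the backpropagation section later in the chapter:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# OR truth table: ((x1, x2), target) pairs.
patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

w1 = w2 = b = 0.0      # start with zero weights
alpha = 0.5            # learning rate (arbitrary choice)

for epoch in range(5000):
    for (x1, x2), y in patterns:
        y_hat = sigmoid(w1 * x1 + w2 * x2 + b)
        # Gradient of the squared error (y - y_hat)**2, via the chain rule.
        delta = -2 * (y - y_hat) * y_hat * (1 - y_hat)
        w1 -= alpha * delta * x1
        w2 -= alpha * delta * x2
        b  -= alpha * delta

# After training, predictions should be close to the OR targets.
for (x1, x2), y in patterns:
    y_hat = sigmoid(w1 * x1 + w2 * x2 + b)
    print(x1, x2, y, round(y_hat, 2))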
Neural networks as "differentiable computing"
Gradient descent only works for differentiable models—models where we can compute the gradient of the loss with respect to every parameter. This constraint is so fundamental that neural networks are sometimes described as differentiable computing: the art of building complex computations entirely from operations that have well-defined derivatives, so that gradients can flow through the entire system.
This is why neural networks use smooth activation functions (sigmoid, tanh, ReLU) rather than hard thresholds, and why operations like matrix multiplication are central—they're all differentiable. When part of a model involves a discrete choice or non-differentiable operation, we can't compute the gradient and standard gradient descent won't work. Networks with hard threshold activations or discrete sampling steps require special techniques to train.
Stochastic Gradient Descent (SGD)
Our OR example had only four input-output patterns, so computing the total error over all of them is trivial. But real neural networks are often trained on millions of examples—images, sentences, audio clips, or other data. In these cases, computing the gradient over all the training data at once can be very expensive. You'd need to process every single example just to take a single step. This is slow.
Stochastic Gradient Descent (SGD) addresses this by computing the gradient on a small random subset of the data called a mini-batch. Instead of using all training examples to estimate the gradient, you use, say, 32 or 64 examples at a time. The resulting gradient estimate is noisier (less precise), but it points in roughly the right direction—and you can compute it much faster.

The word stochastic means "random" or "involving chance." In SGD, the randomness comes from randomly selecting which training examples to include in each mini-batch. Each time you compute a gradient, you're using a different random sample of the data, which is why the gradient estimates vary from step to step.
The update rule is the same as before:

$$\mathbf{w} \leftarrow \mathbf{w} - \alpha \nabla E$$

The only difference is that the gradient $\nabla E$ is now estimated from a mini-batch rather than computed over the entire training set.
Interestingly, the noise in SGD can actually be helpful. The randomness can prevent the optimizer from getting stuck in shallow local minima—small dips in the loss landscape that aren't actually very good solutions. The noise gives the optimization process a kind of "jiggle" that helps it explore and find better solutions. There is also evidence that SGD tends to find solutions that generalize better to new data, partly because of this noise.
The demo below compares full-batch gradient descent (blue) with stochastic gradient descent (red). Both start from the same point and try to reach the minimum at the center. Notice how the full-batch path is smooth and direct, while SGD bounces around due to the noisy gradient estimates—yet both eventually reach the minimum.
Comparing full-batch gradient descent (blue) with stochastic gradient descent (red). Both start from the same point and descend toward the minimum. The full-batch path is smooth and direct, while SGD bounces around due to noisy gradient estimates—yet both reach the goal.
In modern deep learning, nearly all training uses some variant of SGD. More advanced optimizers like Adam or RMSProp build on the basic SGD idea by adapting the learning rate for each parameter, but the core principle is the same: estimate the gradient from a mini-batch, then take a step downhill.
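The mini-batch idea can be sketched on a toy problem: a single parameter $w$ whose loss is the average of $(w - d)^2$ over a synthetic dataset, so the minimum sits at the data mean. The dataset, batch size, and learning rate here are all arbitrary choices:

```python
import random

random.seed(0)
data = [random.gauss(5.0, 1.0) for _ in range(10000)]  # toy dataset, mean ~5

w = 0.0           # single parameter; loss = average of (w - d)**2 over data
alpha = 0.1
batch_size = 32

for step in range(500):
    batch = random.sample(data, batch_size)              # random mini-batch
    grad = sum(2 * (w - d) for d in batch) / batch_size  # noisy gradient estimate
    w -= alpha * grad                                    # same update rule as before

print(round(w, 2))  # w has moved close to 5.0, the minimizer of the average loss
```

Each step only touches 32 of the 10,000 examples, so it is cheap; the estimates are noisy, but on average they point toward the minimum.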
The Limits of Perceptrons
Our perceptron successfully learned the OR function. It can also learn AND and other simple logical functions. But there's a fundamental limit to what a single perceptron can learn.
Consider the XOR (exclusive or) function. XOR outputs 1 when exactly one of the inputs is 1, and 0 otherwise:
| $x_1$ | $x_2$ | XOR output ($y$) |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
If you try to train a single perceptron on XOR, it will fail. No matter how long you train or what learning rate you use, the error never goes to zero—the perceptron gets stuck outputting about 0.5 for every input.
Why? The answer becomes clear when we visualize the problem geometrically. Think of each input pattern as a point in 2D space, with $x_1$ on the horizontal axis and $x_2$ on the vertical axis.

For the OR function, the two classes (0s and 1s) can be separated by a single straight line. Problems like this are called linearly separable.
For OR (and AND), you can draw a single straight line that separates the 0s from the 1s. A perceptron can solve any linearly separable problem because its decision boundary is a straight line (or a hyperplane in higher dimensions).
But look at XOR:

For XOR, no single straight line can separate the 0s from the 1s—the classes are interleaved diagonally. XOR is not linearly separable.
There's no single straight line that can separate the 0s from the 1s! The two classes are interleaved diagonally. XOR is not linearly separable, and no single perceptron can learn it.
Adding Hidden Units
The solution is to add more neurons. Specifically, we add a hidden layer—neurons that sit between the inputs and the output:
A network with one hidden unit ($h$) added between the inputs and the output. The output unit receives both the original inputs and the hidden unit's activation.
This network has:
- Two input units ($x_1$ and $x_2$)
- One hidden unit ($h$) that computes its own weighted sum and applies a sigmoid
- One output unit ($\hat{y}$) that receives input from both the original inputs AND the hidden unit
The hidden unit can learn to detect a useful intermediate feature. In the case of XOR, the hidden unit essentially learns to detect when both inputs are 1—which is exactly the case where XOR should output 0 instead of 1.
Here's what the network learns:
| $x_1$ | $x_2$ | Hidden unit ($h$) | Output ($\hat{y}$) | Target ($y$) |
|---|---|---|---|---|
| 0 | 0 | ~0 | ~0.05 | 0 |
| 0 | 1 | ~0 | ~0.94 | 1 |
| 1 | 0 | ~0 | ~0.94 | 1 |
| 1 | 1 | ~1 | ~0.05 | 0 |
The hidden unit outputs ~1 only when both inputs are 1. The output unit then uses this information to produce the correct XOR output.
The power of hidden layers
By adding hidden units, we can solve problems that are impossible for a single perceptron. Each hidden unit can learn to detect a different feature or pattern in the input. With enough hidden units, neural networks can approximate any function—this is called the universal approximation theorem. This is why modern deep learning uses networks with many layers and millions of hidden units.
Multi-layer Perceptron (MLP)
A network with one or more hidden layers is called a multi-layer perceptron (MLP). Let's write out exactly what happens when we add a hidden layer, because understanding this structure is key to understanding how we train these networks.
A multi-layer perceptron (MLP) with two inputs, one hidden unit, and one output. The computation flows left to right in two stages.
In this network, the computation happens in stages. Each stage is a function, and the output of one function becomes the input to the next. This is called function composition.
Stage 1: Compute the hidden unit
The hidden unit $h$ computes a weighted sum of the inputs and passes it through the sigmoid:

$$h = \sigma(w_3 x_1 + w_4 x_2 + b_h)$$

Here $w_3$ and $w_4$ are the weights feeding the hidden unit, and $b_h$ is its bias.
Stage 2: Compute the output
The output unit receives the original inputs AND the hidden unit's output, computes another weighted sum, and applies the sigmoid:

$$\hat{y} = \sigma(w_0 x_1 + w_1 x_2 + w_2 h + b_y)$$

Here $w_0$ and $w_1$ weight the original inputs, $w_2$ weights the hidden unit's output, and $b_y$ is the output unit's bias.
Notice what's happening: the output $\hat{y}$ is a function of $h$, and $h$ is itself a function of the inputs. The second stage's input is the first stage's output.

This looks complicated, but it's just functions nested inside functions—like $f(g(x))$ from calculus. The output stage plays the role of the outer function, and the hidden stage plays the role of the inner function.
The Gradient Problem
For the output layer weights ($w_0$, $w_1$, $w_2$, and $b_y$), computing the gradient works just as it did for the single perceptron: each of these weights affects the output directly.

But what about the hidden layer weights ($w_3$ and $w_4$)? They affect the error only indirectly: changing $w_3$ changes $h$, which changes $\hat{y}$, which changes the error $E$.

How do we compute $\frac{\partial E}{\partial w_3}$ when $w_3$ is buried inside a nested function?
The answer lies in the chain rule from calculus, which tells us how to compute derivatives of composed functions.
The Chain Rule
The chain rule tells us how to differentiate composed functions. If $y = f(g(x))$, then:

$$\frac{dy}{dx} = f'(g(x)) \cdot g'(x)$$
In words: multiply the derivative of the outer function by the derivative of the inner function. We "chain" the derivatives together.
Worked example
Suppose $y = (x^2 + 1)^3$. This is a composition: the inner function is $g(x) = x^2 + 1$ and the outer function is $f(u) = u^3$.

Using the chain rule:

- The derivative of the outer function $f(u) = u^3$ with respect to $u$ is $3u^2$
- The derivative of the inner function $g(x) = x^2 + 1$ with respect to $x$ is $2x$
- By the chain rule: $\frac{dy}{dx} = 3(x^2 + 1)^2 \cdot 2x = 6x(x^2 + 1)^2$

Let's check at $x = 1$: the formula gives $6 \cdot 1 \cdot (1 + 1)^2 = 24$, which matches what you get by expanding $(x^2 + 1)^3$ and differentiating term by term.
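A chain-rule computation is easy to sanity-check numerically by comparing it against a finite-difference slope. In this sketch the composed function $(x^2+1)^3$ is an illustrative choice:

```python
def outer(u):
    return u ** 3          # f(u) = u^3

def inner(x):
    return x ** 2 + 1      # g(x) = x^2 + 1

def composed(x):
    return outer(inner(x))  # y = f(g(x))

def chain_rule_derivative(x):
    # f'(g(x)) * g'(x) = 3*(x^2 + 1)^2 * 2x
    return 3 * inner(x) ** 2 * 2 * x

# Compare against a finite-difference estimate of the slope at x = 1.5.
x, eps = 1.5, 1e-6
numeric = (composed(x + eps) - composed(x - eps)) / (2 * eps)
print(round(chain_rule_derivative(x), 3), round(numeric, 3))
```

The two numbers agree to several decimal places, confirming the chained derivatives.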
Applying the Chain Rule to Neural Networks
Now we can answer our question: how do we compute $\frac{\partial E}{\partial w_3}$, the gradient for a hidden layer weight?

We break it into steps, following the chain of dependencies:

- The error $E$ depends on the output $\hat{y}$
- The output $\hat{y}$ depends on the hidden unit $h$
- The hidden unit $h$ depends on the weight $w_3$

Using the chain rule:

$$\frac{\partial E}{\partial w_3} = \frac{\partial E}{\partial h} \cdot \frac{\partial h}{\partial w_3}$$

Each piece is something we can compute:

- $\frac{\partial h}{\partial w_3}$ is just like the single perceptron case—it involves the sigmoid derivative and the input $x_1$
- $\frac{\partial E}{\partial h}$ requires another application of the chain rule, going through the output layer
This process of computing gradients by working backward from the error, layer by layer, is called backpropagation. Let's work through it in detail.
The Backpropagation Algorithm
Backpropagation is just the chain rule applied systematically through the network. The "back" in backpropagation refers to the fact that we start at the output (where we measure the error) and propagate the gradient information backward through each layer.

Preliminaries
Before we derive the gradients, let's review some useful notation and facts we've already learned.
The error function is the squared difference between the prediction and target:

$$E = (y - \hat{y})^2$$

where $\hat{y}$ is the network's prediction and $y$ is the target.

A useful property of the sigmoid function is that its derivative can be written in terms of its own output:

$$\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$$

This is convenient because if we've already computed $\sigma(z)$ during the forward pass, we can get the derivative with one extra multiplication—no need to recompute anything.
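A quick numerical check of this property, comparing a finite-difference estimate of the sigmoid's slope with the closed form $\sigma(z)(1 - \sigma(z))$:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

z, eps = 0.7, 1e-6
# Finite-difference estimate of the sigmoid's slope at z.
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
# The convenient closed form: sigma(z) * (1 - sigma(z)).
s = sigmoid(z)
analytic = s * (1 - s)
print(round(numeric, 6), round(analytic, 6))
```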
Gradient for Output Layer Weights
Let's first compute the gradient for a weight in the output layer, like $w_2$ (the weight connecting the hidden unit to the output). Writing $z_y = w_0 x_1 + w_1 x_2 + w_2 h + b_y$ for the output unit's weighted sum, the chain rule gives:

$$\frac{\partial E}{\partial w_2} = \frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_y} \cdot \frac{\partial z_y}{\partial w_2}$$

Working through each piece:

- $\frac{\partial E}{\partial \hat{y}} = -2(y - \hat{y})$ — how the error changes with the output
- $\frac{\partial \hat{y}}{\partial z_y} = \hat{y}(1 - \hat{y})$ — the sigmoid derivative
- $\frac{\partial z_y}{\partial w_2} = h$ — the input that multiplies $w_2$

Putting it together:

$$\frac{\partial E}{\partial w_2} = -2(y - \hat{y}) \cdot \hat{y}(1 - \hat{y}) \cdot h$$

This is the gradient for an output layer weight. It depends on the error $(y - \hat{y})$, the sigmoid derivative, and the input arriving at that weight ($h$ in this case).
Gradient for Hidden Layer Weights
Computing the gradient for a hidden layer weight using backpropagation. The gradient flows backward from the error through the output layer to the hidden layer.
Now comes the key insight of backpropagation. For a hidden layer weight like $w_3$, the chain of dependencies is longer: $w_3$ affects $h$, which affects $\hat{y}$, which affects $E$. We compute the gradient in two steps:

$$\frac{\partial E}{\partial w_3} = \frac{\partial E}{\partial h} \cdot \frac{\partial h}{\partial w_3}$$
Step 1: How does the error change with the hidden unit's activation?
The hidden unit $h$ affects the error only through the output. Applying the chain rule through the output unit:

$$\frac{\partial E}{\partial h} = \frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z_y} \cdot \frac{\partial z_y}{\partial h}$$

The first two terms are the same as before. The third term is:

$$\frac{\partial z_y}{\partial h} = w_2$$

because $z_y = w_0 x_1 + w_1 x_2 + w_2 h + b_y$, and the only term involving $h$ is $w_2 h$.

Putting it together:

$$\frac{\partial E}{\partial h} = -2(y - \hat{y}) \cdot \hat{y}(1 - \hat{y}) \cdot w_2$$

Notice that $w_2$ appears in this expression: the error signal passes backward through the weight connecting the hidden unit to the output. This is the sense in which the error "propagates back" through the network.
Step 2: How does the hidden unit's activation change with the weight?
This part is just like the single perceptron case. With $z_h = w_3 x_1 + w_4 x_2 + b_h$:

$$\frac{\partial h}{\partial w_3} = h(1 - h) \cdot x_1$$

Putting it all together:

$$\frac{\partial E}{\partial w_3} = \underbrace{-2(y - \hat{y}) \cdot \hat{y}(1 - \hat{y}) \cdot w_2}_{\partial E / \partial h} \cdot \underbrace{h(1 - h) \cdot x_1}_{\partial h / \partial w_3}$$
The Pattern
Notice the pattern: to compute the gradient for a weight deep in the network, we multiply together derivatives along the path from the error back to that weight. Each layer contributes:
- A sigmoid derivative term
- The weight connecting to the next layer (for hidden layers)
- The input to that weight
This is why it's called "back"-propagation: we start at the output error and work backward, accumulating these terms as we go.
Once we have all the gradients, we update every weight using gradient descent:

$$w_i \leftarrow w_i - \alpha \frac{\partial E}{\partial w_i}$$

where $\alpha$ is the learning rate.
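These equations can be combined into a complete from-scratch training loop for the XOR network. This is a sketch: the weight names follow the chapter's two-stage network ($w_0, w_1, w_2, b_y$ into the output; $w_3, w_4, b_h$ into the hidden unit), while the random initialization, learning rate, and epoch count are arbitrary choices:

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

random.seed(1)
# Small random initial weights (names follow the chapter's diagram).
w0, w1, w2, w3, w4, b_h, b_y = [random.uniform(-1, 1) for _ in range(7)]

patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]  # XOR
alpha = 0.5

def total_error():
    err = 0.0
    for (x1, x2), y in patterns:
        h = sigmoid(w3 * x1 + w4 * x2 + b_h)
        y_hat = sigmoid(w0 * x1 + w1 * x2 + w2 * h + b_y)
        err += (y - y_hat) ** 2
    return err

before = total_error()
for epoch in range(20000):
    for (x1, x2), y in patterns:
        # Forward pass.
        h = sigmoid(w3 * x1 + w4 * x2 + b_h)
        y_hat = sigmoid(w0 * x1 + w1 * x2 + w2 * h + b_y)
        # Backward pass: error signal at the output...
        delta_y = -2 * (y - y_hat) * y_hat * (1 - y_hat)
        # ...propagated back through w2 to the hidden unit.
        delta_h = delta_y * w2 * h * (1 - h)
        # Gradient descent update for every weight.
        w0 -= alpha * delta_y * x1
        w1 -= alpha * delta_y * x2
        w2 -= alpha * delta_y * h
        b_y -= alpha * delta_y
        w3 -= alpha * delta_h * x1
        w4 -= alpha * delta_h * x2
        b_h -= alpha * delta_h

print(round(before, 3), round(total_error(), 3))  # error shrinks over training
```

The two `delta` lines are exactly the output-layer and hidden-layer gradient expressions derived above; everything else is bookkeeping.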
Vectors, Matrices, and Matrix Multiplication
To understand the mechanics of neural networks, we need a bit of linear algebra—specifically, vectors, matrices, and matrix multiplication.
A vector is an ordered list of numbers. For example:

$$\mathbf{x} = \begin{bmatrix} 2 \\ 5 \\ -1 \end{bmatrix}$$

This is a vector with three elements. Geometrically, you can think of a vector as an arrow pointing to a location in space—the vector above points to the position $(2, 5, -1)$ in 3-dimensional space.
In a neural network, the inputs to a layer are typically represented as a vector. For example, if a layer receives three input values, those values form a 3-dimensional vector. An image with 784 pixels becomes a 784-dimensional vector.
A matrix is a rectangular grid of numbers. For example:

$$W = \begin{bmatrix} 1 & 0 & -2 \\ 3 & 1 & 4 \end{bmatrix}$$

This is a $2 \times 3$ matrix: 2 rows and 3 columns.

Matrix-vector multiplication is the key operation. When you multiply a matrix $W$ by a vector $\mathbf{x}$, you get a new vector $\mathbf{y} = W\mathbf{x}$.

Each element of $\mathbf{y}$ is the dot product of one row of $W$ with the vector $\mathbf{x}$—multiply the corresponding entries and add them up:

$$\mathbf{y} = W\mathbf{x} = \begin{bmatrix} 1 \cdot 2 + 0 \cdot 5 + (-2) \cdot (-1) \\ 3 \cdot 2 + 1 \cdot 5 + 4 \cdot (-1) \end{bmatrix} = \begin{bmatrix} 4 \\ 7 \end{bmatrix}$$

So $\mathbf{y} = (4, 7)$: the matrix transformed a 3-dimensional vector into a 2-dimensional one.
In Python with PyTorch, this is just a single operation:

```python
import torch

# Define the weight matrix (2 rows, 3 columns)
W = torch.tensor([[1., 0., -2.],
                  [3., 1., 4.]])

# Define the input vector (3 elements)
x = torch.tensor([2., 5., -1.])

# Matrix-vector multiplication
y = W @ x  # or torch.matmul(W, x)
print(y)   # tensor([4., 7.])
```

Why does this matter for neural networks?
A single layer of a neural network does exactly this operation! Given an input vector $\mathbf{x}$, a layer computes:

$$\mathbf{y} = \sigma(W\mathbf{x} + \mathbf{b})$$

where $W$ is the weight matrix, $\mathbf{b}$ is a vector of biases, and the activation function $\sigma$ is applied to each element. Each row of $W$ holds the weights of one neuron, so multiplying by $W$ computes every neuron's weighted sum at once.
A neural network layer reinterpreted as matrix multiplication. Each row of the weight matrix $W$ contains the weights of one neuron in the layer.
This matrix formulation is not just a convenient notation—it's the reason neural networks can be trained efficiently on modern hardware. GPUs (graphics processing units) are specifically designed to perform many matrix multiplications in parallel at tremendous speed. By expressing neural network operations as matrix math, we can take full advantage of this hardware.
Putting It All Together
Let's tie everything together by sketching how a neural network uses all of these concepts.
A neural network is fundamentally a composition of nonlinear functions. Each layer takes its input, multiplies by a weight matrix, adds a bias, and applies a nonlinear activation function. The output of one layer becomes the input to the next:

$$\mathbf{h}^{(1)} = \sigma\!\left(W^{(1)}\mathbf{x} + \mathbf{b}^{(1)}\right), \qquad \mathbf{h}^{(2)} = \sigma\!\left(W^{(2)}\mathbf{h}^{(1)} + \mathbf{b}^{(2)}\right), \qquad \ldots$$

Here $W^{(k)}$ and $\mathbf{b}^{(k)}$ are the weight matrix and bias vector of layer $k$, and $\sigma$ is applied elementwise.
Why nonlinearity matters
A function is linear if it satisfies $f(a\mathbf{x} + b\mathbf{y}) = a f(\mathbf{x}) + b f(\mathbf{y})$—scaling or adding inputs just scales or adds the outputs. Matrix multiplication is linear in exactly this sense.

Here's the key insight: if you stack linear functions on top of each other, you just get another linear function. No matter how many layers you add, the whole network collapses to a single linear transformation. For example, if layer 1 computes $W^{(1)}\mathbf{x}$ and layer 2 computes $W^{(2)}\!\left(W^{(1)}\mathbf{x}\right)$, that is the same as applying the single matrix $W^{(2)}W^{(1)}$ to $\mathbf{x}$.

Nonlinear activation functions like sigmoid break this collapse. Because $\sigma$ is nonlinear, each additional layer genuinely expands the set of functions the network can represent.
Training uses a loss function. We compare the network's output $\hat{\mathbf{y}}$ to the target $\mathbf{y}$ with a loss function (like the squared error we used earlier), producing a single number that measures how wrong the network is.
Backpropagation computes gradients. Using the chain rule, we compute how the loss depends on every weight in the network. Starting from the output and working backward through each layer, we compute:

$$\frac{\partial L}{\partial W^{(k)}} \quad \text{for each layer } k$$

(and similarly for the biases). This is the "back-propagation" of gradients through the chain of composed functions.
SGD updates the weights. Using these gradients, we update every weight in the network:

$$W^{(k)} \leftarrow W^{(k)} - \alpha \frac{\partial L}{\partial W^{(k)}}$$
This process repeats over many mini-batches. Gradually the network's predictions improve and the loss decreases.
That's the core loop of neural network training: forward pass (compute the output), compute the loss, backward pass (compute gradients via the chain rule), and update weights (via SGD). All of the math we covered in this chapter—derivatives, the chain rule, gradients, gradient descent, and matrix multiplication—comes together in this loop.
In PyTorch
In practice, we don't compute these gradients by hand. Modern deep learning frameworks like PyTorch do it automatically. Here's all the code you need to define and train our XOR network:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.i2h = nn.Linear(2, 1)    # input to hidden (w3, w4)
        self.all2o = nn.Linear(3, 1)  # inputs + hidden to output (w0, w1, w2)

    def forward(self, x):
        # Stage 1: compute hidden unit
        netinput_h = self.i2h(x)
        h = torch.sigmoid(netinput_h)
        # Stage 2: compute output
        x2 = torch.cat((x, h))        # combine inputs with hidden
        netinput_y = self.all2o(x2)
        out = torch.sigmoid(netinput_y)
        return out

# Training step
def update(pattern, target, net, optimizer):
    optimizer.zero_grad()              # reset gradients
    output = net(pattern)              # forward pass
    loss = F.mse_loss(output, target)  # compute error
    loss.backward()                    # backward pass (computes all gradients!)
    optimizer.step()                   # update weights with gradient descent
```

The key line is `loss.backward()`—this single call runs the entire backpropagation algorithm, computing $\frac{\partial E}{\partial w}$ for every weight in the network.
You now know what's happening under the hood when that line runs!
Learn More
If you'd like to go deeper into the math behind neural networks, here are some excellent resources:
- 3Blue1Brown: Essence of Calculus — A beautiful visual introduction to calculus, including derivatives and the chain rule.
- 3Blue1Brown: Essence of Linear Algebra — An equally excellent visual series on vectors, matrices, and linear transformations.
- 3Blue1Brown: Neural Networks — A series that ties together the math and neural networks, including a great explanation of backpropagation.
- Khan Academy: Multivariable Calculus — Comprehensive lessons on partial derivatives and gradients.