Math for Neural Networks DRAFT
Note
This chapter was authored by Todd M. Gureckis. See the license information for this book. Because this is a draft chapter, it should not be considered complete or accurate!
Introduction
Neural networks are one of the most exciting and impactful ideas in cognitive science and artificial intelligence. However, understanding how they work requires some mathematical background that might not be shared by all students.
The goal of these notes is to build up the key ideas you need with an emphasis on intuition rather than formal rigor. We want you to understand why these math concepts matter for neural networks.
The chapter follows a natural progression: we start with calculus fundamentals—derivatives—that let us measure how changing inputs affects outputs. We then introduce the perceptron (the simplest neural network) and show how to measure its errors with a loss function. With a concrete model in hand, we extend derivatives to gradients (for functions with many inputs) and introduce gradient descent, the core optimization strategy for training. Along the way, we'll see why single-layer networks have fundamental limits and how multi-layer networks overcome them. The chain rule and backpropagation show how gradients flow backward through layers of computation. Finally, linear algebra (vectors and matrices) reveals how all of this scales efficiently to large networks.
Let's get started.
Functions and Derivatives
A **function** is simply a rule that takes an input and produces an output. For example, the function $f(x) = x^2$ takes a number and returns its square.
Explore different functions by selecting them from the dropdown. Notice how each function maps inputs to outputs in a different way.
In the context of neural networks, functions are everywhere. Each layer of a neural network applies a function to its inputs to produce outputs. The whole network is one big function that maps inputs (like pixel values of an image) to outputs (like the probability that the image contains a cat).
A **derivative** tells you how much the output of a function changes when you make a tiny change to the input. More precisely, the derivative of $f$ at a point $x$ is the slope of the function at that point:

$$f'(x) = \lim_{\Delta x \to 0} \frac{f(x + \Delta x) - f(x)}{\Delta x}$$
Derivative intuition
Imagine you are hiking on a hill and your altitude is given by the function $f(x)$, where $x$ is your horizontal position. The derivative $f'(x)$ is the steepness of the hill at that point: positive means you are climbing, negative means you are descending, and zero means the ground is flat.
For the function $f(x) = x^2$, the derivative is $f'(x) = 2x$:

- At $x = 2$, the slope is $4$—the function is increasing steeply.
- At $x = 0$, the slope is $0$—the function is flat at its minimum.
- At $x = -1$, the slope is $-2$—the function is decreasing.
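The slopes above can be checked numerically with a finite difference: nudge the input by a tiny amount and see how much the output moves. This is a minimal sketch assuming the example function $f(x) = x^2$:

```python
# Numerically estimate the derivative (slope) of f(x) = x^2 at a few points.

def f(x):
    return x ** 2

def slope(f, x, eps=1e-6):
    """Approximate f'(x) as (f(x + eps) - f(x - eps)) / (2 * eps)."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

for x in [2.0, 0.0, -1.0]:
    print(x, round(slope(f, x), 4))
```

The printed slopes match $f'(x) = 2x$: steeply positive at $x = 2$, zero at the minimum, negative at $x = -1$.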
You can explore this intuition below. Drag your mouse over the curve to see the slope at that point. The formula for the derivative is also displayed, showing how it is computed.
Interactive derivative explorer. Drag your mouse over the curve to see the slope (derivative) at each point. The tangent line shows the direction the function is heading.
Why do derivatives matter for neural networks? Because training a neural network means finding the settings (called weights) that make the network perform well. To find good weights, we need to know how changing a weight affects the network's performance. That's exactly what a derivative tells us.
Let's make this concrete by introducing the simplest neural network.
The Perceptron
The perceptron is the simplest possible neural network. It has some inputs, a set of weights, a bias, and produces a single output.
Here's what it looks like with two inputs:
A perceptron with two inputs ($x_1$ and $x_2$), a weight for each input, a bias, and a single output.
The perceptron computes its output in two steps:
Weighted sum: Multiply each input by its corresponding weight, add them up, and add the bias: $$\text{net} = w_1 x_1 + w_2 x_2 + b$$
Activation function: Pass this sum through an activation function $g$ to produce the output: $$\hat{y} = g(\text{net})$$
The activation function we'll use is the sigmoid (also called the logistic function): $$g(z) = \frac{1}{1 + e^{-z}}$$
Why use an activation function?
The sigmoid squashes any input into a value between 0 and 1. This is useful for several reasons:
- It lets us interpret the output as a probability
- It introduces nonlinearity, which is essential for learning complex patterns
- It's differentiable everywhere, which we need for gradient descent
The sigmoid isn't the only activation function. The original perceptron used a simple threshold (or step) function: output 1 if the weighted sum exceeds a threshold, otherwise output 0. Other common choices include ReLU (rectified linear unit) and tanh:
Common activation functions used in neural networks. Each has different properties: sigmoid outputs values between 0 and 1, tanh outputs between -1 and 1, and ReLU outputs zero for negative inputs and the input itself for positive inputs.
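The three activation functions above are each a one-line formula. Here is a small sketch writing them out directly (using Python's standard `math` module):

```python
import math

# Sigmoid: squashes any input into (0, 1)
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tanh: squashes any input into (-1, 1)
def tanh(z):
    return math.tanh(z)

# ReLU: zero for negative inputs, the input itself for positive inputs
def relu(z):
    return max(0.0, z)

for z in [-2.0, 0.0, 2.0]:
    print(z, round(sigmoid(z), 3), round(tanh(z), 3), relu(z))
```

Note that sigmoid(0) is exactly 0.5, tanh(0) is 0, and ReLU simply clips negative values to zero.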
You can explore how a perceptron with a threshold activation function works in the interactive Neuron Sandbox.
Here's a summary of the notation:
| Symbol | Meaning |
|---|---|
| $x_1, x_2$ | Input values |
| $w_1, w_2$ | Weights (what we learn) |
| $b$ | Bias (also learned) |
| $g$ | Activation function (sigmoid) |
| $\hat{y}$ | Predicted output |
| $y$ | Target output (what we want) |
So the full computation is:

$$\hat{y} = g(w_1 x_1 + w_2 x_2 + b)$$

This single equation describes everything the perceptron does. Given inputs $x_1$ and $x_2$, it multiplies them by the weights, adds the bias, and squashes the result through the sigmoid.

The goal of training is to find values for $w_1$, $w_2$, and $b$ that make the perceptron's predictions match the targets.
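The two-step computation can be sketched in a few lines of Python (the weight and bias values passed in here are placeholders, not trained values):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(x1, x2, w1, w2, b):
    """Weighted sum followed by the sigmoid activation."""
    net = w1 * x1 + w2 * x2 + b
    return sigmoid(net)

# With all weights at zero, net = 0 and the output is sigmoid(0) = 0.5:
# the perceptron is maximally uncertain before training.
print(perceptron(1.0, 0.0, 0.0, 0.0, 0.0))  # 0.5
```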
Computing the Error (or Loss) of a Perceptron Network
Now that we have a perceptron, we need a way to measure how wrong its predictions are. This measurement is called the error or loss.
Let's train our perceptron to compute the logical OR function—the same example from the lecture. OR should output 1 if either (or both) inputs are 1, and 0 only when both inputs are 0:
| $x_1$ | $x_2$ | Target output ($y$) |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 1 |
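A perceptron can compute OR with hand-picked weights. The sketch below uses the original perceptron's threshold (step) activation rather than the sigmoid, and the particular weights ($w_1 = w_2 = 1$, $b = -0.5$) are one illustrative choice, not values found by training:

```python
def step(z):
    # Original perceptron activation: fire if the weighted sum is positive
    return 1 if z > 0 else 0

def perceptron_or(x1, x2, w1=1.0, w2=1.0, b=-0.5):
    return step(w1 * x1 + w2 * x2 + b)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, perceptron_or(x1, x2))
```

The weighted sum exceeds zero whenever at least one input is 1, reproducing the OR truth table above.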
The hidden unit can learn to detect a useful intermediate feature. In the case of XOR, the hidden unit essentially learns to detect when both inputs are 1—which is exactly the case where XOR should output 0 instead of 1.
Here's what the network learns:
| $x_1$ | $x_2$ | Hidden unit ($h$) | Output ($\hat{y}$) | Target ($y$) |
|---|---|---|---|---|
| 0 | 0 | ~0 | ~0.05 | 0 |
| 0 | 1 | ~0 | ~0.94 | 1 |
| 1 | 0 | ~0 | ~0.94 | 1 |
| 1 | 1 | ~1 | ~0.05 | 0 |
The hidden unit outputs ~1 only when both inputs are 1. The output unit then uses this information to produce the correct XOR output.
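The same two-stage idea can be demonstrated with hand-chosen weights. This sketch uses threshold (step) units instead of sigmoids for clarity, and the specific weight values are illustrative, not ones found by gradient descent: the hidden unit detects "both inputs are 1," and the output unit computes OR of the inputs minus a strong penalty when the hidden unit fires:

```python
def step(z):
    return 1 if z > 0 else 0

def xor_net(x1, x2):
    # Hidden unit: fires only when both inputs are 1 (an AND detector)
    h = step(1.0 * x1 + 1.0 * x2 - 1.5)
    # Output unit: OR of the inputs, suppressed when h fires
    return step(1.0 * x1 + 1.0 * x2 - 2.0 * h - 0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))
```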
The power of hidden layers
By adding hidden units, we can solve problems that are impossible for a single perceptron. Each hidden unit can learn to detect a different feature or pattern in the input. With enough hidden units, neural networks can approximate any function—this is called the universal approximation theorem. This is why modern deep learning uses networks with many layers and millions of hidden units.
Multi-layer Perceptron (MLP)
A network with one or more hidden layers is called a multi-layer perceptron (MLP). Let's write out exactly what happens when we add a hidden layer, because understanding this structure is key to understanding how we train these networks.
A multi-layer perceptron (MLP) with two inputs, one hidden unit, and one output. The computation flows left to right in two stages.
In this network, the computation happens in stages. Each stage is a function, and the output of one function becomes the input to the next. This is called function composition.
Stage 1: Compute the hidden unit
The hidden unit $h$ receives the two inputs, computes a weighted sum, and applies the sigmoid:

$$h = g(w_3 x_1 + w_4 x_2 + b_h)$$

Here $w_3$ and $w_4$ are the hidden unit's weights and $b_h$ is its bias.
Stage 2: Compute the output
The output unit receives the original inputs AND the hidden unit's output, computes another weighted sum, and applies the sigmoid:

$$\hat{y} = g(w_0 x_1 + w_1 x_2 + w_2 h + b_y)$$

Here $w_0$, $w_1$, and $w_2$ are the output unit's weights and $b_y$ is its bias.
Notice what's happening: the output $\hat{y}$ depends on $h$, and $h$ in turn depends on the inputs and the hidden-layer weights. Substituting the first stage into the second:

$$\hat{y} = g\big(w_0 x_1 + w_1 x_2 + w_2 \, g(w_3 x_1 + w_4 x_2 + b_h) + b_y\big)$$

This looks complicated, but it's just functions nested inside functions—like $f(g(x))$.
The Gradient Problem
For the output layer weights ($w_0$, $w_1$, $w_2$), computing the gradient works just like in the single perceptron case: each of these weights directly affects the output, so we can see directly how changing it changes the error.

But what about the hidden layer weights ($w_3$ and $w_4$)? They don't affect the error directly—they affect the hidden unit $h$, which affects the output $\hat{y}$, which affects the error.

How do we compute $\frac{\partial E}{\partial w_3}$ when $w_3$ is buried inside a nested function?
The answer lies in the chain rule from calculus, which tells us how to compute derivatives of composed functions.
The Chain Rule
The chain rule tells us how to differentiate composed functions. If $y = f(g(x))$, then:

$$\frac{dy}{dx} = f'(g(x)) \cdot g'(x)$$
In words: multiply the derivative of the outer function by the derivative of the inner function. We "chain" the derivatives together.
Worked example
Suppose $y = (2x + 1)^2$. This is a composition: the inner function is $g(x) = 2x + 1$ and the outer function is $f(u) = u^2$.

Using the chain rule:

- The derivative of $f(u) = u^2$ with respect to $u$ is $2u$
- The derivative of $g(x) = 2x + 1$ with respect to $x$ is $2$
- By the chain rule: $\frac{dy}{dx} = 2(2x + 1) \cdot 2 = 4(2x + 1)$

Let's check at $x = 1$. Expanding directly, $y = 4x^2 + 4x + 1$, so $\frac{dy}{dx} = 8x + 4$, which is $12$ at $x = 1$. The chain rule gives $4(2 \cdot 1 + 1) = 12$. They agree.
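A numerical check of the chain rule, using the simple composed function $y = (2x + 1)^2$ as an illustration: the chain-rule derivative should match a finite-difference estimate.

```python
def y(x):
    return (2 * x + 1) ** 2

def dy_dx(x):
    # Chain rule: outer derivative 2*(2x+1) times inner derivative 2
    return 2 * (2 * x + 1) * 2

# Finite-difference estimate of the derivative at x = 1
eps = 1e-6
x = 1.0
numeric = (y(x + eps) - y(x - eps)) / (2 * eps)
print(dy_dx(x), round(numeric, 3))  # both 12
```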
Applying the Chain Rule to Neural Networks
Now we can answer our question: how do we compute $\frac{\partial E}{\partial w_3}$, the gradient for a hidden layer weight?

We break it into steps, following the chain of dependencies:

- The error $E$ depends on the output $\hat{y}$
- The output $\hat{y}$ depends on the hidden unit $h$
- The hidden unit $h$ depends on the weight $w_3$

Using the chain rule:

$$\frac{\partial E}{\partial w_3} = \frac{\partial E}{\partial h} \cdot \frac{\partial h}{\partial w_3}$$

Each piece is something we can compute:

- $\frac{\partial h}{\partial w_3}$ is just like the single perceptron case—it involves the sigmoid derivative and the input $x_1$
- $\frac{\partial E}{\partial h}$ requires another application of the chain rule, going through the output layer
This process of computing gradients by working backward from the error, layer by layer, is called backpropagation. Let's work through it in detail.
The Backpropagation Algorithm
Backpropagation is just the chain rule applied systematically through the network. The "back" in backpropagation refers to the fact that we start at the output (where we measure the error) and propagate the gradient information backward through each layer.

Preliminaries
Before we derive the gradients, let's review some useful notation and facts we've already learned.
The error function is the squared difference between the prediction and target:

$$E = \frac{1}{2}(y - \hat{y})^2$$

where $y$ is the target output and $\hat{y}$ is the network's prediction. (The factor of $\frac{1}{2}$ just makes the derivative cleaner.)
A useful property of the sigmoid function is that its derivative can be written in terms of its own output:

$$g'(z) = g(z)\big(1 - g(z)\big)$$

This is convenient because if we've already computed $g(z)$ during the forward pass, we get the derivative almost for free.
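The identity $g'(z) = g(z)(1 - g(z))$ is easy to verify numerically by comparing it against a finite-difference estimate at a few points:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

eps = 1e-6
for z in [-1.0, 0.0, 2.0]:
    # Finite-difference estimate of the sigmoid's derivative
    numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
    # The closed-form identity
    identity = sigmoid(z) * (1 - sigmoid(z))
    print(z, round(numeric, 6), round(identity, 6))
```

At $z = 0$ the derivative is exactly $0.5 \cdot 0.5 = 0.25$, the sigmoid's steepest point.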
Gradient for Output Layer Weights
Let's first compute the gradient for a weight in the output layer. Take $w_0$, the weight that multiplies input $x_1$ on the way to the output. By the chain rule:

$$\frac{\partial E}{\partial w_0} = \frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial \text{net}_y} \cdot \frac{\partial \text{net}_y}{\partial w_0}$$

Working through each piece:

- $\frac{\partial E}{\partial \hat{y}} = -(y - \hat{y})$ — how the error changes with the output
- $\frac{\partial \hat{y}}{\partial \text{net}_y} = \hat{y}(1 - \hat{y})$ — the sigmoid derivative
- $\frac{\partial \text{net}_y}{\partial w_0} = x_1$ — the input that multiplies $w_0$

Putting it together:

$$\frac{\partial E}{\partial w_0} = -(y - \hat{y}) \cdot \hat{y}(1 - \hat{y}) \cdot x_1$$

This is the gradient for an output layer weight. It depends on the error $(y - \hat{y})$, the sigmoid derivative, and the input to the weight—all quantities we already have from the forward pass.
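As a sanity check, the output-layer gradient formula can be compared against a finite-difference estimate. This sketch assumes the squared-error loss $E = \frac{1}{2}(y - \hat{y})^2$ with a sigmoid output; the constant `c` is a hypothetical stand-in for the rest of the weighted sum (the other weights and the bias), and all numeric values are arbitrary:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x1, c, w0, y = 1.0, 0.3, 0.5, 1.0  # arbitrary illustrative values

def error(w0):
    y_hat = sigmoid(w0 * x1 + c)
    return 0.5 * (y - y_hat) ** 2

# Analytic gradient: -(y - y_hat) * y_hat * (1 - y_hat) * x1
y_hat = sigmoid(w0 * x1 + c)
analytic = -(y - y_hat) * y_hat * (1 - y_hat) * x1

# Finite-difference gradient for comparison
eps = 1e-6
numeric = (error(w0 + eps) - error(w0 - eps)) / (2 * eps)
print(round(analytic, 6), round(numeric, 6))
```

The two values agree to many decimal places, confirming the derivation.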
Gradient for Hidden Layer Weights
Computing the gradient for a hidden layer weight using backpropagation. The gradient flows backward from the error through the output layer to the hidden layer.
Now comes the key insight of backpropagation. For a hidden layer weight like $w_3$, the gradient has to flow backward through the output layer. We compute it in two steps.
Step 1: How does the error change with the hidden unit's activation?
The hidden unit $h$ affects the error only through the output $\hat{y}$, so:

$$\frac{\partial E}{\partial h} = \frac{\partial E}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial \text{net}_y} \cdot \frac{\partial \text{net}_y}{\partial h}$$

The first two terms are the same as before. The third term is:

$$\frac{\partial \text{net}_y}{\partial h} = w_2$$

because $\text{net}_y = w_0 x_1 + w_1 x_2 + w_2 h + b_y$, and the only term involving $h$ is $w_2 h$.

Putting it together:

$$\frac{\partial E}{\partial h} = -(y - \hat{y}) \cdot \hat{y}(1 - \hat{y}) \cdot w_2$$

Notice that the weight $w_2$ appears here: the error signal travels backward through the same connection that carried $h$ forward to the output.
Step 2: How does the hidden unit's activation change with the weight?
This part is just like the single perceptron case:

$$\frac{\partial h}{\partial w_3} = h(1 - h) \cdot x_1$$

Putting it all together:

$$\frac{\partial E}{\partial w_3} = \underbrace{-(y - \hat{y}) \cdot \hat{y}(1 - \hat{y}) \cdot w_2}_{\partial E / \partial h} \cdot \underbrace{h(1 - h) \cdot x_1}_{\partial h / \partial w_3}$$
The Pattern
Notice the pattern: to compute the gradient for a weight deep in the network, we multiply together derivatives along the path from the error back to that weight. Each layer contributes:
- A sigmoid derivative term
- The weight connecting to the next layer (for hidden layers)
- The input to that weight
This is why it's called "back"-propagation: we start at the output error and work backward, accumulating these terms as we go.
Once we have all the gradients, we update every weight using gradient descent:

$$w \leftarrow w - \eta \frac{\partial E}{\partial w}$$

where $\eta$ is the learning rate, which controls how big a step we take in the direction that reduces the error.
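The update rule in action: a minimal gradient descent loop on the simple one-dimensional function $f(w) = (w - 3)^2$ (a toy stand-in for the network's error, chosen here for illustration), whose minimum is at $w = 3$:

```python
def grad(w):
    # Derivative of (w - 3)^2
    return 2 * (w - 3)

w = 0.0    # start far from the minimum
eta = 0.1  # learning rate

for _ in range(100):
    w = w - eta * grad(w)  # the gradient descent update rule

print(round(w, 4))  # converges to the minimum at w = 3
```

Each step moves $w$ downhill; the steps shrink as the slope flattens near the minimum.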
Vectors, Matrices, and Matrix Multiplication
To understand the mechanics of neural networks, we need a bit of linear algebra—specifically, vectors, matrices, and matrix multiplication.
A vector is an ordered list of numbers. For example:

$$\mathbf{x} = \begin{bmatrix} 2 \\ 5 \\ -1 \end{bmatrix}$$

This is a vector with three elements. Geometrically, you can think of a vector as an arrow pointing to a location in space—the vector above points to the position $(2, 5, -1)$ in three-dimensional space.
In a neural network, the inputs to a layer are typically represented as a vector. For example, if a layer receives three input values, those values form a 3-dimensional vector. An image with 784 pixels becomes a 784-dimensional vector.
A matrix is a rectangular grid of numbers. For example:

$$W = \begin{bmatrix} 1 & 0 & -2 \\ 3 & 1 & 4 \end{bmatrix}$$

This is a $2 \times 3$ matrix: 2 rows and 3 columns.

Matrix-vector multiplication is the key operation. When you multiply a matrix $W$ by a vector $\mathbf{x}$, each element of the result is the dot product of one row of $W$ with $\mathbf{x}$: multiply the row's entries by the vector's entries pairwise and add them up.

Each element of $\mathbf{y} = W\mathbf{x}$ comes from one row:

$$\mathbf{y} = \begin{bmatrix} 1 \cdot 2 + 0 \cdot 5 + (-2)(-1) \\ 3 \cdot 2 + 1 \cdot 5 + 4 \cdot (-1) \end{bmatrix} = \begin{bmatrix} 4 \\ 7 \end{bmatrix}$$

So the $2 \times 3$ matrix maps a 3-dimensional input vector to a 2-dimensional output vector.
In Python with PyTorch, this is just a single operation:
import torch
# Define the weight matrix (2 rows, 3 columns)
W = torch.tensor([[1., 0., -2.],
[3., 1., 4.]])
# Define the input vector (3 elements)
x = torch.tensor([2., 5., -1.])
# Matrix-vector multiplication
y = W @ x # or torch.matmul(W, x)
print(y) # tensor([4., 7.])Why does this matter for neural networks?
A single layer of a neural network does exactly this operation! Given an input vector $\mathbf{x}$, a layer computes:

$$\mathbf{h} = g(W\mathbf{x} + \mathbf{b})$$

where $W$ is the layer's weight matrix, $\mathbf{b}$ is a vector of biases, and the activation function $g$ is applied to each element of the result.
A neural network layer reinterpreted as matrix multiplication. Each row of the weight matrix $W$ holds the incoming weights of one neuron in the layer.
This matrix formulation is not just a convenient notation—it's the reason neural networks can be trained efficiently on modern hardware. GPUs (graphics processing units) are specifically designed to perform many matrix multiplications in parallel at tremendous speed. By expressing neural network operations as matrix math, we can take full advantage of this hardware.
Putting It All Together
Let's tie everything together by sketching how a neural network uses all of these concepts.
A neural network is fundamentally a composition of nonlinear functions. Each layer takes its input, multiplies by a weight matrix, adds a bias, and applies a nonlinear activation function. The output of one layer becomes the input to the next:

$$\mathbf{h}^{(1)} = g(W^{(1)}\mathbf{x} + \mathbf{b}^{(1)}), \qquad \mathbf{h}^{(2)} = g(W^{(2)}\mathbf{h}^{(1)} + \mathbf{b}^{(2)}), \qquad \ldots$$

Here $W^{(1)}, W^{(2)}, \ldots$ are the weight matrices and $\mathbf{b}^{(1)}, \mathbf{b}^{(2)}, \ldots$ the bias vectors of the successive layers.
Why nonlinearity matters
A function is linear if it satisfies $f(a\mathbf{x} + b\mathbf{y}) = a f(\mathbf{x}) + b f(\mathbf{y})$—scaling or adding inputs simply scales or adds the outputs. Matrix multiplication is linear in exactly this sense.

Here's the key insight: if you stack linear functions on top of each other, you just get another linear function. No matter how many layers you add, the whole network collapses to a single linear transformation. For example, if layer 1 computes $W^{(1)}\mathbf{x}$ and layer 2 computes $W^{(2)}(W^{(1)}\mathbf{x})$, that is the same as computing $(W^{(2)}W^{(1)})\mathbf{x}$—one matrix applied once.

Nonlinear activation functions like sigmoid break this collapse. Because $g$ bends its input rather than merely scaling it, each layer can reshape the data in ways no single linear map could, and stacking layers genuinely increases what the network can represent.
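The collapse of stacked linear layers can be demonstrated directly. This sketch (in plain Python, with small hand-written matrix helpers) shows that two matrix multiplications applied in sequence give exactly the same result as one multiplication by the product matrix:

```python
def matvec(M, v):
    # Multiply matrix M (list of rows) by vector v
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def matmul(A, B):
    # Multiply matrix A by matrix B
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

W1 = [[1.0, 2.0], [0.0, -1.0]]
W2 = [[3.0, 1.0], [2.0, 0.0]]
x = [1.0, 4.0]

stacked = matvec(W2, matvec(W1, x))    # two "layers" with no activation
collapsed = matvec(matmul(W2, W1), x)  # one equivalent layer
print(stacked, collapsed)              # identical results
```

Inserting a nonlinearity such as the sigmoid between the two multiplications is exactly what prevents this collapse.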
Training uses a loss function. We compare the network's output $\hat{y}$ to the target $y$ with a loss function, such as the squared error we used earlier, which measures how wrong the prediction is.
Backpropagation computes gradients. Using the chain rule, we compute how the loss depends on every weight in the network. Starting from the output and working backward through each layer, we compute:

$$\frac{\partial L}{\partial W^{(l)}}$$

(and similarly for the biases). This is the "back-propagation" of gradients through the chain of composed functions.
SGD updates the weights. Using these gradients, we update every weight in the network:

$$W^{(l)} \leftarrow W^{(l)} - \eta \frac{\partial L}{\partial W^{(l)}}$$
This process repeats over many mini-batches. Gradually the network's predictions improve and the loss decreases.
That's the core loop of neural network training: forward pass (compute the output), compute the loss, backward pass (compute gradients via the chain rule), and update weights (via SGD). All of the math we covered in this chapter—derivatives, the chain rule, gradients, gradient descent, and matrix multiplication—comes together in this loop.
In PyTorch
In practice, we don't compute these gradients by hand. Modern deep learning frameworks like PyTorch do it automatically. Here's all the code you need to define and train our XOR network:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.i2h = nn.Linear(2, 1)    # input to hidden (w3, w4)
        self.all2o = nn.Linear(3, 1)  # inputs + hidden to output (w0, w1, w2)

    def forward(self, x):
        # Stage 1: compute hidden unit
        netinput_h = self.i2h(x)
        h = torch.sigmoid(netinput_h)
        # Stage 2: compute output
        x2 = torch.cat((x, h))  # combine inputs with hidden
        netinput_y = self.all2o(x2)
        out = torch.sigmoid(netinput_y)
        return out

# Training step
def update(pattern, target, net, optimizer):
    optimizer.zero_grad()              # reset gradients
    output = net(pattern)              # forward pass
    loss = F.mse_loss(output, target)  # compute error
    loss.backward()                    # backward pass (computes all gradients!)
    optimizer.step()                   # update weights with gradient descent
```

The key line is `loss.backward()`—this single call runs the entire backpropagation algorithm, computing $\frac{\partial \text{loss}}{\partial w}$ for every weight $w$ in the network.
You now know what's happening under the hood when that line runs!
Learn More
If you'd like to go deeper into the math behind neural networks, here are some excellent resources:
- 3Blue1Brown: Essence of Calculus — A beautiful visual introduction to calculus, including derivatives and the chain rule.
- 3Blue1Brown: Essence of Linear Algebra — An equally excellent visual series on vectors, matrices, and linear transformations.
- 3Blue1Brown: Neural Networks — A series that ties together the math and neural networks, including a great explanation of backpropagation.
- Khan Academy: Multivariable Calculus — Comprehensive lessons on partial derivatives and gradients.