
Math for Neural Networks DRAFT

Note

This chapter was authored by Todd M. Gureckis. See the license information for this book. Because this is a draft chapter, it should not be considered complete or accurate!

Introduction

Neural networks are one of the most exciting and impactful ideas in cognitive science and artificial intelligence. However, understanding how they work requires some mathematical background that might not be shared by all students.

The goal of these notes is to build up the key ideas you need with an emphasis on intuition rather than formal rigor. We want you to understand why these math concepts matter for neural networks.

The chapter follows a natural progression: we start with calculus fundamentals—derivatives—that let us measure how changing inputs affects outputs. We then introduce the perceptron (the simplest neural network) and show how to measure its errors with a loss function. With a concrete model in hand, we extend derivatives to gradients (for functions with many inputs) and introduce gradient descent, the core optimization strategy for training. Along the way, we'll see why single-layer networks have fundamental limits and how multi-layer networks overcome them. The chain rule and backpropagation show how gradients flow backward through layers of computation. Finally, linear algebra (vectors and matrices) reveals how all of this scales efficiently to large networks.

Let's get started.

Functions and Derivatives

A function is simply a rule that takes an input and produces an output. For example, the function f(x) = x² takes a number x and returns its square. If x = 3, then f(3) = 9.

Explore different functions by selecting them from the dropdown. Notice how each function maps inputs to outputs in a different way.

In the context of neural networks, functions are everywhere. Each layer of a neural network applies a function to its inputs to produce outputs. The whole network is one big function that maps inputs (like pixel values of an image) to outputs (like the probability that the image contains a cat).

A derivative tells you how much the output of a function changes when you make a tiny change to the input. More precisely, the derivative of f(x) at a particular point x is the slope of the function at that point—the rate at which f is increasing or decreasing.

Derivative intuition

Imagine you are hiking on a hill and your altitude is given by the function f(x), where x is your horizontal position. The derivative f′(x) tells you how steep the hill is at your current location. If f′(x) > 0, you're going uphill. If f′(x) < 0, you're going downhill. If f′(x) = 0, you're on flat ground (perhaps at the top or bottom of the hill).

For the function f(x) = x², the derivative is f′(x) = 2x. This means:

  • At x = 3, the slope is f′(3) = 6—the function is increasing steeply.
  • At x = 0, the slope is f′(0) = 0—the function is flat at its minimum.
  • At x = −2, the slope is f′(−2) = −4—the function is decreasing.
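These slope values are easy to check numerically. A minimal Python sketch, using a central-difference approximation with a small (arbitrarily chosen) step h:

```python
# Approximate f'(x) for f(x) = x^2 with a central difference:
# f'(x) ≈ (f(x + h) - f(x - h)) / (2h) for a small step h.
def f(x):
    return x ** 2

def slope(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

print(slope(f, 3))   # ≈ 6: increasing steeply
print(slope(f, 0))   # ≈ 0: flat at the minimum
print(slope(f, -2))  # ≈ -4: decreasing
```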

You can explore this intuition below. Drag your mouse over the curve to see the slope at that point. The formula for the derivative is also displayed, showing how it is computed.


Interactive derivative explorer. Drag your mouse over the curve to see the slope (derivative) at each point. The tangent line shows the direction the function is heading.

Why do derivatives matter for neural networks? Because training a neural network means finding the settings (called weights) that make the network perform well. To find good weights, we need to know how changing a weight affects the network's performance. That's exactly what a derivative tells us.

Let's make this concrete by introducing the simplest neural network.

The Perceptron

The perceptron is the simplest possible neural network. It takes some inputs, applies a set of weights and a bias, and produces a single output.

Here's what it looks like with two inputs:

A perceptron with two inputs

A perceptron with two inputs (x₀ and x₁), two weights (w₀ and w₁), a bias (b), and one output (ŷ). The perceptron computes a weighted sum of its inputs, adds the bias, and passes the result through an activation function.

The perceptron computes its output in two steps:

  1. Weighted sum: Multiply each input by its corresponding weight, add them up, and add the bias:

    net = x₀w₀ + x₁w₁ + b = Σᵢ xᵢwᵢ + b
  2. Activation function: Pass this sum through an activation function g to produce the output:

    ŷ = g(net)

The activation function we'll use is the sigmoid (also called the logistic function):

g(net) = 1 / (1 + e^(−net))

Why use an activation function?

The sigmoid squashes any input into a value between 0 and 1. This is useful for several reasons:

  • It lets us interpret the output as a probability
  • It introduces nonlinearity, which is essential for learning complex patterns
  • It's differentiable everywhere, which we need for gradient descent

The sigmoid isn't the only activation function. The original perceptron used a simple threshold (or step) function: output 1 if the weighted sum exceeds a threshold, otherwise output 0. Other common choices include ReLU (rectified linear unit) and tanh:

Common activation functions used in neural networks. Each has different properties: sigmoid outputs values between 0 and 1, tanh outputs between -1 and 1, and ReLU outputs zero for negative inputs and the input itself for positive inputs.

You can explore how a perceptron with a threshold activation function works in the interactive Neuron Sandbox.

Here's a summary of the notation:

Symbol | Meaning
x₀, x₁ | Input values
w₀, w₁ | Weights (what we learn)
b      | Bias (also learned)
g      | Activation function (sigmoid)
ŷ      | Predicted output
y      | Target output (what we want)

So the full computation is:

ŷ = g(Σᵢ xᵢwᵢ + b) = 1 / (1 + e^(−(Σᵢ xᵢwᵢ + b)))

This single equation describes everything the perceptron does. Given inputs x₀ and x₁, it computes a weighted sum, adds the bias, and passes the result through the sigmoid to produce ŷ.
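The whole computation fits in a few lines of Python. This is a minimal sketch; the weight values below are made up purely for illustration:

```python
import math

def sigmoid(net):
    # the logistic activation function g
    return 1 / (1 + math.exp(-net))

def perceptron(x0, x1, w0, w1, b):
    net = x0 * w0 + x1 * w1 + b   # weighted sum plus bias
    return sigmoid(net)           # squash into (0, 1)

# Example: with these made-up weights, input (1, 0) gives net = 1
y_hat = perceptron(1, 0, w0=2.0, w1=2.0, b=-1.0)
print(y_hat)  # sigmoid(1) ≈ 0.731
```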

The goal of training is to find values for w₀, w₁, and b that make ŷ close to the target y for all our training examples. To do that, we need to measure how wrong the network is—which brings us to the error function.

Computing the Error (or Loss) of a Perceptron Network

Now that we have a perceptron, we need a way to measure how wrong its predictions are. This measurement is called the error or loss.

Let's train our perceptron to compute the logical OR function—the same example from the lecture. OR should output 1 if either (or both) inputs are 1, and 0 only when both inputs are 0:

x₀ | x₁ | Target output (y)
0  | 0  | 0
0  | 1  | 1
1  | 0  | 1
1  | 1  | 1

When we feed each input pattern through the perceptron, it computes ŷ = g(x₀w₀ + x₁w₁ + b). Because the sigmoid function always outputs a value between 0 and 1, our predictions ŷ are also always between 0 and 1. A prediction of 0.9 means "probably 1" while 0.1 means "probably 0."

Early in training, with random weights, the predictions will be wrong:

x₀ | x₁ | Target (y) | Prediction (ŷ) | Error (y − ŷ) | Squared error
0  | 0  | 0          | 0.3            | −0.3          | 0.09
0  | 1  | 1          | 0.4            | 0.6           | 0.36
1  | 0  | 1          | 0.5            | 0.5           | 0.25
1  | 1  | 1          | 0.6            | 0.4           | 0.16

Notice that we square the errors. This is common practice for two reasons: (1) it makes all errors positive, so they don't cancel out, and (2) it penalizes large errors more heavily than small ones.

The total error is the sum of the individual squared errors across all patterns:

E = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

For our example: E = 0.09 + 0.36 + 0.25 + 0.16 = 0.86

This single number summarizes how well (or poorly) the network is doing across all four training patterns. When we train the network, we're trying to adjust the weights to make this total error as small as possible.
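The table's arithmetic can be reproduced directly (using the same illustrative predictions):

```python
# Targets and illustrative early-training predictions for the four OR patterns
targets = [0, 1, 1, 1]
predictions = [0.3, 0.4, 0.5, 0.6]

# Sum of squared errors across all patterns
E = sum((y - y_hat) ** 2 for y, y_hat in zip(targets, predictions))
print(round(E, 2))  # 0.86
```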

Why sum all the errors?

You might wonder why we add up errors across all patterns rather than handling them one at a time. The sum gives us a single objective to optimize—one number that captures overall performance for all the patterns.

Partial Derivatives and Gradients

So far, we've talked about functions and derivatives of a single variable. But neural networks have many parameters. The perceptron example has three (w₀, w₁, and b), but useful neural networks often have millions of weights. We need to extend the idea of derivatives to functions of multiple variables.

Consider a function f(x, y) = x² + 3xy. This function depends on two variables, x and y. A partial derivative tells you how the function changes when you vary one of the inputs while holding the others fixed.

We write partial derivatives using the symbol ∂ (a curly "d"). The notation ∂f/∂x means "the partial derivative of f with respect to x"—that is, how f changes as x changes, treating all other variables as constants.

For our example:

  • ∂f/∂x = 2x + 3y — how f changes as x varies (holding y fixed)
  • ∂f/∂y = 3x — how f changes as y varies (holding x fixed)

The gradient is the vector of all partial derivatives:

∇f = (∂f/∂x, ∂f/∂y)

The symbol ∇ is called "nabla" or "del" and is used to denote the gradient. You can read ∇f as "the gradient of f" or "grad f."

For our example, ∇f = (2x + 3y, 3x).
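You can sanity-check these partial derivatives numerically by wiggling one variable at a time while holding the other fixed. A small sketch, at an arbitrarily chosen test point:

```python
# f(x, y) = x^2 + 3xy and its partial derivatives, checked numerically
def f(x, y):
    return x ** 2 + 3 * x * y

x, y, h = 1.0, 2.0, 1e-6

# Wiggle x while holding y fixed, then the reverse
df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)

print(df_dx)  # ≈ 2x + 3y = 8
print(df_dy)  # ≈ 3x = 3
```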

Geometric intuition

Think of f(x, y) as describing the height of a landscape at each point (x, y). The gradient ∇f at any point is an arrow that points in the direction of steepest ascent—the direction you'd walk if you wanted to go uphill as quickly as possible. The length of the arrow tells you how steep the slope is.

Try exploring the gradient demo below. The panel on the left shows the function in 3D; the little arrows at the bottom allow you to rotate the 3D view. In the panel on the right, move your mouse over the contour map to see the gradient vector (red arrow) at each (x, y) point. Remember that the gradient is a vector (see below), with one component for each input variable. The length of the gradient vector tells you how steep the slope is, and its direction tells you which way to go to climb fastest.

Try dragging around the circular contours in the bowl shape—notice how the gradient always points outward, up the sides of the bowl. Then try the other shapes and check your intuition about which direction the gradient should point.


Interactive gradient explorer. The left panel shows a 3D surface; the right panel shows contour lines (like a topographic map). Drag your mouse to see the gradient vector (red arrow) at each point—it always points uphill.

In a neural network, the "landscape" is the loss function—a measure of how badly the network is performing (like the squared error we just defined). The inputs to this landscape are all the weights of the network. The gradient tells us, for each weight, which direction to adjust it to reduce the loss.

Gradient Descent

Now we can put the pieces together. Gradient descent is an optimization algorithm used to train many kinds of models, including neural networks. The idea is simple:

  1. Compute the gradient of the loss function with respect to all the weights (parameters) of the model.
  2. Update each weight by taking a small step in the direction that reduces the loss.

Since the gradient points in the direction of steepest ascent, we move in the opposite direction—steepest descent—to reduce the loss. The update rule is:

wᵢ ← wᵢ − α ∂E/∂wᵢ

Here:

  • wᵢ represents one of the weights of the network
  • E is the error (or loss) function
  • α (the Greek letter "alpha") is the learning rate—a small positive number that controls the step size
  • ∂E/∂wᵢ is the partial derivative of the error with respect to weight wᵢ

So this is the "learning rule" for updating the weights. It says that we update each weight by taking a small step in the direction that reduces the error most steeply. Conceptually, this is not so different from a naive approach where you "wiggle" each weight a little bit and see if the error goes up or down. The gradient just tells us the answer mathematically—which direction to wiggle and by how much—without having to actually try both directions.

Since we update every weight in the network using this same rule, we can write all the updates at once using vector notation. If w is the vector of all weights, then:

w ← w − α∇E(w)

Here ∇E(w) is the gradient of the error—the vector of all partial derivatives (∂E/∂w₁, ∂E/∂w₂, …, ∂E/∂wₙ). This compact notation says the same thing: move each weight in the direction that reduces the error most steeply.

The learning rate

The learning rate α is typically a small number like 0.01 or 0.001. If it's too large, you might overshoot the minimum and the training process could become unstable. If it's too small, training will be very slow. Finding a good learning rate is one of the practical challenges of training neural networks.

You can think of gradient descent as placing a ball on a hilly surface and letting it roll downhill. At each step, the ball moves in the steepest downhill direction. Eventually, it settles in a valley—a point where the loss is low.

Gradient descent is iterative: you repeat the process many times, each time nudging the weights a little bit to reduce the loss. Over many iterations, the network's performance improves and you reduce the loss (usually).
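Here is a bare-bones sketch of that loop on the bowl-shaped loss E(w₁, w₂) = w₁² + w₂²; the starting point and learning rate are arbitrary choices:

```python
# Gradient descent on E(w1, w2) = w1^2 + w2^2, whose gradient is (2*w1, 2*w2)
alpha = 0.1        # learning rate
w = [2.5, 2.0]     # starting weights

for step in range(50):
    grad = [2 * w[0], 2 * w[1]]                       # points uphill
    w = [wi - alpha * gi for wi, gi in zip(w, grad)]  # step downhill

loss = w[0] ** 2 + w[1] ** 2
print(loss)  # very close to 0: the ball has rolled to the bottom
```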

Try the interactive demo below to see gradient descent in action. Click Step to advance through each phase: computing the gradient, computing the update direction, applying the step, and checking the new loss.


Interactive gradient descent explorer. Click Step to advance through each phase: computing the gradient (red arrow), computing the update direction (green arrow), applying the step, and checking the new loss. The learning rate controls how far each step moves.

Gradient Descent in Action: Learning OR

Let's see gradient descent applied to our perceptron learning the OR function. Try the interactive demo below: press Play to run gradient descent automatically, or use Step to advance one epoch at a time. Notice how the predictions start far from the targets, but after many epochs they get closer and closer to the correct values as the weights adjust and the loss decreases.

As training progresses, the weights w₀, w₁, and b are adjusted so that the predictions get closer to the targets. The squared error for each pattern shrinks, and the total error E decreases. After enough training, the perceptron outputs values close to 0 for input (0,0) and close to 1 for the other three patterns—it has learned OR.
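The entire training loop also fits in a short script. This is a sketch rather than the demo's exact code; the random seed, learning rate, and epoch count are arbitrary choices:

```python
import math
import random

def sigmoid(net):
    return 1 / (1 + math.exp(-net))

# OR: input patterns and their targets
patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

random.seed(0)
w0, w1, b = (random.uniform(-1, 1) for _ in range(3))
alpha = 0.5  # learning rate (an arbitrary but workable choice)

for epoch in range(5000):
    for (x0, x1), y in patterns:
        y_hat = sigmoid(x0 * w0 + x1 * w1 + b)
        # gradient of (y_hat - y)^2 through the sigmoid, via the chain rule
        delta = 2 * (y_hat - y) * y_hat * (1 - y_hat)
        w0 -= alpha * delta * x0
        w1 -= alpha * delta * x1
        b -= alpha * delta

for (x0, x1), y in patterns:
    y_hat = sigmoid(x0 * w0 + x1 * w1 + b)
    print(f"{x0} OR {x1}: prediction {y_hat:.2f}, target {y}")
```

After training, the predictions land close to 0 for input (0, 0) and close to 1 for the other three patterns.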


A perceptron learning the OR function. Press Play to run gradient descent automatically, or Step to advance one epoch at a time. Watch how the predictions converge toward the target values as the weights adjust.

Neural networks as "differentiable computing"

Gradient descent only works for differentiable models—models where we can compute the gradient of the loss with respect to every parameter. This constraint is so fundamental that neural networks are sometimes described as differentiable computing: the art of building complex computations entirely from operations that have well-defined derivatives, so that gradients can flow through the entire system.

This is why neural networks use smooth activation functions (sigmoid, tanh, ReLU) rather than hard thresholds, and why operations like matrix multiplication are central—they're all differentiable. When part of a model involves a discrete choice or non-differentiable operation, we can't compute the gradient and standard gradient descent won't work. Networks with hard threshold activations or discrete sampling steps require special techniques to train.

Stochastic Gradient Descent (SGD)

Our OR example had only four input-output patterns, so computing the total error over all of them is trivial. But real neural networks are often trained on millions of examples—images, sentences, audio clips, or other data. In these cases, computing the gradient over all the training data at once can be very expensive. You'd need to process every single example just to take a single step. This is slow.

Stochastic Gradient Descent (SGD) addresses this by computing the gradient on a small random subset of the data called a mini-batch. Instead of using all training examples to estimate the gradient, you use, say, 32 or 64 examples at a time. The resulting gradient estimate is noisier (less precise), but it points in roughly the right direction—and you can compute it much faster.

The word stochastic means "random" or "involving chance." In SGD, the randomness comes from randomly selecting which training examples to include in each mini-batch. Each time you compute a gradient, you're using a different random sample of the data, which is why the gradient estimates vary from step to step.

The update rule is the same as before:

wᵢ ← wᵢ − α ∂E_mini-batch/∂wᵢ

The only difference is that the gradient ∂E_mini-batch/∂wᵢ is computed from a small random sample rather than the full dataset.

Interestingly, the noise in SGD can actually be helpful. The randomness can prevent the optimizer from getting stuck in shallow local minima—small dips in the loss landscape that aren't actually very good solutions. The noise gives the optimization process a kind of "jiggle" that helps it explore and find better solutions. There is also evidence that SGD tends to find solutions that generalize better to new data, partly because of this noise.
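The idea is easy to see on a toy problem. The sketch below (arbitrary data and hyperparameters) fits a single parameter w by minimizing the average of (w − xᵢ)² with mini-batches of 32 examples:

```python
import random

random.seed(1)
# Toy "dataset": 10,000 noisy values centered near 3.0.
# Minimizing the average of (w - x_i)^2 drives w toward the data mean.
data = [3.0 + random.gauss(0, 1) for _ in range(10000)]

w, alpha, batch_size = 0.0, 0.05, 32
for step in range(2000):
    batch = random.sample(data, batch_size)              # random mini-batch
    grad = sum(2 * (w - x) for x in batch) / batch_size  # noisy estimate
    w -= alpha * grad                                    # step downhill

print(w)  # close to the dataset mean, despite the noisy gradients
```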

The demo below compares full-batch gradient descent (blue) with stochastic gradient descent (red). Both start from the same point and try to reach the minimum at the center. Notice how the full-batch path is smooth and direct, while SGD bounces around due to the noisy gradient estimates—yet both eventually reach the minimum.


Comparing full-batch gradient descent (blue) with stochastic gradient descent (red). Both start from the same point and descend toward the minimum. The full-batch path is smooth and direct, while SGD bounces around due to noisy gradient estimates—yet both reach the goal.

In modern deep learning, nearly all training uses some variant of SGD. More advanced optimizers like Adam or RMSProp build on the basic SGD idea by adapting the learning rate for each parameter, but the core principle is the same: estimate the gradient from a mini-batch, then take a step downhill.

The Limits of Perceptrons

Our perceptron successfully learned the OR function. It can also learn AND and other simple logical functions. But there's a fundamental limit to what a single perceptron can learn.

Consider the XOR (exclusive or) function. XOR outputs 1 when exactly one of the inputs is 1, and 0 otherwise:

x₀ | x₁ | XOR output (y)
0  | 0  | 0
0  | 1  | 1
1  | 0  | 1
1  | 1  | 0

If you try to train a single perceptron on XOR, it will fail. No matter how long you train or what learning rate you use, the error never goes to zero—the perceptron gets stuck outputting about 0.5 for every input.

Why? The answer becomes clear when we visualize the problem geometrically. Think of each input pattern as a point in 2D space, with x₀ on one axis and x₁ on the other. Color each point by its target output:

Linearly separable problem (OR)

For the OR function, the two classes (0s and 1s) can be separated by a single straight line. Problems like this are called linearly separable.

For OR (and AND), you can draw a single straight line that separates the 0s from the 1s. Problems like this are called linearly separable. A perceptron can solve any linearly separable problem because its decision boundary is a straight line (or hyperplane in higher dimensions).

But look at XOR:

Non-linearly separable problem (XOR)

For XOR, no single straight line can separate the 0s from the 1s—the classes are interleaved diagonally. XOR is not linearly separable.

There's no single straight line that can separate the 0s from the 1s! The two classes are interleaved diagonally. XOR is not linearly separable, and no single perceptron can learn it.

Adding Hidden Units

The solution is to add more neurons. Specifically, we add a hidden layer—neurons that sit between the inputs and the output:

A multi-layer network that can learn XOR

A network with one hidden unit (h) between the inputs and output. This hidden layer allows the network to learn XOR—a problem impossible for a single perceptron.

This network has:

  • Two input units (x₀ and x₁)
  • One hidden unit (h) that computes its own weighted sum and applies a sigmoid
  • One output unit (ŷ) that receives input from both the original inputs AND the hidden unit

The hidden unit can learn to detect a useful intermediate feature. In the case of XOR, the hidden unit essentially learns to detect when both inputs are 1—which is exactly the case where XOR should output 0 instead of 1.

Here's what the network learns:

x₀ | x₁ | Hidden unit (h) | Output (ŷ) | Target (y)
0  | 0  | ~0              | ~0.05      | 0
0  | 1  | ~0              | ~0.94      | 1
1  | 0  | ~0              | ~0.94      | 1
1  | 1  | ~1              | ~0.05      | 0

The hidden unit outputs ~1 only when both inputs are 1. The output unit then uses this information to produce the correct XOR output.
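You can verify this logic by hand-wiring the network. In the sketch below the weights are picked by hand, not learned; the hidden unit fires only when both inputs are 1, and a strong negative weight from h suppresses the output in that case:

```python
import math

def sigmoid(net):
    return 1 / (1 + math.exp(-net))

# Hand-picked (not learned) weights, chosen for illustration:
w3, w4, b1 = 10, 10, -15          # hidden unit: active only for input (1, 1)
w0, w1, w2, b0 = 10, 10, -20, -5  # output: OR-like, suppressed by h

for x0, x1 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = sigmoid(x0 * w3 + x1 * w4 + b1)
    y_hat = sigmoid(x0 * w0 + x1 * w1 + h * w2 + b0)
    print(f"{x0} XOR {x1} -> h={h:.2f}, prediction={y_hat:.2f}")
```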

The power of hidden layers

By adding hidden units, we can solve problems that are impossible for a single perceptron. Each hidden unit can learn to detect a different feature or pattern in the input. With enough hidden units, neural networks can approximate any function—this is called the universal approximation theorem. This is why modern deep learning uses networks with many layers and millions of hidden units.

Multi-layer Perceptron (MLP)

A network with one or more hidden layers is called a multi-layer perceptron (MLP). Let's write out exactly what happens when we add a hidden layer, because understanding this structure is key to understanding how we train these networks.

A multi-layer perceptron with one hidden unit

A multi-layer perceptron (MLP) with two inputs, one hidden unit, and one output. The computation flows left to right in two stages.

In this network, the computation happens in stages. Each stage is a function, and the output of one function becomes the input to the next. This is called function composition.

Stage 1: Compute the hidden unit

The hidden unit h receives the inputs x0 and x1, computes a weighted sum, and applies the sigmoid:

h = g(x₀w₃ + x₁w₄ + b₁)

Here w₃ and w₄ are the weights connecting the inputs to the hidden unit, and b₁ is the hidden unit's bias.

Stage 2: Compute the output

The output unit receives the original inputs AND the hidden unit's output, computes another weighted sum, and applies the sigmoid:

ŷ = g(x₀w₀ + x₁w₁ + hw₂ + b₀)

Here w₀, w₁, and w₂ are the weights connecting to the output, and b₀ is the output bias.

Notice what's happening: the output ŷ depends on h, and h depends on w₃ and w₄. So if we write out the full computation, we get a composition of functions:

ŷ = g(x₀w₀ + x₁w₁ + g(x₀w₃ + x₁w₄ + b₁)·w₂ + b₀)

where the inner expression g(x₀w₃ + x₁w₄ + b₁) is just h.

This looks complicated, but it's just functions nested inside functions—like g(f(x)) but with more pieces.

The Gradient Problem

For the output layer weights (w₀, w₁, w₂), computing the gradient is straightforward—we already know how to do this from our single perceptron. We just ask: "if I wiggle w₁ a little, how does the error change?"

But what about the hidden layer weights (w₃ and w₄)? These weights don't directly connect to the output. They affect the hidden unit h, which then affects the output. The error is computed at the output, but w₃ is two steps removed from it.

How do we compute ∂E/∂w₃—the gradient of the error with respect to a weight deep inside the network?

The answer lies in the chain rule from calculus, which tells us how to compute derivatives of composed functions.

The Chain Rule

The chain rule tells us how to differentiate composed functions. If y=g(f(x))—meaning we first apply f, then apply g—the derivative is:

dy/dx = (dg/df) · (df/dx)

In words: multiply the derivative of the outer function by the derivative of the inner function. We "chain" the derivatives together.

Worked example

Suppose f(x) = 3x + 1 and g(u) = u². Then:

y = g(f(x)) = (3x + 1)²

Using the chain rule:

  • The derivative of g(u) = u² with respect to u is g′(u) = 2u
  • The derivative of f(x) = 3x + 1 with respect to x is f′(x) = 3
  • By the chain rule: dy/dx = 2(3x + 1) · 3 = 6(3x + 1)

Let's check at x = 1: y = (3·1 + 1)² = 16, and the derivative is 6(3·1 + 1) = 24. This means that at x = 1, a tiny increase in x would cause y to increase about 24 times as fast.
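A quick numerical check confirms the chain-rule answer at x = 1:

```python
# y = g(f(x)) = (3x + 1)^2; the chain rule says dy/dx = 6(3x + 1)
def y(x):
    return (3 * x + 1) ** 2

h = 1e-6
numeric = (y(1 + h) - y(1 - h)) / (2 * h)  # finite-difference slope at x = 1
analytic = 6 * (3 * 1 + 1)                 # chain-rule answer: 24

print(numeric, analytic)  # both ≈ 24
```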

Applying the Chain Rule to Neural Networks

Now we can answer our question: how do we compute ∂E/∂w₃?

We break it into steps, following the chain of dependencies:

  1. The error E depends on the output ŷ
  2. The output ŷ depends on the hidden unit h
  3. The hidden unit h depends on the weight w₃

Using the chain rule:

∂E/∂w₃ = (∂E/∂h) · (∂h/∂w₃)

Each piece is something we can compute:

  • ∂h/∂w₃ is just like the single perceptron case—it involves the sigmoid derivative and the input x₀
  • ∂E/∂h requires another application of the chain rule, going through the output layer

This process of computing gradients by working backward from the error, layer by layer, is called backpropagation. Let's work through it in detail.

The Backpropagation Algorithm

Backpropagation is just the chain rule applied systematically through the network. The "back" in backpropagation refers to the fact that we start at the output (where we measure the error) and propagate the gradient information backward through each layer.

Preliminaries

Before we derive the gradients, let's review some useful notation and facts we've already learned.

The error function is the squared difference between the prediction and target:

E = (ŷ − y)² = (g(net_y) − y)²

where net_y is the weighted sum going into the output unit.

A useful property of the sigmoid function g is that its derivative has a simple form:

∂g(net)/∂net = g(net)(1 − g(net))

This is convenient because if we've already computed g(net) during the forward pass, we can compute its derivative with just a multiplication—no need to recompute the exponential.
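This identity is easy to confirm numerically; a sketch at an arbitrarily chosen point:

```python
import math

def sigmoid(net):
    return 1 / (1 + math.exp(-net))

net, h = 0.7, 1e-6
# Finite-difference derivative vs. the identity g'(net) = g(net)(1 - g(net))
numeric = (sigmoid(net + h) - sigmoid(net - h)) / (2 * h)
analytic = sigmoid(net) * (1 - sigmoid(net))
print(numeric, analytic)  # the two values agree closely
```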

Gradient for Output Layer Weights

Let's first compute the gradient for a weight in the output layer, like w₀ (which connects x₀ to the output). Using the chain rule:

∂E/∂w₀ = ∂E/∂g(net_y) · ∂g(net_y)/∂net_y · ∂net_y/∂w₀

Working through each piece:

  1. ∂E/∂g(net_y) = 2(g(net_y) − y) = 2(ŷ − y) — how the error changes with the output
  2. ∂g(net_y)/∂net_y = g(net_y)(1 − g(net_y)) = ŷ(1 − ŷ) — the sigmoid derivative
  3. ∂net_y/∂w₀ = x₀ — the input that w₀ multiplies

Putting it together:

∂E/∂w₀ = 2(ŷ − y) · ŷ(1 − ŷ) · x₀

This is the gradient for an output layer weight. It depends on the error (ŷ − y), the sigmoid derivative, and the input x₀.

Gradient for Hidden Layer Weights

Computing the gradient for a hidden layer weight

Computing the gradient for a hidden layer weight using backpropagation. The gradient flows backward from the error through the output layer to the hidden layer.

Now comes the key insight of backpropagation. For a hidden layer weight like w₃, we use the two-step strategy:

∂E/∂w₃ = (∂E/∂h) · (∂h/∂w₃)

Step 1: How does the error change with the hidden unit's activation?

The hidden unit h affects the error only through the output. So we apply the chain rule again:

∂E/∂h = ∂E/∂g(net_y) · ∂g(net_y)/∂net_y · ∂net_y/∂h

The first two terms are the same as before. The third term is:

∂net_y/∂h = w₂

because net_y = x₀w₀ + x₁w₁ + hw₂ + b₀, so the derivative with respect to h is just w₂.

Putting it together:

∂E/∂h = 2(ŷ − y) · ŷ(1 − ŷ) · w₂

Notice that w₂ appears here—the weight connecting the hidden unit to the output. This makes intuitive sense: if that connection is strong, changes in the hidden unit have a big effect on the error.

Step 2: How does the hidden unit's activation change with the weight?

This part is just like the single perceptron case:

∂h/∂w₃ = ∂g(net_h)/∂net_h · ∂net_h/∂w₃ = h(1 − h) · x₀

Putting it all together:

∂E/∂w₃ = [2(ŷ − y) · ŷ(1 − ŷ) · w₂] · [h(1 − h) · x₀]

where the first bracket is Step 1 (∂E/∂h) and the second bracket is Step 2 (∂h/∂w₃).
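As a check, this formula can be compared against a finite-difference estimate of the same gradient. The input, target, and weight values below are arbitrary; what matters is that the two estimates agree:

```python
import math

def sigmoid(net):
    return 1 / (1 + math.exp(-net))

# Arbitrary inputs, target, and weights for the two-layer network
x0, x1, y = 1.0, 0.0, 1.0
w0, w1, w2, b0 = 0.2, -0.4, 0.7, 0.1   # output layer
w3, w4, b1 = 0.5, -0.3, 0.05           # hidden layer

def error(w3_value):
    h = sigmoid(x0 * w3_value + x1 * w4 + b1)
    y_hat = sigmoid(x0 * w0 + x1 * w1 + h * w2 + b0)
    return (y_hat - y) ** 2

# Backprop formula: dE/dw3 = 2(y_hat - y) * y_hat(1 - y_hat) * w2 * h(1 - h) * x0
h = sigmoid(x0 * w3 + x1 * w4 + b1)
y_hat = sigmoid(x0 * w0 + x1 * w1 + h * w2 + b0)
backprop = 2 * (y_hat - y) * y_hat * (1 - y_hat) * w2 * h * (1 - h) * x0

eps = 1e-6
numeric = (error(w3 + eps) - error(w3 - eps)) / (2 * eps)
print(backprop, numeric)  # the two gradient estimates match
```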

The Pattern

Notice the pattern: to compute the gradient for a weight deep in the network, we multiply together derivatives along the path from the error back to that weight. Each layer contributes:

  • A sigmoid derivative term
  • The weight connecting to the next layer (for hidden layers)
  • The input to that weight

This is why it's called "back"-propagation: we start at the output error and work backward, accumulating these terms as we go.

Once we have all the gradients, we update every weight using gradient descent:

wᵢ ← wᵢ − α ∂E/∂wᵢ

where α is the learning rate.

Vectors, Matrices, and Matrix Multiplication

To understand the mechanics of neural networks, we need a bit of linear algebra—specifically, vectors, matrices, and matrix multiplication.

A vector is an ordered list of numbers. For example:

x = (2, 5, −1)

This is a vector with three elements. Geometrically, you can think of a vector as an arrow pointing to a location in space—the vector above points to the position (2, 5, −1) in 3D space. The length of the vector (how far that point is from the origin) and its direction are often what matter.

In a neural network, the inputs to a layer are typically represented as a vector. For example, if a layer receives three input values, those values form a 3-dimensional vector. An image with 784 pixels becomes a 784-dimensional vector.

A matrix is a rectangular grid of numbers. For example:

W = ( 1  0  −2 )
    ( 3  1   4 )

This is a 2×3 matrix (2 rows, 3 columns). In a neural network, the weights connecting one layer to the next are stored in a matrix.

Matrix-vector multiplication is the key operation. When you multiply a matrix W by a vector x, you get a new vector y:

y=Wx

Each element of y is computed by taking a dot product—multiplying corresponding elements and adding them up:

y₁ = (1)(2) + (0)(5) + (−2)(−1) = 2 + 0 + 2 = 4
y₂ = (3)(2) + (1)(5) + (4)(−1) = 6 + 5 − 4 = 7

So y = (4, 7).

In Python with PyTorch, this is just a single operation:

python
import torch

# Define the weight matrix (2 rows, 3 columns)
W = torch.tensor([[1., 0., -2.],
                  [3., 1.,  4.]])

# Define the input vector (3 elements)
x = torch.tensor([2., 5., -1.])

# Matrix-vector multiplication
y = W @ x  # or torch.matmul(W, x)

print(y)  # tensor([4., 7.])

Why does this matter for neural networks?

A single layer of a neural network does exactly this operation! Given an input vector x, the layer computes:

y=σ(Wx+b)

where W is the weight matrix, b is a bias vector (an offset added to each output), and σ is a nonlinear activation function (like ReLU or sigmoid) applied to each element. The matrix multiplication Wx computes a weighted combination of all inputs for each output neuron. The figure below shows how we can reconceptualize the connections in a network as matrix multiplications:

A fully connected layer as matrix multiplication

A neural network layer reinterpreted as matrix multiplication. Each row of the weight matrix W contains the weights for one output neuron.
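Extending the earlier PyTorch snippet, the full layer computation σ(Wx + b) takes one more line (the bias values here are arbitrary):

```python
import torch

W = torch.tensor([[1., 0., -2.],
                  [3., 1.,  4.]])
x = torch.tensor([2., 5., -1.])
b = torch.tensor([0.5, -1.0])   # arbitrary bias vector

# One layer: a weighted sum for every output neuron, plus bias, then sigmoid
y = torch.sigmoid(W @ x + b)
print(y)  # elementwise sigmoid of (4.5, 6.0)
```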

This matrix formulation is not just a convenient notation—it's the reason neural networks can be trained efficiently on modern hardware. GPUs (graphics processing units) are specifically designed to perform many matrix multiplications in parallel at tremendous speed. By expressing neural network operations as matrix math, we can take full advantage of this hardware.

Putting It All Together

Let's tie everything together by sketching how a neural network uses all of these concepts.

A neural network is fundamentally a composition of nonlinear functions. Each layer takes its input, multiplies by a weight matrix, adds a bias, and applies a nonlinear activation function. The output of one layer becomes the input to the next:

$$h_1 = \sigma(W_1 x + b_1)$$
$$h_2 = \sigma(W_2 h_1 + b_2)$$
$$\hat{y} = W_3 h_2 + b_3$$

Here $x$ is the input, $h_1$ and $h_2$ are the hidden-layer activations, and $\hat{y}$ is the network's output (prediction).

Why nonlinearity matters

A function is linear if it satisfies $f(ax + by) = a f(x) + b f(y)$—scaling and adding inputs just scales and adds outputs. Linear functions can only represent straight lines (or flat planes in higher dimensions).

Here's the key insight: if you stack linear functions on top of each other, you just get another linear function. No matter how many layers you add, the whole network collapses to a single linear transformation. For example, if layer 1 computes $W_1 x$ and layer 2 computes $W_2(W_1 x)$, this equals $(W_2 W_1)x$—just one matrix multiplication!

Nonlinear activation functions like the sigmoid break this collapse. Because $\sigma(W_2\,\sigma(W_1 x)) \neq \sigma((W_2 W_1) x)$, each layer genuinely adds representational power. This is what allows deep networks to learn complex, curved decision boundaries and approximate virtually any function—not just straight lines.
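You can check the collapse numerically. With random matrices (the shapes and seed below are arbitrary, chosen just for illustration), two stacked linear layers equal one combined layer, but inserting a sigmoid between them breaks the equality:

```python
import torch

torch.manual_seed(0)
W1 = torch.randn(4, 3)   # layer 1 weights (arbitrary shapes for illustration)
W2 = torch.randn(2, 4)   # layer 2 weights
x = torch.randn(3)

# Two stacked linear layers collapse into one matrix multiplication
two_layers = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x
print(torch.allclose(two_layers, one_layer))  # True

# A nonlinearity in between prevents the collapse
nonlinear = torch.sigmoid(W2 @ torch.sigmoid(W1 @ x))
collapsed = torch.sigmoid((W2 @ W1) @ x)
print(torch.allclose(nonlinear, collapsed))   # False
```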

Training uses a loss function. We compare the network's output $\hat{y}$ to the true answer $y$ using a loss function $L(\hat{y}, y)$. The loss is a single number that measures how wrong the network is—lower is better.
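For instance, mean squared error (one common choice, used here just for illustration) averages the squared differences between predictions and targets:

```python
import torch
import torch.nn.functional as F

y_hat = torch.tensor([0.9, 0.2, 0.8])  # network predictions (made-up values)
y = torch.tensor([1.0, 0.0, 1.0])      # true targets

# Mean squared error: ((0.1)^2 + (0.2)^2 + (0.2)^2) / 3 = 0.03
loss = F.mse_loss(y_hat, y)
print(loss)  # tensor(0.0300)
```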

Backpropagation computes gradients. Using the chain rule, we compute how the loss depends on every weight in the network. Starting from the output and working backward through each layer, we compute:

$$\frac{\partial L}{\partial W_3},\quad \frac{\partial L}{\partial W_2},\quad \frac{\partial L}{\partial W_1}$$

(and similarly for the biases). This is the "back-propagation" of gradients through the chain of composed functions.
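A minimal autograd sketch (not the full network, just a single weight) shows this in action: for $L = (wx - y)^2$, the chain rule gives $\frac{dL}{dw} = 2(wx - y)\,x$, and PyTorch computes exactly that:

```python
import torch

w = torch.tensor(3.0, requires_grad=True)  # one weight (illustrative value)
x, y = 2.0, 5.0                            # input and target

y_hat = w * x                # forward pass: prediction = 6.0
loss = (y_hat - y) ** 2      # squared-error loss = 1.0

loss.backward()              # backward pass: chain rule, applied automatically
print(w.grad)                # dL/dw = 2*(6-5)*2 = 4, so tensor(4.)
```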

SGD (stochastic gradient descent) updates the weights. Using these gradients, we update every weight in the network:

$$W_i \leftarrow W_i - \eta \frac{\partial L}{\partial W_i}$$

This process repeats over many mini-batches. Gradually the network's predictions improve and the loss decreases.
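One update step can be sketched in a few lines (the learning rate, weights, and gradient values here are made up for illustration):

```python
import torch

eta = 0.1                            # learning rate (illustrative value)
W = torch.tensor([[1.0, -2.0]])      # a weight matrix
grad = torch.tensor([[0.5, -1.0]])   # gradient of the loss w.r.t. W

# Gradient descent step: move each weight against its gradient
W = W - eta * grad
print(W)  # tensor([[ 0.9500, -1.9000]])
```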

That's the core loop of neural network training: forward pass (compute the output), compute the loss, backward pass (compute gradients via the chain rule), and update weights (via SGD). All of the math we covered in this chapter—derivatives, the chain rule, gradients, gradient descent, and matrix multiplication—comes together in this loop.

In PyTorch

In practice, we don't compute these gradients by hand. Modern deep learning frameworks like PyTorch do it automatically. Here's all the code you need to define and train our XOR network:

python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.i2h = nn.Linear(2, 1)    # input to hidden (w3, w4)
        self.all2o = nn.Linear(3, 1)  # inputs + hidden to output (w0, w1, w2)

    def forward(self, x):
        # Stage 1: compute hidden unit
        netinput_h = self.i2h(x)
        h = torch.sigmoid(netinput_h)

        # Stage 2: compute output
        x2 = torch.cat((x, h))        # combine inputs with hidden
        netinput_y = self.all2o(x2)
        out = torch.sigmoid(netinput_y)
        return out

# Training step
def update(pattern, target, net, optimizer):
    optimizer.zero_grad()             # reset gradients
    output = net(pattern)             # forward pass
    loss = F.mse_loss(output, target) # compute error
    loss.backward()                   # backward pass (computes all gradients!)
    optimizer.step()                  # update weights with gradient descent

The key line is loss.backward()—this single call runs the entire backpropagation algorithm, computing $\partial L/\partial w$ for every weight in the network. PyTorch keeps track of all the operations during the forward pass and automatically applies the chain rule in reverse.

You now know what's happening under the hood when that line runs!
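To actually run training, you instantiate the network and an optimizer and repeatedly call update on the four XOR patterns. Here is a self-contained sketch (it repeats the Net and update definitions from above; the seed, learning rate, and epoch count are illustrative choices, not values from the text):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Net(nn.Module):                 # same definition as above
    def __init__(self):
        super().__init__()
        self.i2h = nn.Linear(2, 1)    # input to hidden
        self.all2o = nn.Linear(3, 1)  # inputs + hidden to output

    def forward(self, x):
        h = torch.sigmoid(self.i2h(x))
        x2 = torch.cat((x, h))        # combine inputs with hidden
        return torch.sigmoid(self.all2o(x2))

def update(pattern, target, net, optimizer):   # same training step as above
    optimizer.zero_grad()
    output = net(pattern)
    loss = F.mse_loss(output, target)
    loss.backward()
    optimizer.step()

torch.manual_seed(0)
net = Net()
optimizer = optim.SGD(net.parameters(), lr=0.5)  # illustrative learning rate

# The four XOR patterns and their targets
patterns = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
targets = torch.tensor([[0.], [1.], [1.], [0.]])

for epoch in range(2000):
    for pattern, target in zip(patterns, targets):
        update(pattern, target, net, optimizer)

for pattern in patterns:
    print(pattern.tolist(), net(pattern).item())
```

Whether the network actually solves XOR after training depends on the initialization and hyperparameters; in practice you would monitor the loss as training proceeds.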

Learn More

If you'd like to go deeper into the math behind neural networks, here are some excellent resources: