
Math for Neural Networks DRAFT

Note

This chapter was authored by Todd M. Gureckis. See the license information for this book. Because this is a draft chapter, it should not be considered complete or accurate!

Introduction

Neural networks are one of the most exciting and impactful ideas in cognitive science and artificial intelligence. However, understanding how they work requires some mathematical background that might not be shared by all students.

The goal of these notes is to build up the key ideas you need with an emphasis on intuition rather than formal rigor. We want you to understand why these math concepts matter for neural networks.

The chapter follows a natural progression: we start with calculus fundamentals—derivatives—that let us measure how changing inputs affects outputs. We then introduce the perceptron (the simplest neural network) and show how to measure its errors with a loss function. With a concrete model in hand, we extend derivatives to gradients (for functions with many inputs) and introduce gradient descent, the core optimization strategy for training. Along the way, we'll see why single-layer networks have fundamental limits and how multi-layer networks overcome them. The chain rule and backpropagation show how gradients flow backward through layers of computation. Finally, linear algebra (vectors and matrices) reveals how all of this scales efficiently to large networks.

Let's get started.

Functions and Derivatives

A function is simply a rule that takes an input and produces an output. For example, the function f(x) = x² takes a number x and returns its square. If x = 3, then f(3) = 9.

Explore different functions by selecting them from the dropdown. Notice how each function maps inputs to outputs in a different way.

In the context of neural networks, functions are everywhere. Each layer of a neural network applies a function to its inputs to produce outputs. The whole network is one big function that maps inputs (like pixel values of an image) to outputs (like the probability that the image contains a cat).

A derivative tells you how much the output of a function changes when you make a tiny change to the input. More precisely, the derivative of f(x) at a particular point x is the slope of the function at that point—the rate at which f is increasing or decreasing.

Derivative intuition

Imagine you are hiking on a hill and your altitude is given by the function f(x), where x is your horizontal position. The derivative f′(x) tells you how steep the hill is at your current location. If f′(x) > 0, you're going uphill. If f′(x) < 0, you're going downhill. If f′(x) = 0, you're on flat ground (perhaps at the top or bottom of the hill).

For the function f(x) = x², the derivative is f′(x) = 2x. This means:

  • At x = 3, the slope is f′(3) = 6—the function is increasing steeply.
  • At x = 0, the slope is f′(0) = 0—the function is flat at its minimum.
  • At x = −2, the slope is f′(−2) = −4—the function is decreasing.
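These slope values are easy to check numerically. A minimal Python sketch, using a central-difference approximation with a small (arbitrarily chosen) step h:

```python
# Approximate f'(x) for f(x) = x^2 with a central difference:
# f'(x) ≈ (f(x + h) - f(x - h)) / (2h) for a small step h.
def f(x):
    return x ** 2

def slope(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

print(slope(f, 3))   # ≈ 6: increasing steeply
print(slope(f, 0))   # ≈ 0: flat at the minimum
print(slope(f, -2))  # ≈ -4: decreasing
```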

You can explore this intuition below. Drag your mouse over the curve to see the slope at that point. The formula for the derivative is also displayed, showing how it is computed.


Interactive derivative explorer. Drag your mouse over the curve to see the slope (derivative) at each point. The tangent line shows the direction the function is heading.

Why do derivatives matter for neural networks? Because training a neural network means finding the settings (called weights) that make the network perform well. To find good weights, we need to know how changing a weight affects the network's performance. That's exactly what a derivative tells us.

Let's make this concrete by introducing the simplest neural network.

The Perceptron

The perceptron is the simplest possible neural network. It takes some inputs, applies a set of weights and a bias, and produces a single output.

Here's what it looks like with two inputs:

A perceptron with two inputs

A perceptron with two inputs (x₀ and x₁), two weights (w₀ and w₁), a bias (b), and one output (ŷ). The perceptron computes a weighted sum of its inputs, adds the bias, and passes the result through an activation function.

The perceptron computes its output in two steps:

  1. Weighted sum: Multiply each input by its corresponding weight, add them up, and add the bias:

    net = x₀w₀ + x₁w₁ + b = Σᵢ xᵢwᵢ + b
  2. Activation function: Pass this sum through an activation function g to produce the output:

    ŷ = g(net)

The activation function we'll use is the sigmoid (also called the logistic function):

g(net) = 1 / (1 + e^(−net))

Why use an activation function?

The sigmoid squashes any input into a value between 0 and 1. This is useful for several reasons:

  • It lets us interpret the output as a probability
  • It introduces nonlinearity, which is essential for learning complex patterns
  • It's differentiable everywhere, which we need for gradient descent

The sigmoid isn't the only activation function. The original perceptron used a simple threshold (or step) function: output 1 if the weighted sum exceeds a threshold, otherwise output 0. Other common choices include ReLU (rectified linear unit) and tanh:

Common activation functions used in neural networks. Each has different properties: sigmoid outputs values between 0 and 1, tanh outputs between -1 and 1, and ReLU outputs zero for negative inputs and the input itself for positive inputs.

You can explore how a perceptron with a threshold activation function works in the interactive Neuron Sandbox.

Here's a summary of the notation:

Symbol | Meaning
x₀, x₁ | Input values
w₀, w₁ | Weights (what we learn)
b      | Bias (also learned)
g      | Activation function (sigmoid)
ŷ      | Predicted output
y      | Target output (what we want)

So the full computation is:

ŷ = g(Σᵢ xᵢwᵢ + b) = 1 / (1 + e^(−(Σᵢ xᵢwᵢ + b)))

This single equation describes everything the perceptron does. Given inputs x₀ and x₁, it computes a weighted sum, adds the bias, and passes the result through the sigmoid to produce ŷ.
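The whole computation fits in a few lines of Python. This is a minimal sketch; the weight values below are made up purely for illustration:

```python
import math

def sigmoid(net):
    # the logistic activation function g
    return 1 / (1 + math.exp(-net))

def perceptron(x0, x1, w0, w1, b):
    net = x0 * w0 + x1 * w1 + b   # weighted sum plus bias
    return sigmoid(net)           # squash into (0, 1)

# Example: with these made-up weights, input (1, 0) gives net = 1
y_hat = perceptron(1, 0, w0=2.0, w1=2.0, b=-1.0)
print(y_hat)  # sigmoid(1) ≈ 0.731
```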

The goal of training is to find values for w₀, w₁, and b that make ŷ close to the target y for all our training examples. To do that, we need to measure how wrong the network is—which brings us to the error function.

Computing the Error (or Loss) of a Perceptron Network

Now that we have a perceptron, we need a way to measure how wrong its predictions are. This measurement is called the error or loss.

Let's train our perceptron to compute the logical OR function—the same example from the lecture. OR should output 1 if either (or both) inputs are 1, and 0 only when both inputs are 0:

x₀ | x₁ | Target output (y)
0  | 0  | 0
0  | 1  | 1
1  | 0  | 1
1  | 1  | 1

When we feed each input pattern through the perceptron, it computes ŷ = g(x₀w₀ + x₁w₁ + b). Because the sigmoid function always outputs a value between 0 and 1, our predictions ŷ are also always between 0 and 1. A prediction of 0.9 means "probably 1" while 0.1 means "probably 0."

Early in training, with random weights, the predictions will be wrong:

x₀ | x₁ | Target (y) | Prediction (ŷ) | Error (y − ŷ) | Squared error
0  | 0  | 0          | 0.3            | −0.3          | 0.09
0  | 1  | 1          | 0.4            | 0.6           | 0.36
1  | 0  | 1          | 0.5            | 0.5           | 0.25
1  | 1  | 1          | 0.6            | 0.4           | 0.16

Notice that we square the errors. This is common practice for two reasons: (1) it makes all errors positive, so they don't cancel out, and (2) it penalizes large errors more heavily than small ones.

The total error is the sum of the individual squared errors across all patterns:

E = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

For our example: E = 0.09 + 0.36 + 0.25 + 0.16 = 0.86

This single number summarizes how well (or poorly) the network is doing across all four training patterns. When we train the network, we're trying to adjust the weights to make this total error as small as possible.
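The table's arithmetic can be reproduced directly (using the same illustrative predictions):

```python
# Targets and illustrative early-training predictions for the four OR patterns
targets = [0, 1, 1, 1]
predictions = [0.3, 0.4, 0.5, 0.6]

# Sum of squared errors across all patterns
E = sum((y - y_hat) ** 2 for y, y_hat in zip(targets, predictions))
print(round(E, 2))  # 0.86
```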

Why sum all the errors?

You might wonder why we add up errors across all patterns rather than handling them one at a time. The sum gives us a single objective to optimize—one number that captures overall performance for all the patterns.

Partial Derivatives and Gradients

So far, we've talked about functions and derivatives of a single variable. But neural networks have many parameters. The perceptron example has three (w₀, w₁, and b), but useful neural networks often have millions of weights. We need to extend the idea of derivatives to functions of multiple variables.

Consider a function f(x, y) = x² + 3xy. This function depends on two variables, x and y. A partial derivative tells you how the function changes when you vary one of the inputs while holding the others fixed.

We write partial derivatives using the symbol ∂ (a curly "d"). The notation ∂f/∂x means "the partial derivative of f with respect to x"—that is, how f changes as x changes, treating all other variables as constants.

For our example:

  • ∂f/∂x = 2x + 3y — how f changes as x varies (holding y fixed)
  • ∂f/∂y = 3x — how f changes as y varies (holding x fixed)

The gradient is the vector of all partial derivatives:

∇f = (∂f/∂x, ∂f/∂y)

The symbol ∇ is called "nabla" or "del" and is used to denote the gradient. You can read ∇f as "the gradient of f" or "grad f."

For our example, ∇f = (2x + 3y, 3x).
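You can sanity-check these partial derivatives numerically by wiggling one variable at a time while holding the other fixed. A small sketch, at an arbitrarily chosen test point:

```python
# f(x, y) = x^2 + 3xy and its partial derivatives, checked numerically
def f(x, y):
    return x ** 2 + 3 * x * y

x, y, h = 1.0, 2.0, 1e-6

# Wiggle x while holding y fixed, then the reverse
df_dx = (f(x + h, y) - f(x - h, y)) / (2 * h)
df_dy = (f(x, y + h) - f(x, y - h)) / (2 * h)

print(df_dx)  # ≈ 2x + 3y = 8
print(df_dy)  # ≈ 3x = 3
```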

Geometric intuition

Think of f(x, y) as describing the height of a landscape at each point (x, y). The gradient ∇f at any point is an arrow that points in the direction of steepest ascent—the direction you'd walk if you wanted to go uphill as quickly as possible. The length of the arrow tells you how steep the slope is.

Try exploring the gradient demo below. The panel on the left shows the function in 3D; the little arrows at the bottom allow you to rotate the 3D view. In the panel on the right, move your mouse over the contour map to see the gradient vector (red arrow) at each (x, y) point. Remember that the gradient is a vector (see below), with one component for each input variable. The length of the gradient vector tells you how steep the slope is, and its direction tells you which way to go to climb fastest.

Try dragging around the circular contours in the bowl shape—notice how the gradient always points outward, up the sides of the bowl. Then try the other shapes and check your intuition about which direction the gradient should point.


Interactive gradient explorer. The left panel shows a 3D surface; the right panel shows contour lines (like a topographic map). Drag your mouse to see the gradient vector (red arrow) at each point—it always points uphill.

In a neural network, the "landscape" is the loss function—a measure of how badly the network is performing (like the squared error we just defined). The inputs to this landscape are all the weights of the network. The gradient tells us, for each weight, which direction to adjust it to reduce the loss.

Gradient Descent

Now we can put the pieces together. Gradient descent is an optimization algorithm used to train many kinds of models, including neural networks. The idea is simple:

  1. Compute the gradient of the loss function with respect to all the weights (parameters) of the model.
  2. Update each weight by taking a small step in the direction that reduces the loss.

Since the gradient points in the direction of steepest ascent, we move in the opposite direction—steepest descent—to reduce the loss. The update rule is:

wᵢ ← wᵢ − α ∂E/∂wᵢ

Here:

  • wᵢ represents one of the weights of the network
  • E is the error (or loss) function
  • α (the Greek letter "alpha") is the learning rate—a small positive number that controls the step size
  • ∂E/∂wᵢ is the partial derivative of the error with respect to weight wᵢ

So this is the "learning rule" for updating the weights. It says that we update each weight by taking a small step in the direction that reduces the error most steeply. Conceptually, this is not so different from a naive approach where you "wiggle" each weight a little bit and see if the error goes up or down. The gradient just tells us the answer mathematically—which direction to wiggle and by how much—without having to actually try both directions.

Since we update every weight in the network using this same rule, we can write all the updates at once using vector notation. If w is the vector of all weights, then:

w ← w − α∇E(w)

Here ∇E(w) is the gradient of the error—the vector of all partial derivatives (∂E/∂w₁, ∂E/∂w₂, …, ∂E/∂wₙ). This compact notation says the same thing: move each weight in the direction that reduces the error most steeply.

The learning rate

The learning rate α is typically a small number like 0.01 or 0.001. If it's too large, you might overshoot the minimum and the training process could become unstable. If it's too small, training will be very slow. Finding a good learning rate is one of the practical challenges of training neural networks.

You can think of gradient descent as placing a ball on a hilly surface and letting it roll downhill. At each step, the ball moves in the steepest downhill direction. Eventually, it settles in a valley—a point where the loss is low.

Gradient descent is iterative: you repeat the process many times, each time nudging the weights a little bit to reduce the loss. Over many iterations, the network's performance improves and you reduce the loss (usually).
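Here is a bare-bones sketch of that loop on the bowl-shaped loss E(w₁, w₂) = w₁² + w₂²; the starting point and learning rate are arbitrary choices:

```python
# Gradient descent on E(w1, w2) = w1^2 + w2^2, whose gradient is (2*w1, 2*w2)
alpha = 0.1        # learning rate
w = [2.5, 2.0]     # starting weights

for step in range(50):
    grad = [2 * w[0], 2 * w[1]]                       # points uphill
    w = [wi - alpha * gi for wi, gi in zip(w, grad)]  # step downhill

loss = w[0] ** 2 + w[1] ** 2
print(loss)  # very close to 0: the ball has rolled to the bottom
```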

Try the interactive demo below to see gradient descent in action. Click Step to advance through each phase: computing the gradient, computing the update direction, applying the step, and checking the new loss.


Interactive gradient descent explorer. Click Step to advance through each phase: computing the gradient (red arrow), computing the update direction (green arrow), applying the step, and checking the new loss. The learning rate controls how far each step moves.

Gradient Descent in Action: Learning OR

Let's see gradient descent applied to our perceptron learning the OR function. Try the interactive demo below: press Play to run gradient descent automatically, or use Step to advance one epoch at a time. Notice how the predictions start far from the targets, but after many epochs they get closer and closer to the correct values as the weights adjust and the loss decreases.

As training progresses, the weights w₀, w₁, and b are adjusted so that the predictions get closer to the targets. The squared error for each pattern shrinks, and the total error E decreases. After enough training, the perceptron outputs values close to 0 for input (0,0) and close to 1 for the other three patterns—it has learned OR.
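The entire training loop also fits in a short script. This is a sketch rather than the demo's exact code; the random seed, learning rate, and epoch count are arbitrary choices:

```python
import math
import random

def sigmoid(net):
    return 1 / (1 + math.exp(-net))

# OR: input patterns and their targets
patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

random.seed(0)
w0, w1, b = (random.uniform(-1, 1) for _ in range(3))
alpha = 0.5  # learning rate (an arbitrary but workable choice)

for epoch in range(5000):
    for (x0, x1), y in patterns:
        y_hat = sigmoid(x0 * w0 + x1 * w1 + b)
        # gradient of (y_hat - y)^2 through the sigmoid, via the chain rule
        delta = 2 * (y_hat - y) * y_hat * (1 - y_hat)
        w0 -= alpha * delta * x0
        w1 -= alpha * delta * x1
        b -= alpha * delta

for (x0, x1), y in patterns:
    y_hat = sigmoid(x0 * w0 + x1 * w1 + b)
    print(f"{x0} OR {x1}: prediction {y_hat:.2f}, target {y}")
```

After training, the predictions land close to 0 for input (0, 0) and close to 1 for the other three patterns.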


A perceptron learning the OR function. Press Play to run gradient descent automatically, or Step to advance one epoch at a time. Watch how the predictions converge toward the target values as the weights adjust.

Neural networks as "differentiable computing"

Gradient descent only works for differentiable models—models where we can compute the gradient of the loss with respect to every parameter. This constraint is so fundamental that neural networks are sometimes described as differentiable computing: the art of building complex computations entirely from operations that have well-defined derivatives, so that gradients can flow through the entire system.

This is why neural networks use smooth activation functions (sigmoid, tanh, ReLU) rather than hard thresholds, and why operations like matrix multiplication are central—they're all differentiable. When part of a model involves a discrete choice or non-differentiable operation, we can't compute the gradient and standard gradient descent won't work. Networks with hard threshold activations or discrete sampling steps require special techniques to train.

Stochastic Gradient Descent (SGD)

Our OR example had only four input-output patterns, so computing the total error over all of them is trivial. But real neural networks are often trained on millions of examples—images, sentences, audio clips, or other data. In these cases, computing the gradient over all the training data at once can be very expensive. You'd need to process every single example just to take a single step. This is slow.

Stochastic Gradient Descent (SGD) addresses this by computing the gradient on a small random subset of the data called a mini-batch. Instead of using all training examples to estimate the gradient, you use, say, 32 or 64 examples at a time. The resulting gradient estimate is noisier (less precise), but it points in roughly the right direction—and you can compute it much faster.

The word stochastic means "random" or "involving chance." In SGD, the randomness comes from randomly selecting which training examples to include in each mini-batch. Each time you compute a gradient, you're using a different random sample of the data, which is why the gradient estimates vary from step to step.

The update rule is the same as before:

wᵢ ← wᵢ − α ∂E_mini-batch/∂wᵢ

The only difference is that the gradient ∂E_mini-batch/∂wᵢ is computed from a small random sample rather than the full dataset.

Interestingly, the noise in SGD can actually be helpful. The randomness can prevent the optimizer from getting stuck in shallow local minima—small dips in the loss landscape that aren't actually very good solutions. The noise gives the optimization process a kind of "jiggle" that helps it explore and find better solutions. There is also evidence that SGD tends to find solutions that generalize better to new data, partly because of this noise.
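The idea is easy to see on a toy problem. The sketch below (arbitrary data and hyperparameters) fits a single parameter w by minimizing the average of (w − xᵢ)² with mini-batches of 32 examples:

```python
import random

random.seed(1)
# Toy "dataset": 10,000 noisy values centered near 3.0.
# Minimizing the average of (w - x_i)^2 drives w toward the data mean.
data = [3.0 + random.gauss(0, 1) for _ in range(10000)]

w, alpha, batch_size = 0.0, 0.05, 32
for step in range(2000):
    batch = random.sample(data, batch_size)              # random mini-batch
    grad = sum(2 * (w - x) for x in batch) / batch_size  # noisy estimate
    w -= alpha * grad                                    # step downhill

print(w)  # close to the dataset mean, despite the noisy gradients
```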

The demo below compares full-batch gradient descent (blue) with stochastic gradient descent (red). Both start from the same point and try to reach the minimum at the center. Notice how the full-batch path is smooth and direct, while SGD bounces around due to the noisy gradient estimates—yet both eventually reach the minimum.


Comparing full-batch gradient descent (blue) with stochastic gradient descent (red). Both start from the same point and descend toward the minimum. The full-batch path is smooth and direct, while SGD bounces around due to noisy gradient estimates—yet both reach the goal.

In modern deep learning, nearly all training uses some variant of SGD. More advanced optimizers like Adam or RMSProp build on the basic SGD idea by adapting the learning rate for each parameter, but the core principle is the same: estimate the gradient from a mini-batch, then take a step downhill.

The Limits of Perceptrons

Our perceptron successfully learned the OR function. It can also learn AND and other simple logical functions. But there's a fundamental limit to what a single perceptron can learn.

Consider the XOR (exclusive or) function. XOR outputs 1 when exactly one of the inputs is 1, and 0 otherwise:

x₀ | x₁ | XOR output (y)
0  | 0  | 0
0  | 1  | 1
1  | 0  | 1
1  | 1  | 0

If you try to train a single perceptron on XOR, it will fail. No matter how long you train or what learning rate you use, the error never goes to zero—the perceptron gets stuck outputting about 0.5 for every input.

Why? The answer becomes clear when we visualize the problem geometrically. Think of each input pattern as a point in 2D space, with x₀ on one axis and x₁ on the other. Color each point by its target output:

Linearly separable problem (OR)

For the OR function, the two classes (0s and 1s) can be separated by a single straight line. Problems like this are called linearly separable.

For OR (and AND), you can draw a single straight line that separates the 0s from the 1s. Problems like this are called linearly separable. A perceptron can solve any linearly separable problem because its decision boundary is a straight line (or hyperplane in higher dimensions).

But look at XOR:

Non-linearly separable problem (XOR)

For XOR, no single straight line can separate the 0s from the 1s—the classes are interleaved diagonally. XOR is not linearly separable.

There's no single straight line that can separate the 0s from the 1s! The two classes are interleaved diagonally. XOR is not linearly separable, and no single perceptron can learn it.

Adding Hidden Units

The solution is to add more neurons. Specifically, we add a hidden layer—neurons that sit between the inputs and the output:

A multi-layer network that can learn XOR

A network with one hidden unit (h) between the inputs and output. This hidden layer allows the network to learn XOR—a problem impossible for a single perceptron.

This network has:

  • Two input units (x₀ and x₁)
  • One hidden unit (h) that computes its own weighted sum and applies a sigmoid
  • One output unit (ŷ) that receives input from both the original inputs AND the hidden unit

The hidden unit can learn to detect a useful intermediate feature. In the case of XOR, the hidden unit essentially learns to detect when both inputs are 1—which is exactly the case where XOR should output 0 instead of 1.

Here's what the network learns:

x₀ | x₁ | Hidden unit (h) | Output (ŷ) | Target (y)
0  | 0  | ~0              | ~0.05      | 0
0  | 1  | ~0              | ~0.94      | 1
1  | 0  | ~0              | ~0.94      | 1
1  | 1  | ~1              | ~0.05      | 0

The hidden unit outputs ~1 only when both inputs are 1. The output unit then uses this information to produce the correct XOR output.
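You can verify this logic by hand-wiring the network. In the sketch below the weights are picked by hand, not learned; the hidden unit fires only when both inputs are 1, and a strong negative weight from h suppresses the output in that case:

```python
import math

def sigmoid(net):
    return 1 / (1 + math.exp(-net))

# Hand-picked (not learned) weights, chosen for illustration:
w3, w4, b1 = 10, 10, -15          # hidden unit: active only for input (1, 1)
w0, w1, w2, b0 = 10, 10, -20, -5  # output: OR-like, suppressed by h

for x0, x1 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = sigmoid(x0 * w3 + x1 * w4 + b1)
    y_hat = sigmoid(x0 * w0 + x1 * w1 + h * w2 + b0)
    print(f"{x0} XOR {x1} -> h={h:.2f}, prediction={y_hat:.2f}")
```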

The power of hidden layers

By adding hidden units, we can solve problems that are impossible for a single perceptron. Each hidden unit can learn to detect a different feature or pattern in the input. With enough hidden units, neural networks can approximate any function—this is called the universal approximation theorem. This is why modern deep learning uses networks with many layers and millions of hidden units.

Multi-layer Perceptron (MLP)

A network with one or more hidden layers is called a multi-layer perceptron (MLP). Let's write out exactly what happens when we add a hidden layer, because understanding this structure is key to understanding how we train these networks.

A multi-layer perceptron with one hidden unit

A multi-layer perceptron (MLP) with two inputs, one hidden unit, and one output. The computation flows left to right in two stages.

In this network, the computation happens in stages. Each stage is a function, and the output of one function becomes the input to the next. This is called function composition.

Stage 1: Compute the hidden unit

The hidden unit h receives the inputs x0 and x1, computes a weighted sum, and applies the sigmoid:

h = g(x₀w₃ + x₁w₄ + b₁)

Here w₃ and w₄ are the weights connecting the inputs to the hidden unit, and b₁ is the hidden unit's bias.

Stage 2: Compute the output

The output unit receives the original inputs AND the hidden unit's output, computes another weighted sum, and applies the sigmoid:

ŷ = g(x₀w₀ + x₁w₁ + hw₂ + b₀)

Here w₀, w₁, and w₂ are the weights connecting to the output, and b₀ is the output bias.

Notice what's happening: the output ŷ depends on h, and h depends on w₃ and w₄. So if we write out the full computation, we get a composition of functions:

ŷ = g(x₀w₀ + x₁w₁ + g(x₀w₃ + x₁w₄ + b₁)·w₂ + b₀)

where the inner expression g(x₀w₃ + x₁w₄ + b₁) is just h.

This looks complicated, but it's just functions nested inside functions—like g(f(x)) but with more pieces.

The Gradient Problem

For the output layer weights (w₀, w₁, w₂), computing the gradient is straightforward—we already know how to do this from our single perceptron. We just ask: "if I wiggle w₁ a little, how does the error change?"

But what about the hidden layer weights (w₃ and w₄)? These weights don't directly connect to the output. They affect the hidden unit h, which then affects the output. The error is computed at the output, but w₃ is two steps removed from it.

How do we compute ∂E/∂w₃—the gradient of the error with respect to a weight deep inside the network?

The answer lies in the chain rule from calculus, which tells us how to compute derivatives of composed functions.

The Chain Rule

The chain rule tells us how to differentiate composed functions. If y=g(f(x))—meaning we first apply f, then apply g—the derivative is:

dy/dx = (dg/df) · (df/dx)

In words: multiply the derivative of the outer function by the derivative of the inner function. We "chain" the derivatives together.

Worked example

Suppose f(x) = 3x + 1 and g(u) = u². Then:

y = g(f(x)) = (3x + 1)²

Using the chain rule:

  • The derivative of g(u) = u² with respect to u is g′(u) = 2u
  • The derivative of f(x) = 3x + 1 with respect to x is f′(x) = 3
  • By the chain rule: dy/dx = 2(3x + 1) · 3 = 6(3x + 1)

Let's check at x = 1: y = (3·1 + 1)² = 16, and the derivative is 6(3·1 + 1) = 24. This means that at x = 1, a tiny increase in x would cause y to increase about 24 times as fast.
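A quick numerical check confirms the chain-rule answer at x = 1:

```python
# y = g(f(x)) = (3x + 1)^2; the chain rule says dy/dx = 6(3x + 1)
def y(x):
    return (3 * x + 1) ** 2

h = 1e-6
numeric = (y(1 + h) - y(1 - h)) / (2 * h)  # finite-difference slope at x = 1
analytic = 6 * (3 * 1 + 1)                 # chain-rule answer: 24

print(numeric, analytic)  # both ≈ 24
```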

Applying the Chain Rule to Neural Networks

Now we can answer our question: how do we compute ∂E/∂w₃?

We break it into steps, following the chain of dependencies:

  1. The error E depends on the output ŷ
  2. The output ŷ depends on the hidden unit h
  3. The hidden unit h depends on the weight w₃

Using the chain rule:

∂E/∂w₃ = (∂E/∂h) · (∂h/∂w₃)

Each piece is something we can compute:

  • ∂h/∂w₃ is just like the single perceptron case—it involves the sigmoid derivative and the input x₀
  • ∂E/∂h requires another application of the chain rule, going through the output layer

This process of computing gradients by working backward from the error, layer by layer, is called backpropagation. Let's work through it in detail.

The Backpropagation Algorithm

Backpropagation is just the chain rule applied systematically through the network. The "back" in backpropagation refers to the fact that we start at the output (where we measure the error) and propagate the gradient information backward through each layer.

Preliminaries

Before we derive the gradients, let's review some useful notation and facts we've already learned.

The error function is the squared difference between the prediction and target:

E = (ŷ − y)² = (g(net_y) − y)²

where net_y is the weighted sum going into the output unit.

A useful property of the sigmoid function g is that its derivative has a simple form:

∂g(net)/∂net = g(net)(1 − g(net))

This is convenient because if we've already computed g(net) during the forward pass, we can compute its derivative with just a multiplication—no need to recompute the exponential.
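This identity is easy to confirm numerically; a sketch at an arbitrarily chosen point:

```python
import math

def sigmoid(net):
    return 1 / (1 + math.exp(-net))

net, h = 0.7, 1e-6
# Finite-difference derivative vs. the identity g'(net) = g(net)(1 - g(net))
numeric = (sigmoid(net + h) - sigmoid(net - h)) / (2 * h)
analytic = sigmoid(net) * (1 - sigmoid(net))
print(numeric, analytic)  # the two values agree closely
```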

Gradient for Output Layer Weights

Let's first compute the gradient for a weight in the output layer, like w₀ (which connects x₀ to the output). Using the chain rule:

∂E/∂w₀ = ∂E/∂g(net_y) · ∂g(net_y)/∂net_y · ∂net_y/∂w₀

Working through each piece:

  1. ∂E/∂g(net_y) = 2(g(net_y) − y) = 2(ŷ − y) — how the error changes with the output
  2. ∂g(net_y)/∂net_y = g(net_y)(1 − g(net_y)) = ŷ(1 − ŷ) — the sigmoid derivative
  3. ∂net_y/∂w₀ = x₀ — the input that w₀ multiplies

Putting it together:

∂E/∂w₀ = 2(ŷ − y) · ŷ(1 − ŷ) · x₀

This is the gradient for an output layer weight. It depends on the error (ŷ − y), the sigmoid derivative, and the input x₀.

Gradient for Hidden Layer Weights

Computing the gradient for a hidden layer weight

Computing the gradient for a hidden layer weight using backpropagation. The gradient flows backward from the error through the output layer to the hidden layer.

Now comes the key insight of backpropagation. For a hidden layer weight like w₃, we use the two-step strategy:

∂E/∂w₃ = (∂E/∂h) · (∂h/∂w₃)

Step 1: How does the error change with the hidden unit's activation?

The hidden unit h affects the error only through the output. So we apply the chain rule again:

∂E/∂h = ∂E/∂g(net_y) · ∂g(net_y)/∂net_y · ∂net_y/∂h

The first two terms are the same as before. The third term is:

∂net_y/∂h = w₂

because net_y = x₀w₀ + x₁w₁ + hw₂ + b₀, so the derivative with respect to h is just w₂.

Putting it together:

∂E/∂h = 2(ŷ − y) · ŷ(1 − ŷ) · w₂

Notice that w₂ appears here—the weight connecting the hidden unit to the output. This makes intuitive sense: if that connection is strong, changes in the hidden unit have a big effect on the error.

Step 2: How does the hidden unit's activation change with the weight?

This part is just like the single perceptron case:

∂h/∂w₃ = ∂g(net_h)/∂net_h · ∂net_h/∂w₃ = h(1 − h) · x₀

Putting it all together:

∂E/∂w₃ = [2(ŷ − y) · ŷ(1 − ŷ) · w₂] · [h(1 − h) · x₀]

where the first bracket is Step 1 (∂E/∂h) and the second bracket is Step 2 (∂h/∂w₃).
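As a check, this formula can be compared against a finite-difference estimate of the same gradient. The input, target, and weight values below are arbitrary; what matters is that the two estimates agree:

```python
import math

def sigmoid(net):
    return 1 / (1 + math.exp(-net))

# Arbitrary inputs, target, and weights for the two-layer network
x0, x1, y = 1.0, 0.0, 1.0
w0, w1, w2, b0 = 0.2, -0.4, 0.7, 0.1   # output layer
w3, w4, b1 = 0.5, -0.3, 0.05           # hidden layer

def error(w3_value):
    h = sigmoid(x0 * w3_value + x1 * w4 + b1)
    y_hat = sigmoid(x0 * w0 + x1 * w1 + h * w2 + b0)
    return (y_hat - y) ** 2

# Backprop formula: dE/dw3 = 2(y_hat - y) * y_hat(1 - y_hat) * w2 * h(1 - h) * x0
h = sigmoid(x0 * w3 + x1 * w4 + b1)
y_hat = sigmoid(x0 * w0 + x1 * w1 + h * w2 + b0)
backprop = 2 * (y_hat - y) * y_hat * (1 - y_hat) * w2 * h * (1 - h) * x0

eps = 1e-6
numeric = (error(w3 + eps) - error(w3 - eps)) / (2 * eps)
print(backprop, numeric)  # the two gradient estimates match
```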

The Pattern

Notice the pattern: to compute the gradient for a weight deep in the network, we multiply together derivatives along the path from the error back to that weight. Each layer contributes:

  • A sigmoid derivative term
  • The weight connecting to the next layer (for hidden layers)
  • The input to that weight

This is why it's called "back"-propagation: we start at the output error and work backward, accumulating these terms as we go.

Once we have all the gradients, we update every weight using gradient descent:

wᵢ ← wᵢ − α ∂E/∂wᵢ

where α is the learning rate.

Vectors, Matrices, and Matrix Multiplication

To understand the mechanics of neural networks, we need a bit of linear algebra—specifically, vectors, matrices, and matrix multiplication.

A vector is an ordered list of numbers. For example:

x = (2, 5, −1)

This is a vector with three elements. Geometrically, you can think of a vector as an arrow pointing to a location in space—the vector above points to the position (2, 5, −1) in 3D space. The length of the vector (how far that point is from the origin) and its direction are often what matter.

In a neural network, the inputs to a layer are typically represented as a vector. For example, if a layer receives three input values, those values form a 3-dimensional vector. An image with 784 pixels becomes a 784-dimensional vector.

A matrix is a rectangular grid of numbers. For example:

W = ( 1  0  −2 )
    ( 3  1   4 )

This is a 2×3 matrix (2 rows, 3 columns). In a neural network, the weights connecting one layer to the next are stored in a matrix.

Matrix-vector multiplication is the key operation. When you multiply a matrix W by a vector x, you get a new vector y:

y=Wx

Each element of y is computed by taking a dot product—multiplying corresponding elements and adding them up:

y₁ = (1)(2) + (0)(5) + (−2)(−1) = 2 + 0 + 2 = 4
y₂ = (3)(2) + (1)(5) + (4)(−1) = 6 + 5 − 4 = 7

So y = (4, 7).

In Python with PyTorch, this is just a single operation:

python
import torch

# Define the weight matrix (2 rows, 3 columns)
W = torch.tensor([[1., 0., -2.],
                  [3., 1.,  4.]])

# Define the input vector (3 elements)
x = torch.tensor([2., 5., -1.])

# Matrix-vector multiplication
y = W @ x  # or torch.matmul(W, x)

print(y)  # tensor([4., 7.])

Why does this matter for neural networks?

A single layer of a neural network does exactly this operation! Given an input vector x, the layer computes:

y=σ(Wx+b)

where W is the weight matrix, b is a bias vector (an offset added to each output), and σ is a nonlinear activation function (like ReLU or sigmoid) applied to each element. The matrix multiplication Wx computes a weighted combination of all inputs for each output neuron. The figure below shows how we can reconceptualize the connections in a network as matrix multiplications:

A fully connected layer as matrix multiplication

A neural network layer reinterpreted as matrix multiplication. Each row of the weight matrix W contains the weights for one output neuron.
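Extending the earlier PyTorch snippet, the full layer computation σ(Wx + b) takes one more line (the bias values here are arbitrary):

```python
import torch

W = torch.tensor([[1., 0., -2.],
                  [3., 1.,  4.]])
x = torch.tensor([2., 5., -1.])
b = torch.tensor([0.5, -1.0])   # arbitrary bias vector

# One layer: a weighted sum for every output neuron, plus bias, then sigmoid
y = torch.sigmoid(W @ x + b)
print(y)  # elementwise sigmoid of (4.5, 6.0)
```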

This matrix formulation is not just a convenient notation—it's the reason neural networks can be trained efficiently on modern hardware. GPUs (graphics processing units) are specifically designed to perform many matrix multiplications in parallel at tremendous speed. By expressing neural network operations as matrix math, we can take full advantage of this hardware.

Putting It All Together

Let's tie everything together by sketching how a neural network uses all of these concepts.

A neural network is fundamentally a composition of nonlinear functions. Each layer takes its input, multiplies by a weight matrix, adds a bias, and applies a nonlinear activation function. The output of one layer becomes the input to the next:

$$h_1 = \sigma(W_1 x + b_1)$$
$$h_2 = \sigma(W_2 h_1 + b_2)$$
$$\hat{y} = W_3 h_2 + b_3$$

Here $x$ is the input, $h_1$ and $h_2$ are the hidden-layer activations, and $\hat{y}$ is the network's output (prediction).

Why nonlinearity matters

A function is linear if it satisfies $f(ax + by) = a f(x) + b f(y)$—scaling and adding inputs just scales and adds outputs. Linear functions can only represent straight lines (or flat planes in higher dimensions).

Here's the key insight: if you stack linear functions on top of each other, you just get another linear function. No matter how many layers you add, the whole network collapses to a single linear transformation. For example, if layer 1 computes $W_1 x$ and layer 2 computes $W_2(W_1 x)$, this equals $(W_2 W_1)x$—just one matrix multiplication!

Nonlinear activation functions like the sigmoid break this collapse. Because $\sigma(W_2\,\sigma(W_1 x)) \neq \sigma((W_2 W_1) x)$, each layer genuinely adds representational power. This is what allows deep networks to learn complex, curved decision boundaries and approximate virtually any function—not just straight lines.
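You can check the collapse numerically. With random matrices (the shapes and seed below are arbitrary, chosen just for illustration), two stacked linear layers equal one combined layer, but inserting a sigmoid between them breaks the equality:

```python
import torch

torch.manual_seed(0)
W1 = torch.randn(4, 3)   # layer 1 weights (arbitrary shapes for illustration)
W2 = torch.randn(2, 4)   # layer 2 weights
x = torch.randn(3)

# Two stacked linear layers collapse into one matrix multiplication
two_layers = W2 @ (W1 @ x)
one_layer = (W2 @ W1) @ x
print(torch.allclose(two_layers, one_layer))  # True

# A nonlinearity in between prevents the collapse
nonlinear = torch.sigmoid(W2 @ torch.sigmoid(W1 @ x))
collapsed = torch.sigmoid((W2 @ W1) @ x)
print(torch.allclose(nonlinear, collapsed))   # False
```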

Training uses a loss function. We compare the network's output $\hat{y}$ to the true answer $y$ using a loss function $L(\hat{y}, y)$. The loss is a single number that measures how wrong the network is—lower is better.
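For instance, mean squared error (one common choice, used here just for illustration) averages the squared differences between predictions and targets:

```python
import torch
import torch.nn.functional as F

y_hat = torch.tensor([0.9, 0.2, 0.8])  # network predictions (made-up values)
y = torch.tensor([1.0, 0.0, 1.0])      # true targets

# Mean squared error: ((0.1)^2 + (0.2)^2 + (0.2)^2) / 3 = 0.03
loss = F.mse_loss(y_hat, y)
print(loss)  # tensor(0.0300)
```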

Backpropagation computes gradients. Using the chain rule, we compute how the loss depends on every weight in the network. Starting from the output and working backward through each layer, we compute:

$$\frac{\partial L}{\partial W_3},\quad \frac{\partial L}{\partial W_2},\quad \frac{\partial L}{\partial W_1}$$

(and similarly for the biases). This is the "back-propagation" of gradients through the chain of composed functions.
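A minimal autograd sketch (not the full network, just a single weight) shows this in action: for $L = (wx - y)^2$, the chain rule gives $\frac{dL}{dw} = 2(wx - y)\,x$, and PyTorch computes exactly that:

```python
import torch

w = torch.tensor(3.0, requires_grad=True)  # one weight (illustrative value)
x, y = 2.0, 5.0                            # input and target

y_hat = w * x                # forward pass: prediction = 6.0
loss = (y_hat - y) ** 2      # squared-error loss = 1.0

loss.backward()              # backward pass: chain rule, applied automatically
print(w.grad)                # dL/dw = 2*(6-5)*2 = 4, so tensor(4.)
```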

SGD (stochastic gradient descent) updates the weights. Using these gradients, we update every weight in the network:

$$W_i \leftarrow W_i - \eta \frac{\partial L}{\partial W_i}$$

This process repeats over many mini-batches. Gradually the network's predictions improve and the loss decreases.
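One update step can be sketched in a few lines (the learning rate, weights, and gradient values here are made up for illustration):

```python
import torch

eta = 0.1                            # learning rate (illustrative value)
W = torch.tensor([[1.0, -2.0]])      # a weight matrix
grad = torch.tensor([[0.5, -1.0]])   # gradient of the loss w.r.t. W

# Gradient descent step: move each weight against its gradient
W = W - eta * grad
print(W)  # tensor([[ 0.9500, -1.9000]])
```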

That's the core loop of neural network training: forward pass (compute the output), compute the loss, backward pass (compute gradients via the chain rule), and update weights (via SGD). All of the math we covered in this chapter—derivatives, the chain rule, gradients, gradient descent, and matrix multiplication—comes together in this loop.

In PyTorch

In practice, we don't compute these gradients by hand. Modern deep learning frameworks like PyTorch do it automatically. Here's all the code you need to define and train our XOR network:

python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.i2h = nn.Linear(2, 1)    # input to hidden (w3, w4)
        self.all2o = nn.Linear(3, 1)  # inputs + hidden to output (w0, w1, w2)

    def forward(self, x):
        # Stage 1: compute hidden unit
        netinput_h = self.i2h(x)
        h = torch.sigmoid(netinput_h)

        # Stage 2: compute output
        x2 = torch.cat((x, h))        # combine inputs with hidden
        netinput_y = self.all2o(x2)
        out = torch.sigmoid(netinput_y)
        return out

# Training step
def update(pattern, target, net, optimizer):
    optimizer.zero_grad()             # reset gradients
    output = net(pattern)             # forward pass
    loss = F.mse_loss(output, target) # compute error
    loss.backward()                   # backward pass (computes all gradients!)
    optimizer.step()                  # update weights with gradient descent

The key line is loss.backward()—this single call runs the entire backpropagation algorithm, computing $\partial L/\partial w$ for every weight in the network. PyTorch keeps track of all the operations during the forward pass and automatically applies the chain rule in reverse.

You now know what's happening under the hood when that line runs!
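To actually run training, you instantiate the network and an optimizer and repeatedly call update on the four XOR patterns. Here is a self-contained sketch (it repeats the Net and update definitions from above; the seed, learning rate, and epoch count are illustrative choices, not values from the text):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class Net(nn.Module):                 # same definition as above
    def __init__(self):
        super().__init__()
        self.i2h = nn.Linear(2, 1)    # input to hidden
        self.all2o = nn.Linear(3, 1)  # inputs + hidden to output

    def forward(self, x):
        h = torch.sigmoid(self.i2h(x))
        x2 = torch.cat((x, h))        # combine inputs with hidden
        return torch.sigmoid(self.all2o(x2))

def update(pattern, target, net, optimizer):   # same training step as above
    optimizer.zero_grad()
    output = net(pattern)
    loss = F.mse_loss(output, target)
    loss.backward()
    optimizer.step()

torch.manual_seed(0)
net = Net()
optimizer = optim.SGD(net.parameters(), lr=0.5)  # illustrative learning rate

# The four XOR patterns and their targets
patterns = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
targets = torch.tensor([[0.], [1.], [1.], [0.]])

for epoch in range(2000):
    for pattern, target in zip(patterns, targets):
        update(pattern, target, net, optimizer)

for pattern in patterns:
    print(pattern.tolist(), net(pattern).item())
```

Whether the network actually solves XOR after training depends on the initialization and hyperparameters; in practice you would monitor the loss as training proceeds.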

Learn More

If you'd like to go deeper into the math behind neural networks, here are some excellent resources: