Inputs & Target

x1 = 0.6
x2 = -0.4
y  = 1  (target label)

The network receives two input features x1 and x2. We also know the correct label y (0 or 1). Everything else is computed from these values and the current weights.

∂ Backprop

2→2→1 network · tanh · sigmoid · BCE

Inputs

x₁ = 0.60
x₂ = -0.40
y (target) = 1

Weights

Hidden neuron 1: w11, w12
Hidden neuron 2: w21, w22
Output neuron: w31, w32

Live Values

ŷ (prediction) = 0.6653
L (loss) = 0.4076
∂L/∂z3 = -0.3347

About

Step through every computation in a 2→2→1 network — forward pass, then backward pass via the chain rule.

What Is Backpropagation?

Backpropagation is the algorithm that makes neural network training possible. At its core, backprop answers a simple but critical question: how much did each weight in the network contribute to the overall error? Once you know that, you can adjust every weight in the right direction to reduce the error and make better predictions next time.

Despite its importance, backpropagation is not some mysterious black-box technique. It is simply the chain rule of calculus applied systematically through the layers of a neural network. If you can take a derivative and multiply numbers together, you can understand backprop. The algorithm was popularized in a landmark 1986 paper by Rumelhart, Hinton, and Williams, and it remains the backbone of how nearly every modern neural network learns, from simple classifiers to state-of-the-art transformers.

The Forward Pass

Before a network can learn, it needs to make a prediction. During the forward pass, data flows from the input layer through the hidden layers to the output. At each neuron, the incoming values are multiplied by their corresponding weights, summed together with a bias term, and then passed through an activation function like tanh, ReLU, or sigmoid.

In the 2→2→1 network shown in the visualizer above, the two inputs (x1 and x2) are each connected to both hidden neurons. Each hidden neuron computes a weighted sum plus bias, then applies the tanh activation function to introduce nonlinearity. The two hidden activations are then fed to a single output neuron, which applies the sigmoid function to produce a prediction between 0 and 1.
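The forward pass described above can be sketched in a few lines of Python. This is a minimal sketch, not the visualizer's actual code: the weight values and the zero biases are hypothetical, and the indexing convention (w11 and w12 feed hidden neuron 1, w21 and w22 feed hidden neuron 2, w31 and w32 feed the output) is an assumption based on how the weights are grouped in the panel.

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(x1, x2, w, b):
    """Forward pass of a 2-2-1 network with tanh hidden units and a sigmoid output."""
    # Hidden pre-activations: weighted sum of the inputs plus a bias.
    z1 = w["w11"] * x1 + w["w12"] * x2 + b["b1"]
    z2 = w["w21"] * x1 + w["w22"] * x2 + b["b2"]
    # Hidden activations.
    h1, h2 = math.tanh(z1), math.tanh(z2)
    # Output pre-activation and prediction.
    z3 = w["w31"] * h1 + w["w32"] * h2 + b["b3"]
    y_hat = sigmoid(z3)
    return {"z1": z1, "z2": z2, "h1": h1, "h2": h2, "z3": z3, "y_hat": y_hat}

# Hypothetical parameters, chosen only for illustration.
weights = {"w11": 0.5, "w12": -0.3, "w21": 0.8, "w22": 0.2, "w31": 1.0, "w32": -0.6}
biases = {"b1": 0.0, "b2": 0.0, "b3": 0.0}

out = forward(0.6, -0.4, weights, biases)  # x1 and x2 from the example above
for name, value in out.items():
    print(f"{name} = {value:.4f}")
```

Because these weights are made up, the printed prediction will not match the Live Values shown in the tool. Returning every intermediate value is deliberate: the backward pass needs h1, h2, and ŷ again.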

The final prediction is compared against the true target value using a loss function. This tool uses binary cross-entropy loss, the standard choice for binary classification. The loss is a single number that quantifies how wrong the prediction is: it approaches zero as the prediction approaches the target, and grows without bound as the prediction moves toward the wrong label.
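As a small illustration, binary cross-entropy for a single example can be written as below. The function name `bce` and the clamping constant `eps` are illustrative choices, not something taken from the tool.

```python
import math

def bce(y_hat, y, eps=1e-12):
    """Binary cross-entropy for one prediction y_hat in (0, 1) and a label y in {0, 1}."""
    # Clamp the prediction so log(0) can never occur; eps is an illustrative choice.
    y_hat = min(max(y_hat, eps), 1.0 - eps)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

# Prediction and target taken from the example values above.
loss = bce(0.6653, 1)
print(f"loss = {loss:.4f}")

# The loss shrinks as the prediction approaches the target and grows as it moves away.
for p in (0.99, 0.5, 0.01):
    print(f"y_hat = {p}: loss = {bce(p, 1):.4f}")
```

With ŷ = 0.6653 and y = 1 this gives approximately 0.4075, which agrees with the 0.4076 shown under Live Values up to the rounding of ŷ.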

The Backward Pass

The backward pass is where the learning signal is computed. Starting from the loss, gradients flow backward through the network, layer by layer, from output to input. At each node, the chain rule is applied to decompose the gradient into local derivatives. Each weight receives a gradient whose sign gives the direction to move and whose magnitude says how sensitive the loss is to that weight. An optimizer such as gradient descent then scales that gradient by a learning rate to decide how far to move.

The backward pass starts by computing how the loss changes with respect to the network's output (the prediction). Then it asks: how does the output change with respect to the pre-activation value? How does the pre-activation change with respect to each weight and each hidden activation? This chain of questions continues backward through every layer until every weight has a gradient.
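The chain of questions above can be sketched as one function that runs the forward pass and then walks backward from the loss. This is a sketch under stated assumptions rather than the tool's implementation: biases are omitted for brevity, the weights and the learning rate are hypothetical, and the weight indexing follows the grouping in the panel.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward_backward(x1, x2, y, w):
    """Return (loss, grads) for a 2-2-1 tanh/sigmoid network with BCE loss.

    Biases are left out to keep the sketch short.
    """
    # Forward pass.
    z1 = w["w11"] * x1 + w["w12"] * x2
    z2 = w["w21"] * x1 + w["w22"] * x2
    h1, h2 = math.tanh(z1), math.tanh(z2)
    z3 = w["w31"] * h1 + w["w32"] * h2
    y_hat = sigmoid(z3)
    loss = -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

    # Backward pass, output layer first.
    dz3 = y_hat - y  # dL/dz3: the BCE and sigmoid derivatives combined
    grads = {"w31": dz3 * h1, "w32": dz3 * h2}
    # Push the gradient back into the hidden activations.
    dh1, dh2 = dz3 * w["w31"], dz3 * w["w32"]
    # Through tanh: d tanh(z)/dz = 1 - tanh(z)^2.
    dz1, dz2 = dh1 * (1.0 - h1 ** 2), dh2 * (1.0 - h2 ** 2)
    # Down to the hidden-layer weights: the local derivative is the input value.
    grads.update({"w11": dz1 * x1, "w12": dz1 * x2,
                  "w21": dz2 * x1, "w22": dz2 * x2})
    return loss, grads

# Hypothetical weights, chosen only for illustration.
weights = {"w11": 0.5, "w12": -0.3, "w21": 0.8, "w22": 0.2, "w31": 1.0, "w32": -0.6}
loss, grads = forward_backward(0.6, -0.4, 1, weights)

# One plain gradient-descent step with a hypothetical learning rate.
lr = 0.1
stepped = {name: value - lr * grads[name] for name, value in weights.items()}
new_loss, _ = forward_backward(0.6, -0.4, 1, stepped)
print(f"loss before step: {loss:.4f}, after step: {new_loss:.4f}")
```

Note that `dz3` is computed once and then reused for both output weights and for both hidden activations. Moving each weight a small step against its gradient should lower the loss on this example.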

The Chain Rule in Action

The chain rule is the mathematical engine that powers backpropagation. It says that if a variable L depends on a variable ŷ, which depends on z3, which depends on w31, then the derivative of L with respect to w31 is the product of three local derivatives: ∂L/∂w31 = (∂L/∂ŷ) · (∂ŷ/∂z3) · (∂z3/∂w31).

Each factor in this product is a simple, local computation. The derivative of the loss with respect to the prediction depends only on the loss function. The derivative of sigmoid depends only on its output value. The derivative of a weighted sum with respect to a weight is just the input value. Backpropagation's elegance is that it computes these local derivatives once and then reuses them as gradients propagate through the graph.
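The three local factors can be evaluated one at a time for the example values. In the sketch below, ŷ and y come from the example above, while the hidden activation `h1` is a hypothetical value, since the page does not list it.

```python
# Prediction and target from the example above.
y_hat, y = 0.6653, 1.0

# Local derivative of BCE with respect to the prediction.
dL_dyhat = -y / y_hat + (1.0 - y) / (1.0 - y_hat)
# Local derivative of sigmoid, written in terms of its output.
dyhat_dz3 = y_hat * (1.0 - y_hat)
# Local derivative of z3 = w31*h1 + w32*h2 (+ bias) with respect to w31 is the input h1.
h1 = 0.4  # hypothetical hidden activation, not taken from the source
dz3_dw31 = h1

# Multiply the local factors together.
dL_dz3 = dL_dyhat * dyhat_dz3
dL_dw31 = dL_dz3 * dz3_dw31
print(f"dL/dz3  = {dL_dz3:.4f}")
print(f"dL/dw31 = {dL_dw31:.4f}")
```

The first two factors multiply out to ŷ − y = −0.3347, which is the ∂L/∂z3 value shown under Live Values. That simplification is why sigmoid and binary cross-entropy are so often paired.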

This reuse is what makes backprop efficient. Rather than computing a separate forward pass for each weight to estimate its gradient numerically, backprop computes all gradients in a single backward sweep. For a network with millions of parameters, this is the difference between practical and impossible.
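For contrast, here is a sketch of the numerical alternative, using central finite differences. The weights are hypothetical, biases are omitted, and the `calls` counter is only there to show how many forward passes the approach needs.

```python
import math

calls = 0  # counts full forward passes

def loss_fn(w, x1=0.6, x2=-0.4, y=1.0):
    """One forward pass of the 2-2-1 network (biases omitted), returning the BCE loss."""
    global calls
    calls += 1
    h1 = math.tanh(w["w11"] * x1 + w["w12"] * x2)
    h2 = math.tanh(w["w21"] * x1 + w["w22"] * x2)
    y_hat = 1.0 / (1.0 + math.exp(-(w["w31"] * h1 + w["w32"] * h2)))
    return -(y * math.log(y_hat) + (1 - y) * math.log(1.0 - y_hat))

def numerical_grads(w, eps=1e-6):
    """Estimate every gradient by nudging one weight at a time (central differences)."""
    grads = {}
    for name in w:
        up = dict(w)
        up[name] += eps
        down = dict(w)
        down[name] -= eps
        # Two forward passes per weight.
        grads[name] = (loss_fn(up) - loss_fn(down)) / (2 * eps)
    return grads

# Hypothetical weights, chosen only for illustration.
weights = {"w11": 0.5, "w12": -0.3, "w21": 0.8, "w22": 0.2, "w31": 1.0, "w32": -0.6}
num_grads = numerical_grads(weights)
print(f"forward passes used: {calls} for {len(weights)} weights")
```

Six weights cost twelve forward passes here, and the count grows linearly with the number of parameters. Finite differences remain useful as a slow but independent check on a hand-written backward pass.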

Why Backpropagation Matters

Before backpropagation was widely adopted, training neural networks with more than one or two layers was impractical. Earlier methods either could not compute gradients for hidden layers at all, or required extremely expensive numerical approximations. Backprop changed this by making gradient computation linear, O(n), in the number of operations in the network, so a backward pass costs only a small constant multiple of the forward pass itself.

This efficiency helped make deep learning practical. Modern networks can have hundreds of layers and billions of parameters, yet backpropagation still computes every gradient in a single backward pass. Combined with GPU acceleration and large datasets, backprop-trained networks now match or approach human-level performance on many benchmark tasks in vision, language, and speech.

Understanding backpropagation deeply gives you intuition for why certain architectures work, why gradients can vanish or explode in deep networks, and why techniques like batch normalization, skip connections, and careful weight initialization exist. It is the single most important algorithm to understand if you want to build or debug neural networks.
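The vanishing-gradient effect mentioned above follows directly from the chain rule: every layer multiplies the gradient by another local factor. The sketch below uses a deliberately simplified toy model, a chain of scalar tanh layers that all share one hypothetical weight, to show how quickly that product can shrink.

```python
import math

def gradient_through_chain(depth, w=0.5, x=0.6):
    """Return d h_depth / d x for a chain of scalar layers h = tanh(w * h_prev)."""
    h, grad = x, 1.0
    for _ in range(depth):
        h = math.tanh(w * h)
        # Chain rule: each layer multiplies the gradient by w * (1 - h^2).
        grad *= w * (1.0 - h ** 2)
    return grad

for depth in (1, 5, 10, 20):
    print(f"depth {depth:2d}: gradient = {gradient_through_chain(depth):.3e}")
```

With w = 0.5, each factor is at most 0.5, so the gradient reaching the input shrinks at least geometrically with depth. Real networks use weight matrices rather than a single shared scalar, but the same multiplicative structure is what the techniques listed above are designed to keep under control.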

Step through the computation above to see exactly how each gradient is computed. Change the inputs, adjust the weights, and watch how every value in the network responds. There is no better way to build intuition for how neural networks learn.