A neural network is a computational model loosely inspired by the human brain, built from layers of simple units called neurons. Each neuron receives one or more inputs, multiplies each input by a learned weight, sums the results, adds a bias term, and then passes the total through a nonlinear activation function. This simple operation, repeated across dozens or even millions of neurons arranged in successive layers, creates a remarkably powerful function approximator capable of learning complex patterns from data.
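The computation performed by one neuron can be written out in a few lines. This is a minimal sketch, not the playground's implementation: the function names and the example weights are chosen purely for illustration, and sigmoid stands in for whichever activation is selected.

```python
import math

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs, plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)

# Illustrative values only: 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0
output = neuron(inputs=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(output)  # sigmoid(0.0) = 0.5
```

A layer is many such neurons reading the same inputs, and a network is several layers where each one reads the outputs of the previous one.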
At its core, a neural network is just a mathematical function that maps inputs to outputs. The "deep" in deep learning refers to networks with multiple hidden layers between the input and output. Stacking layers allows the network to learn hierarchical representations: early layers detect simple features, while deeper layers combine those features into increasingly abstract concepts. This is why deep neural networks excel at tasks like image recognition, natural language processing, and game playing.
For classification tasks, a neural network learns to carve up the input space into distinct regions, each corresponding to a different class. The borders between these regions are called decision boundaries. A single neuron can only produce a straight line (or, in higher dimensions, a hyperplane) as a boundary, because its output depends on the inputs only through one weighted sum. But once you add hidden layers with nonlinear activations, the network can bend and curve these boundaries into complex shapes.
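XOR is the standard small example of a problem that no single straight line can separate, yet one hidden layer solves it. The sketch below uses two ReLU hidden neurons with weights picked by hand for illustration. A trained network would find its own weights, and they would likely look different.

```python
def relu(z):
    return max(0.0, z)

def xor_net(x1, x2):
    # Hidden layer: two ReLU neurons with the same input weights (1, 1)
    # but different biases (0 and -1). Weights are hand-picked, not learned.
    h1 = relu(x1 + x2)
    h2 = relu(x1 + x2 - 1.0)
    # Output neuron: a linear combination of the two hidden activations.
    return h1 - 2.0 * h2

# Threshold the output at 0.5 to get a class label for each corner.
predictions = {(a, b): int(xor_net(a, b) > 0.5) for a in (0, 1) for b in (0, 1)}
print(predictions)
```

The two hidden neurons each contribute one "kink", and the output neuron combines them into a boundary made of two parallel lines, with class 1 in the band between them.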
A network with a single hidden layer and enough neurons can, in theory, approximate any continuous function on a bounded region of the input space to arbitrary accuracy, a result known as the universal approximation theorem. The theorem guarantees that suitable weights exist, not that training will find them. In practice, adding more layers often works better than making a single layer very wide: deeper architectures can often represent the same functions with fewer total parameters. The decision boundary visualization in the playground above shows this happening in real time as the network trains, letting you see how the model separates the two classes.
Training a neural network means systematically adjusting its weights and biases to minimize a loss function that measures prediction error. The process follows a clear cycle: the network sees a batch of training examples, computes predictions through a forward pass, measures how wrong those predictions are using the loss function (such as binary cross-entropy for classification), and then updates its parameters via gradient descent.
The gradients, computed efficiently through backpropagation, tell each weight how much it contributed to the error and in which direction it should change. When you see the loss curve decreasing in the chart, it means the network is learning: each round of updates brings the predictions closer to the true labels. The speed and smoothness of this descent depend heavily on the learning rate, the optimizer, and the architecture.
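The cycle of forward pass, loss, gradients, and update can be sketched for the smallest possible case: a single sigmoid neuron trained with binary cross-entropy. The four data points, the learning rate, and the epoch count are assumptions made for this example, and with only one neuron backpropagation reduces to a single application of the chain rule.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(p, y, eps=1e-12):
    """Binary cross-entropy for one example; p is clipped to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

# Toy data (illustrative): negative inputs are class 0, positive are class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

w, b, lr = 0.0, 0.0, 0.5
losses = []
for epoch in range(100):
    # Forward pass over the whole (tiny) batch.
    preds = [sigmoid(w * x + b) for x in xs]
    losses.append(sum(bce(p, y) for p, y in zip(preds, ys)) / len(xs))
    # Gradients: for sigmoid followed by BCE, d(loss)/dz simplifies to (p - y).
    grad_w = sum((p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    grad_b = sum(p - y for p, y in zip(preds, ys)) / len(xs)
    # Gradient descent update: step against the gradient.
    w -= lr * grad_w
    b -= lr * grad_b

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

Plotting `losses` against the epoch number gives the same kind of decreasing curve the chart shows.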
The number of hidden layers and neurons per layer determines the network's capacity, or how complex a function it can represent. More neurons give the model more flexibility to fit intricate patterns, but they also increase the risk of overfitting, where the model memorizes training noise rather than learning generalizable features. A good rule of thumb is to start with a simple architecture and only add complexity if the model clearly underfits the training data.
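One rough proxy for capacity is the number of trainable parameters. For a fully connected network, each layer contributes one weight per input-output pair plus one bias per output. The helper below is hypothetical, and the two layer layouts are arbitrary examples rather than recommended settings.

```python
def count_parameters(layer_sizes):
    """Count weights and biases in a fully connected network.

    layer_sizes lists the width of every layer, input first, output last.
    """
    return sum(
        n_in * n_out + n_out  # weight matrix plus bias vector
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )

wide = count_parameters([2, 64, 1])    # one wide hidden layer
deep = count_parameters([2, 8, 8, 1])  # two narrow hidden layers
print(wide, deep)  # 257 105
```

Parameter count is only a coarse guide: two networks with the same count can behave quite differently depending on depth and activation.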
The activation function determines the type of nonlinearity each neuron introduces. ReLU (Rectified Linear Unit) is a common default: it is computationally cheap, its gradient is exactly 1 for positive inputs so it does not saturate there, and it works well in most situations. Sigmoid and tanh produce bounded outputs and are useful for specific cases, but their gradients shrink toward zero for large positive or negative inputs, which can slow learning in deep networks. The choice of activation affects both what the network can learn and how efficiently gradients flow during training.
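Saturation is easy to see by evaluating each activation's derivative at a large input. The sketch below uses the standard definitions of the three functions. The input value 10 is an arbitrary choice for illustration.

```python
import math

def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Slope is 1 for positive inputs and 0 otherwise.
    return 1.0 if z > 0 else 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # peaks at 0.25 when z = 0

def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2  # peaks at 1.0 when z = 0

for name, grad in [("relu", relu_grad), ("sigmoid", sigmoid_grad), ("tanh", tanh_grad)]:
    print(f"{name:8s} gradient at z=0.5: {grad(0.5):.4f}   at z=10: {grad(10.0):.2e}")
```

During backpropagation these derivatives are multiplied layer by layer, so a few saturated sigmoid or tanh units in a row can shrink the gradient to almost nothing.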
The optimizer controls how weight updates are computed from gradients. Plain SGD (stochastic gradient descent) is straightforward: it multiplies the gradient by the learning rate and subtracts the result from the weight. SGD with momentum accumulates a velocity term that helps the optimizer push through flat regions and noisy gradients. Adam goes further by maintaining adaptive per-parameter learning rates based on first and second moment estimates of the gradient, which often leads to faster and more stable convergence. The playground lets you switch between optimizers and see the difference in training dynamics immediately.
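The three update rules can be compared on a single scalar weight. This is a sketch under stated assumptions: the test function f(w) = w², the starting point, the learning rate, and the helper names are all chosen for illustration, the momentum variant shown is one common formulation among several, and the Adam defaults follow the commonly used values (0.9, 0.999, 1e-8).

```python
import math

def sgd_step(w, grad, lr):
    return w - lr * grad

def momentum_step(w, grad, state, lr, beta=0.9):
    # Velocity is a decaying sum of past gradients.
    state["v"] = beta * state.get("v", 0.0) + grad
    return w - lr * state["v"]

def adam_step(w, grad, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    t = state["t"] = state.get("t", 0) + 1
    # First moment: running mean of gradients.
    m = state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * grad
    # Second moment: running mean of squared gradients.
    v = state["v"] = beta2 * state.get("v", 0.0) + (1 - beta2) * grad * grad
    # Bias correction compensates for both moments starting at zero.
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

def minimize(step_fn, steps=200, w0=5.0):
    """Minimize f(w) = w**2, whose gradient is 2 * w."""
    w = w0
    for _ in range(steps):
        w = step_fn(w, 2.0 * w)
    return w

mom_state, adam_state = {}, {}
w_sgd = minimize(lambda w, g: sgd_step(w, g, lr=0.1))
w_mom = minimize(lambda w, g: momentum_step(w, g, mom_state, lr=0.1))
w_adam = minimize(lambda w, g: adam_step(w, g, adam_state, lr=0.1))
print(w_sgd, w_mom, w_adam)
```

On a smooth bowl like this one all three reach the minimum. The differences become more visible on noisy or poorly scaled loss surfaces.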
One of the most important concepts in machine learning is the bias-variance tradeoff. Underfitting (high bias) occurs when the model is too simple to capture the underlying pattern: both training and validation loss remain high. Overfitting (high variance) occurs when the model is too complex relative to the amount of training data: training loss keeps dropping, but validation loss starts to rise because the network has memorized noise in the training set rather than learning the true signal.
The playground displays both training loss and validation loss curves side by side, making it easy to diagnose these problems. When you see validation loss diverging upward while training loss continues to fall, that is the classic signature of overfitting. Solutions include using a simpler architecture, adding regularization, gathering more data, or stopping training early. Conversely, if both curves plateau at a high value, the model is likely underfitting and may need more capacity or longer training.
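Stopping early can be automated by watching the validation curve. The sketch below shows one common patience-based rule. The function name, the patience value, and the loss values are made up for illustration, and the playground may not implement this rule at all.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    Training stops once `patience` epochs have passed without a new best loss.
    """
    best_loss = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    # Never ran out of patience: training used every epoch.
    return best_epoch, len(val_losses) - 1

# Illustrative curve: validation loss falls, then turns upward.
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70]
best_epoch, stop_epoch = early_stopping_epoch(val_losses, patience=2)
print(best_epoch, stop_epoch)  # 2 4
```

In a real training loop the weights from `best_epoch` would be kept, since later epochs have already begun to overfit.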
Try different architectures in the playground above to see how they affect the decision boundary. Experiment with the number of layers, neurons per layer, activation functions, and optimizers to build intuition for how these choices shape what a neural network can learn.