Neural Networks, Mathematically: From Neurons to Functions

Most people never start their journey into AI by opening a math textbook. They begin with libraries, tutorials, and impressive demos — and only later wonder: “But what is actually happening inside?”

That’s completely normal. And it’s exactly why this article exists.

A neural network is not magic, nor is it an incomprehensible black box. At its core, it is a remarkably simple idea built from just two things: linear algebra and simple nonlinear switches. These are stacked layer by layer, and the whole system learns by slowly adjusting millions of numbers.

In this essay, we will walk through every single step as if you have never seen a matrix or a derivative before. We’ll begin with one artificial neuron — literally how a bunch of input numbers get turned into an output number. We’ll go slowly and carefully: how weights are chosen using variance and standard deviation, why bias is usually initialized to zero or a small constant, what activation functions really do, and how the entire network learns through calculus and gradient descent.

By the end, you will understand not just how neural networks work, but why they work the way they do — with clear equations, hand-calculated examples, and patient explanations.

If you’re ready to move past “it just works” and truly see the machinery, let’s begin.

The Basic Building Block – A Single Neuron (Perceptron)

Let’s start with the smallest possible unit in any neural network: a single artificial neuron, also called a perceptron.

Imagine a neuron that receives several input values. Each input represents a piece of information—for example, pixel intensity in an image or a feature like “age” or “income” in a dataset.

Inputs, Weights, and Biases

The neuron does two simple things with these inputs:

  1. It multiplies each input by a weight.
  2. It adds a bias term and then applies an activation function.

What Exactly Are the Inputs? (They Are Always Numbers)

Every input that reaches a neuron is a number (or a vector of numbers). The neuron itself never sees raw files, images, audio clips, or text. It only works with numerical data.

Who converts the raw data into numbers? You (or your code) do, in a preprocessing step that happens before the data ever reaches the neural network. The neuron does not convert anything by itself. In practice, libraries like NumPy, OpenCV, PyTorch, TensorFlow, or Hugging Face handle most of this conversion for you.

Here’s how different types of real-world data are turned into numbers:

  • Images and pixels: An image is a grid of pixels. Each pixel has intensity values (0–255 for grayscale, or three such values for RGB color). A 28×28 grayscale image becomes a vector of 784 numbers. Modern networks often keep the 2D/3D shape as a tensor (height × width × channels) instead of flattening everything. The preprocessing step usually normalizes values to the range 0–1 or −1 to 1 so the math works better.
  • Audio files: Audio is sampled into thousands of amplitude values per second (a waveform). It is often converted into a spectrogram (a 2D, image-like representation of frequencies over time) or Mel-frequency cepstral coefficients (MFCCs). The result is a matrix or vector of numbers that the neuron can process.
  • Text files: Text is first split into tokens (words or sub-words). Each token is then mapped to a dense vector of numbers using word embeddings (e.g., Word2Vec, GloVe, or learned embeddings from transformers). Example: the word “cat” might become the vector [0.12, −0.45, 0.78, …] of length 300 or 768. This vector becomes the input to the neuron.
  • Videos: A video is a sequence of image frames plus a time dimension. Each frame is processed exactly like an image (pixels → numbers), and the network treats the whole video as a 4D tensor (frames × height × width × channels).

In short: no matter the original format, everything is turned into numbers before the neuron layer. The neuron only ever “sees” floats.
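To make the image case concrete, here is a minimal preprocessing sketch using NumPy. The image here is randomly generated purely for illustration; a real pipeline would load an actual file with a library like OpenCV or PIL, but the flattening and normalization steps are the same.

```python
import numpy as np

# A stand-in for a 28x28 grayscale image with pixel intensities 0-255.
image = np.random.randint(0, 256, size=(28, 28), dtype=np.uint8)

# Flatten the 2D grid into a vector of 784 numbers.
flat = image.reshape(-1)

# Normalize to the 0-1 range so the downstream math behaves well.
x = flat.astype(np.float32) / 255.0

print(x.shape)   # (784,)
```

After this step, `x` is exactly the kind of numeric vector a neuron layer consumes.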

Weights: How They Are Given and Why Randomly

Weights and biases are the learnable parameters of the neuron. At the very beginning of training, you don’t know the right values yet, so you initialize them randomly.

Why random? If you started with all weights as zero (or all the same number), every neuron in a layer would compute exactly the same output. The network could never learn different features.
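You can see this symmetry problem directly in a small NumPy sketch (the layer sizes and values are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])            # three inputs

# Every weight starts at the same value: both neurons compute the same output.
W_same = np.full((2, 3), 0.5)
z_same = W_same @ x

# Random initialization breaks the symmetry: the neurons now differ.
rng = np.random.default_rng(0)
W_rand = rng.normal(0.0, 0.5, size=(2, 3))
z_rand = W_rand @ x

print(z_same)    # both entries identical: [3. 3.]
```

With identical weights, the two neurons would also receive identical gradients during training, so they could never specialize.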

The math behind common initialization methods:

  1. Xavier/Glorot initialization (good for sigmoid/tanh activations). Weights are drawn from a normal distribution with variance:
\text{Var}(w) = \frac{2}{n_{\text{in}} + n_{\text{out}}}
  • where n_in is the number of inputs to the neuron (fan-in) and n_out is the number of neurons in the next layer (fan-out).
  2. He/Kaiming initialization (best for ReLU and modern networks):
\text{Var}(w) = \frac{2}{n_{\text{in}}}

This keeps the variance of activations roughly constant across layers, preventing vanishing or exploding gradients.
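Both variance formulas turn into standard deviations with a square root; a quick NumPy sketch with illustrative layer sizes:

```python
import numpy as np

n_in, n_out = 3, 2   # illustrative fan-in and fan-out

# Xavier/Glorot: Var(w) = 2 / (n_in + n_out)
xavier_std = float(np.sqrt(2.0 / (n_in + n_out)))

# He/Kaiming: Var(w) = 2 / n_in
he_std = float(np.sqrt(2.0 / n_in))

print(round(xavier_std, 3))   # 0.632
print(round(he_std, 3))       # 0.816
```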

How the Numbers Are Generated – The Exact Procedure

  1. Calculate the variance using the Xavier/Glorot formula:
\text{Var}(w) = \frac{2}{n_{\text{in}} + n_{\text{out}}}

Let’s take a concrete example:

n_in = 3 neurons
n_out = 2 neurons

\text{Var}(w) = \frac{2}{3 + 2} = 0.4

  2. Compute the standard deviation (the “width” of the bell curve):
\sigma = \sqrt{\text{Var}(w)}

Standard deviation: σ = √0.4 ≈ 0.632

  3. Draw each individual weight from a normal distribution with
    • Mean = 0
    • Standard deviation: σ ≈ 0.632 (i.e., variance = 0.4 in the example)

Why mean = 0? We want weights to be centered around zero — some positive, some negative. This prevents the network from starting with an overall bias in one direction (all positive or all negative).

Does “Normal(mean=0, std=0.632)” mean the range is 0 to 0.632? No. This is the most common confusion.

The normal distribution is symmetric around the mean (0).

  • About 68% of the values fall between −0.632 and +0.632
  • About 95% fall between −1.26 and +1.26 (roughly ±2 × std)
  • About 99.7% fall between −1.90 and +1.90 (roughly ±3 × std)

The framework (PyTorch, TensorFlow, or NumPy) generates independent random numbers from Normal(mean=0, std=0.632).
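That sampling step looks like this in a minimal NumPy sketch (the seed is arbitrary, chosen only for reproducibility):

```python
import numpy as np

n_in, n_out = 3, 2
sigma = float(np.sqrt(2.0 / (n_in + n_out)))   # Xavier std, ≈ 0.632

rng = np.random.default_rng(42)
# One weight per (neuron, input) pair: a 2x3 matrix for this layer.
W = rng.normal(loc=0.0, scale=sigma, size=(n_out, n_in))

print(W.shape)   # (2, 3)
```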

Here is one possible 2×3 weight matrix produced this way (generated with exactly this procedure):

\mathbf{W} = \begin{bmatrix} 0.314 & -0.087 & 0.410 \\ 0.963 & -0.148 & -0.148 \end{bmatrix}

Important:

  • The variance formula controls the spread, not the range limits.
  • Values can (and sometimes do) go slightly beyond ±3σ, though they become rare.
  • This controlled spread prevents the activations from exploding or vanishing when they pass through many layers.

Biases are usually initialized to zero or a small constant (e.g., 0.01). They don’t need the same random treatment as weights.

During training, these random starting values are gradually adjusted using backpropagation and gradient descent until they become useful.

How Bias Works: Per Neuron, But Added Across the Whole Layer

Each individual neuron has its own bias (a single number). It is not shared across the entire layer.

However, when you implement a whole layer of neurons at once (which is how it’s done in practice), you use vectorized math. The bias becomes a bias vector — one bias value per neuron in that layer.

In equation form, for a layer with n neurons receiving input vector x:

\mathbf{z} = \mathbf{W} \mathbf{x} + \mathbf{b}
  • W is the weight matrix (rows = neurons in this layer, columns = inputs),
  • b is the bias vector (one bias per neuron),
  • z is the pre-activation vector for the entire layer.

In code, the bias vector is broadcasted and added to every row automatically. So mathematically each neuron still gets its own personal bias added to its own weighted sum — it’s just written more efficiently for the whole layer at once.
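Here is a small NumPy sketch of that broadcasting (the weights, biases, and inputs are illustrative):

```python
import numpy as np

# Illustrative layer: 2 neurons, 3 inputs.
W = np.array([[0.314, -0.087, 0.410],
              [0.963, -0.148, -0.148]])
b = np.array([0.0, 0.01])          # one bias per neuron
x = np.array([1.0, 2.0, 3.0])

# Vectorized form: the bias vector is added elementwise,
# so each neuron receives its own bias.
z = W @ x + b

# The same computation, neuron by neuron:
z0 = W[0] @ x + b[0]
z1 = W[1] @ x + b[1]

print(np.allclose(z, [z0, z1]))   # True
```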

The Full Computation Inside One Neuron

Weights act like volume knobs. A large positive weight amplifies that input’s importance. A negative weight reduces or reverses its effect. The bias shifts the entire result up or down, giving the neuron more flexibility.

Here is the math in its simplest scalar form (one input):

z = w \cdot x + b

where:

  • x is the input value,
  • w is the weight for that input,
  • b is the bias,
  • z is the weighted sum (often called the pre-activation or logit).

Next, you pass z through an activation function f, which produces the neuron’s final output:

a = f(z) = f(w \cdot x + b)

This output a can then become an input to another neuron or serve as the final prediction.

Common Activation Functions

Without an activation function, the entire network would behave like one big linear equation—no matter how many layers you stack, it could only solve simple linear problems. Activation functions introduce non-linearity, which is what allows neural networks to model complex, curved patterns in data.

Here are the three most common ones you’ll meet early:

  1. Sigmoid (smooth S-shape, outputs between 0 and 1):
\sigma(z) = \frac{1}{1 + e^{-z}}
  2. ReLU (Rectified Linear Unit – very popular in modern networks because it’s fast and helps avoid vanishing gradients):
\text{ReLU}(z) = \max(0, z)
  3. Tanh (hyperbolic tangent, outputs between −1 and 1):
\tanh(z) = \frac{e^{z} - e^{-z}}{e^{z} + e^{-z}}

Each has strengths and weaknesses. For now, remember this: the activation function decides whether and how strongly the neuron “fires.”
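All three fit in a few lines of NumPy; this is a minimal sketch, not a full library implementation:

```python
import numpy as np

def sigmoid(z):
    # Smooth S-shape; output lies in (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Zero for negative inputs, identity for positive ones.
    return np.maximum(0.0, z)

def tanh(z):
    # Output lies in (-1, 1); NumPy provides this directly.
    return np.tanh(z)

print(sigmoid(0.0))   # 0.5
print(relu(-2.0))     # 0.0
print(tanh(0.0))      # 0.0
```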

A Small Numerical Example You Can Calculate by Hand

Let’s make this concrete with real numbers.

Suppose a single neuron receives two inputs:

  • x_1 = 2 (for example, a feature value)
  • x_2 = 3

The weights are w_1 = 0.5 and w_2 = −0.2, and the bias is b = 0.1. We’ll use the sigmoid activation function.

First, compute the weighted sum:

z = (0.5 \times 2) + (-0.2 \times 3) + 0.1 = 1.0 - 0.6 + 0.1 = 0.5

Now apply sigmoid:

a = \sigma(0.5) = \frac{1}{1 + e^{-0.5}} \approx 0.622

The neuron’s output is approximately 0.622. If this were part of a classification task, you might interpret values above 0.5 as “class 1” and below as “class 0”.

Try changing the weights or inputs yourself — you’ll quickly see how sensitive the output is to small changes. This sensitivity is exactly what allows the network to learn during training.
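You can check the hand calculation with a few lines of NumPy:

```python
import numpy as np

x = np.array([2.0, 3.0])     # the two inputs from the example
w = np.array([0.5, -0.2])    # the two weights
b = 0.1

z = float(w @ x) + b                 # 1.0 - 0.6 + 0.1 = 0.5
a = 1.0 / (1.0 + np.exp(-z))         # sigmoid activation

print(round(z, 3))   # 0.5
print(round(a, 3))   # 0.622
```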

From One Neuron to Layers – Feedforward Networks

Why Vector and Matrix Notation Makes Life Easier

Real neurons rarely have just one input. They usually receive many inputs at once.

Instead of writing separate equations for each input, we use compact vector notation. Let’s say the neuron receives three inputs x_1, x_2, x_3. We represent the inputs as a vector x and the weights as a vector w:

z = w^\top x + b

or, written fully:

z = w_1 x_1 + w_2 x_2 + w_3 x_3 + b

This scales beautifully. Even if there are 1,000 inputs, the notation stays clean and the computer performs the calculation with fast matrix operations.
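A quick NumPy sketch showing that the compact form and the written-out sum agree (weights and inputs are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])        # three inputs
w = np.array([0.4, -0.1, 0.25])      # illustrative weights
b = 0.05

# Compact vector form: z = w^T x + b
z = float(w @ x) + b

# The same sum written out term by term: w1*x1 + w2*x2 + w3*x3 + b
z_long = w[0]*x[0] + w[1]*x[1] + w[2]*x[2] + b

print(abs(z - z_long) < 1e-12)   # True
```

The `@` operator performs the dot product, so the code stays one line no matter how many inputs the neuron has.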