Why LLM mathematics comes after probability and tensors

In the previous sections, we studied:

$$ \text{vectors},\quad \text{matrices},\quad \text{tensors}, $$

then probability distributions, Gaussian noise, Markov chains, stochastic processes, and diffusion models.

Large language models use all of these ideas.

An LLM is not just a “text generator.” Mathematically, it is a large parameterized probability model over sequences of tokens.

A sentence becomes a sequence:

$$ x_1,x_2,\dots,x_T. $$

Each token becomes a vector:

$$ e_t \in \mathbb{R}^{d_{\text{model}}}. $$

A full sequence becomes a matrix:

$$ X \in \mathbb{R}^{T \times d_{\text{model}}}. $$

A batch becomes a tensor:

$$ X \in \mathbb{R}^{B \times T \times d_{\text{model}}}. $$

A transformer maps this tensor into another tensor, and the final layer produces a probability distribution over the next token.

The key mathematical goal is:

$$ p_\theta(x_{t+1}\mid x_1,\dots,x_t). $$

That means:

given previous tokens, predict the probability distribution of the next token.

Modern LLMs are built on the Transformer architecture, introduced in Attention Is All You Need, which replaced recurrence and convolution with attention mechanisms and enabled much more parallel training. (arXiv)

Part I — Text as mathematics

From words to numbers

A computer cannot directly process the sentence:

Mathematics is the language of AI.

It must first convert text into numbers.

A simple but bad idea is to assign one number to each word:

| Word | Number |
|---|---|
| Mathematics | 1 |
| is | 2 |
| the | 3 |
| language | 4 |
| of | 5 |
| AI | 6 |

Then the sentence becomes:

$$ [1,2,3,4,5,6]. $$

But this has a problem.

The number 6 is not “larger” than the number 1 in any meaningful semantic way. These numbers are just identifiers.

So we need a better representation.

The usual pipeline is:

$$ \text{text} \longrightarrow \text{tokens} \longrightarrow \text{token IDs} \longrightarrow \text{embedding vectors} \longrightarrow \text{transformer computation} \longrightarrow \text{next-token probabilities}. $$

Tokens

A token is a unit of text used by the model.

A token may be:

  • a word,
  • part of a word,
  • punctuation,
  • a space,
  • a byte sequence,
  • or another subword unit.

For example, the word:

mathematics

may be a single token under one tokenizer but split into several subword tokens under another.

Subword tokenization became important because natural language has rare words, new words, names, code symbols, numbers, and multilingual forms. BPE-based subword modeling was introduced for neural machine translation to handle rare and unknown words by encoding them as subword units rather than relying on a fixed word vocabulary. (arXiv)

SentencePiece later provided a language-independent tokenizer/detokenizer that can train directly from raw sentences, without assuming pre-tokenized word sequences. (arXiv)

Vocabulary

Let the vocabulary be:

$$ \mathcal{V} {}={} \{1,2,\dots,V\}. $$

Here:

  • \(V\) is the vocabulary size,
  • each token is represented by an integer ID,
  • \(x_t \in \mathcal{V}\) is the token at position \(t\).

For example:

$$ x = [x_1,x_2,x_3,x_4] {}={} [1542,89,301,7721]. $$

This is a sequence of token IDs.

The model does not yet understand meaning. These are still just integers.

One-hot representation

A token ID can be represented as a one-hot vector.

If:

$$ V=5 $$

and the token ID is:

$$ x_t=3, $$

then the one-hot vector is:

$$ o_t = \begin{bmatrix} 0\\ 0\\ 1\\ 0\\ 0 \end{bmatrix}. $$

So:

$$ o_t \in \mathbb{R}^{V}. $$

But one-hot vectors are huge and sparse.

If:

$$ V=100{,}000, $$

then every token would be a 100,000-dimensional vector with only one nonzero entry.

So LLMs use embeddings.

Embedding matrix

An embedding matrix is:

$$ E \in \mathbb{R}^{V \times d}. $$

Here:

  • \(V\) is vocabulary size,
  • \(d=d_{\text{model}}\) is embedding dimension.

Each row of \(E\) is a learned vector for one token.

If token \(x_t\) has ID \(i\), then its embedding is:

$$ e_t = E_i. $$

Equivalently, using a one-hot vector:

$$ e_t = o_t^\top E. $$

Shape-wise:

$$ (1\times V)(V\times d)=1\times d. $$

So a token ID becomes a dense vector:

$$ e_t \in \mathbb{R}^{d}. $$

This is the first major mathematical transformation:

$$ \text{discrete token} \longrightarrow \text{continuous vector}. $$

Sequence embedding matrix

For a sequence of \(T\) tokens:

$$ x_1,x_2,\dots,x_T, $$

we get embeddings:

$$ e_1,e_2,\dots,e_T. $$

Stack them row by row:

$$ X = \begin{bmatrix} e_1^\top\\ e_2^\top\\ \vdots\\ e_T^\top \end{bmatrix} \in \mathbb{R}^{T\times d}. $$

For a batch of \(B\) sequences:

$$ X \in \mathbb{R}^{B\times T\times d}. $$

This is the tensor that enters the transformer.
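
A minimal NumPy sketch makes the shapes concrete. All sizes and the random matrix \(E\) here are hypothetical toy values:

```python
import numpy as np

# Hypothetical toy sizes for illustration.
V, d_model = 10_000, 64        # vocabulary size, embedding dimension
B, T = 4, 16                   # batch size, sequence length

E = np.random.randn(V, d_model).astype(np.float32)  # embedding matrix, one row per token

token_ids = np.random.randint(0, V, size=(B, T))     # batch of token-ID sequences

# Embedding lookup: row-indexing E turns each integer ID into a dense vector.
X = E[token_ids]                                     # shape (B, T, d_model)
print(X.shape)                                       # (4, 16, 64)
```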

Part II — Language modeling as probability

The language-modeling problem

A language model assigns probabilities to token sequences.

Given:

$$ x_1,x_2,\dots,x_T, $$

the model wants:

$$ p_\theta(x_1,x_2,\dots,x_T). $$

Using the chain rule of probability:

$$ p_\theta(x_1,\dots,x_T) {}={} \prod_{t=1}^{T} p_\theta(x_t\mid x_1,\dots,x_{t-1}). $$

This is the central factorization of autoregressive language modeling.

The model predicts each token using the previous tokens.

For example:

$$ p_\theta(\text{``AI''}\mid \text{``Mathematics for''}). $$

Autoregressive modeling

An autoregressive model generates tokens one at a time.

At step \(t\), it computes:

$$ p_\theta(x_{t+1}\mid x_1,\dots,x_t). $$

Then it chooses or samples the next token.

After generating \(x_{t+1}\), it repeats:

$$ p_\theta(x_{t+2}\mid x_1,\dots,x_t,x_{t+1}). $$

So generation is sequential:

$$ x_1 \rightarrow x_2 \rightarrow x_3 \rightarrow \cdots. $$

This sequential nature is one reason inference is expensive: generating \(K\) new tokens usually requires \(K\) decoding steps. Speculative decoding papers explicitly identify autoregressive token-by-token decoding as a major latency bottleneck. (arXiv)

Log-likelihood

Suppose the training dataset contains sequences:

$$ \mathcal{D}=\{x^{(1)},x^{(2)},\dots,x^{(N)}\}. $$

The likelihood is:

$$ L(\theta) {}={} \prod_{n=1}^{N} p_\theta(x^{(n)}). $$

Because products of many probabilities become extremely small, we use log-likelihood:

$$ \log L(\theta) {}={} \sum_{n=1}^{N} \log p_\theta(x^{(n)}). $$

For autoregressive models:

$$ \log p_\theta(x^{(n)}) {}={} \sum_{t=1}^{T_n} \log p_\theta(x_t^{(n)}\mid x_{<t}^{(n)}). $$

Training tries to maximize this quantity.

Equivalently, we minimize negative log-likelihood:

$$ \mathcal{L}(\theta) {}={} -\sum_{n=1}^{N} \sum_{t=1}^{T_n} \log p_\theta(x_t^{(n)}\mid x_{<t}^{(n)}). $$

This is the basic pretraining objective.

The Deep Learning book connects maximum likelihood learning with cross-entropy loss for neural networks. (Deep Learning Book)

Softmax distribution over vocabulary

At each position \(t\), the model outputs logits:

$$ z_t \in \mathbb{R}^{V}. $$

Each component \(z_{t,i}\) is a score for token \(i\).

To convert logits into probabilities, use softmax:

$$ p_{t,i} {}={} \frac{\exp(z_{t,i})} {\sum_{j=1}^{V}\exp(z_{t,j})}. $$

The output vector is:

$$ p_t = \operatorname{softmax}(z_t) \in \mathbb{R}^{V}. $$

It satisfies:

$$ p_{t,i}\geq 0 $$

and:

$$ \sum_{i=1}^{V}p_{t,i}=1. $$

So softmax turns scores into a probability distribution.
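
Softmax is easy to implement, but the naive form overflows for large logits. A small NumPy sketch with the standard max-subtraction trick:

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtracting the max is numerically stable and does not change the result.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])
p = softmax(logits)
print(p, p.sum())   # probabilities, and their sum: 1.0
```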

Cross-entropy loss

If the true next token is \(y_t\), then the target distribution is one-hot:

$$ q_{t,i} {}={} \begin{cases} 1, & i=y_t,\\ 0, & i\neq y_t. \end{cases} $$

The cross-entropy is:

$$ H(q_t,p_t) {}={} -\sum_{i=1}^{V}q_{t,i}\log p_{t,i}. $$

Because \(q_t\) is one-hot, this becomes:

$$ H(q_t,p_t) {}={} -\log p_{t,y_t}. $$

So for one token, the loss is:

$$ \ell_t {}={} -\log p_\theta(y_t\mid x_{<t}). $$

For a full sequence:

$$ \mathcal{L} {}={} -\sum_{t=1}^{T} \log p_\theta(x_t\mid x_{<t}). $$

Stanford CS224N lecture material presents neural-network training with cross-entropy as minimizing negative log probability of the correct class. (Stanford University)

Perplexity

Perplexity is a common language-model evaluation metric.

If the average negative log-likelihood per token is:

$$ \bar{\mathcal{L}} {}={} -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta(x_t\mid x_{<t}), $$

then perplexity is:

$$ \operatorname{PPL} {}={} \exp(\bar{\mathcal{L}}). $$

Lower perplexity means the model assigns higher probability to the correct tokens.

But perplexity is not the same as human usefulness, reasoning quality, safety, or instruction-following ability.
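
A toy NumPy sketch ties cross-entropy and perplexity together: per-token negative log-likelihood, its mean, and the exponentiated result (random logits and targets, for illustration only):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

T, V = 5, 10                     # toy sequence length and vocabulary size
logits = np.random.randn(T, V)   # one logit row per position
targets = np.random.randint(0, V, size=T)

probs = softmax(logits)
nll_per_token = -np.log(probs[np.arange(T), targets])  # -log p(correct token)
mean_nll = nll_per_token.mean()
perplexity = np.exp(mean_nll)
print(mean_nll, perplexity)
```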

Part III — The Transformer block

Why transformers changed language modeling

Before transformers, sequence models often used recurrence.

A recurrent model processes tokens one by one:

$$ h_t=f(h_{t-1},x_t). $$

This makes training harder to parallelize across sequence positions.

The transformer instead uses attention, allowing all positions in a sequence to interact through matrix operations. The original Transformer paper proposed an architecture based solely on attention mechanisms, dispensing with recurrence and convolution. (arXiv)

Input tensor

For one sequence:

$$ X\in\mathbb{R}^{T\times d}. $$

For a batch:

$$ X\in\mathbb{R}^{B\times T\times d}. $$

A transformer block maps:

$$ X \mapsto X' $$

where:

$$ X'\in\mathbb{R}^{B\times T\times d}. $$

So the shape is preserved across blocks.

A stack of \(L\) transformer blocks gives:

$$ X^{(0)} \to X^{(1)} \to \cdots \to X^{(L)}. $$

Linear projections: queries, keys, and values

Inside attention, each token vector is projected into three vectors:

$$ q_t = x_t W_Q, \qquad k_t = x_t W_K, \qquad v_t = x_t W_V. $$

For the whole sequence:

$$ Q=XW_Q, \qquad K=XW_K, \qquad V=XW_V. $$

If:

$$ X\in\mathbb{R}^{T\times d}, $$

and:

$$ W_Q,W_K,W_V\in\mathbb{R}^{d\times d_k}, $$

then:

$$ Q,K,V\in\mathbb{R}^{T\times d_k}. $$

The names are intuitive:

  • query: what this token is looking for,
  • key: what this token offers for matching,
  • value: information to pass forward.

Dot-product attention

The similarity between token \(i\) and token \(j\) is:

$$ s_{ij}=q_i^\top k_j. $$

Stacking all similarities gives:

$$ S=QK^\top. $$

Shape-wise:

$$ (T\times d_k)(d_k\times T)=T\times T. $$

So:

$$ S\in\mathbb{R}^{T\times T}. $$

Each entry \(S_{ij}\) measures how much token \(i\) attends to token \(j\).

Scaled dot-product attention

The transformer uses scaled dot-product attention:

$$ \operatorname{Attention}(Q,K,V) {}={} \operatorname{softmax} \left( \frac{QK^\top}{\sqrt{d_k}} \right)V. $$

This is the central formula of transformer mathematics.

The scaling factor:

$$ \frac{1}{\sqrt{d_k}} $$

prevents dot products from becoming too large when dimension \(d_k\) is large.

The Transformer paper introduced scaled dot-product attention and multi-head attention as core components of the architecture. (arXiv)

Attention weights

Define:

$$ A= \operatorname{softmax} \left( \frac{QK^\top}{\sqrt{d_k}} \right). $$

Then:

$$ A\in\mathbb{R}^{T\times T}. $$

Each row sums to 1:

$$ \sum_{j=1}^{T}A_{ij}=1. $$

The output is:

$$ O=AV. $$

For token \(i\):

$$ o_i {}={} \sum_{j=1}^{T}A_{ij}v_j. $$

So attention outputs a weighted average of value vectors.

Causal masking

For language generation, token \(t\) must not see future tokens.

So we use a causal mask.

Without masking, token \(i\) could attend to token \(j>i\), which would leak the answer during training.

The masked attention score is:

$$ \tilde{S}_{ij} {}={} \begin{cases} S_{ij}, & j\leq i,\\ -\infty, & j>i. \end{cases} $$

Then:

$$ A= \operatorname{softmax} \left( \frac{\tilde{S}}{\sqrt{d_k}} \right). $$

This ensures:

$$ A_{ij}=0 \quad \text{for} \quad j>i. $$

So each token only attends to itself and previous tokens.
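
The following NumPy sketch implements scaled dot-product attention with a causal mask, under the row-vector convention used above (toy sizes, random inputs):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(Q, K, V):
    T, d_k = Q.shape
    S = Q @ K.T / np.sqrt(d_k)              # scaled scores, shape (T, T)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)
    S = np.where(mask, -np.inf, S)          # block attention to future positions
    A = softmax(S)                          # rows sum to 1; A[i, j] = 0 for j > i
    return A @ V                            # weighted average of value vectors

T, d_k = 6, 8
Q, K, V = (np.random.randn(T, d_k) for _ in range(3))
O = causal_attention(Q, K, V)
print(O.shape)   # (6, 8)
```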

Multi-head attention

Instead of one attention operation, transformers use multiple heads.

For head \(h\):

$$ Q_h=XW_Q^{(h)}, \qquad K_h=XW_K^{(h)}, \qquad V_h=XW_V^{(h)}. $$

Then:

$$ O_h= \operatorname{Attention}(Q_h,K_h,V_h). $$

The heads are concatenated:

$$ O= \operatorname{Concat}(O_1,\dots,O_H)W_O. $$

Multi-head attention allows different heads to learn different relationships.

One head may focus on syntax.

Another may focus on long-range references.

Another may focus on local patterns.

Mathematically, each head is a separate learned bilinear similarity mechanism.
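
A minimal NumPy sketch of multi-head attention shows the split-into-heads reshape explicitly (toy sizes; the `split` helper is illustrative, not a standard API):

```python
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O, H):
    # X: (T, d); weight matrices: (d, d); H heads, with d % H == 0.
    T, d = X.shape
    d_h = d // H

    def split(W):
        # Project, then split the feature dimension into H heads of size d_h.
        return (X @ W).reshape(T, H, d_h).transpose(1, 0, 2)   # (H, T, d_h)

    Q, K, V = split(W_Q), split(W_K), split(W_V)
    S = Q @ K.transpose(0, 2, 1) / np.sqrt(d_h)                # (H, T, T)
    S = S - S.max(axis=-1, keepdims=True)
    A = np.exp(S); A /= A.sum(axis=-1, keepdims=True)          # softmax per head
    O = A @ V                                                  # (H, T, d_h)
    O = O.transpose(1, 0, 2).reshape(T, d)                     # concatenate heads
    return O @ W_O

T, d, H = 6, 16, 4
X = np.random.randn(T, d)
W_Q, W_K, W_V, W_O = (np.random.randn(d, d) for _ in range(4))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O, H).shape)    # (6, 16)
```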

Attention complexity

For sequence length \(T\), attention forms:

$$ QK^\top\in\mathbb{R}^{T\times T}. $$

So standard attention requires:

$$ O(T^2) $$

attention scores.

This is the main long-context bottleneck.

If \(T=4{,}000\), then:

$$ T^2=16{,}000{,}000. $$

If \(T=128{,}000\), then:

$$ T^2=16{,}384{,}000{,}000. $$

So long-context LLMs require efficient attention algorithms, memory management, KV compression, sparse patterns, or alternative sequence models.

FlashAttention addresses this by computing exact attention with an IO-aware tiled algorithm that reduces memory reads and writes between GPU memory levels. (arXiv)

FlashAttention-2 further improves speed through better work partitioning and parallelism. (arXiv)

Part IV — Positional information

Why position matters

Attention by itself is permutation-invariant.

If we only give the model a bag of token vectors, it does not know which token came first.

For example:

dog bites man

and:

man bites dog

contain the same words but different meanings.

So transformers need positional information.

Absolute positional embeddings

A simple approach is to learn a position embedding matrix:

$$ P\in\mathbb{R}^{T_{\max}\times d}. $$

For position \(t\), add:

$$ x_t = e_t+p_t. $$

So token identity and position are combined:

$$ X = E_{\text{tokens}} + P_{\text{positions}}. $$

Sinusoidal positional encoding

The original Transformer used sinusoidal positional encodings:

$$ PE_{(pos,2i)} {}={} \sin \left( \frac{pos}{10000^{2i/d}} \right), \qquad PE_{(pos,2i+1)} {}={} \cos \left( \frac{pos}{10000^{2i/d}} \right). $$

These provide deterministic position-dependent vectors.

Rotary position embedding

Rotary Position Embedding, or RoPE, applies position-dependent rotations to query and key vectors.

At a high level:

$$ q_t \mapsto R_t q_t, \qquad k_s \mapsto R_s k_s. $$

Then the attention score becomes:

$$ (R_tq_t)^\top(R_sk_s). $$

Because rotations interact through relative differences, RoPE naturally introduces relative position information into attention.

The RoFormer paper describes RoPE as encoding absolute position with a rotation matrix while incorporating explicit relative-position dependency in self-attention. (arXiv)
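
A minimal NumPy sketch of the rotation: feature dimensions are grouped into pairs \((2i, 2i+1)\), and the pair at position \(t\) is rotated by angle \(t\,\theta_i\) (toy sizes; the base \(10000\) follows the sinusoidal convention above):

```python
import numpy as np

def rope(x, base=10000.0):
    # x: (T, d) queries or keys, d even; rotate each (2i, 2i+1) pair by t * theta_i.
    T, d = x.shape
    pos = np.arange(T)[:, None]                    # (T, 1)
    theta = base ** (-np.arange(0, d, 2) / d)      # (d/2,) frequencies
    angles = pos * theta                           # (T, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                # even / odd feature pairs
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin             # 2D rotation of each pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

q = np.random.randn(8, 16)
k = np.random.randn(8, 16)
# Scores built from rotated vectors depend only on relative position differences.
scores = rope(q) @ rope(k).T
print(scores.shape)   # (8, 8)
```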

Part V — Normalization, residuals, and feed-forward networks

Residual connections

A transformer block usually uses residual connections.

If a sublayer computes:

$$ F(X), $$

then the residual output is:

$$ Y=X+F(X). $$

Residual connections help gradients flow through many layers.

They also allow a layer to learn a correction rather than a full transformation.

Layer normalization

Layer normalization normalizes a vector across its feature dimension.

For:

$$ x\in\mathbb{R}^{d}, $$

define:

$$ \mu {}={} \frac{1}{d}\sum_{i=1}^{d}x_i, \qquad \sigma^2 {}={} \frac{1}{d}\sum_{i=1}^{d}(x_i-\mu)^2. $$

Then:

$$ \operatorname{LayerNorm}(x) {}={} \gamma \odot \frac{x-\mu}{\sqrt{\sigma^2+\epsilon}} + \beta. $$

Here:

  • \(\gamma\) is learned scale,
  • \(\beta\) is learned shift,
  • \(\epsilon\) prevents division by zero.

RMSNorm

RMSNorm removes mean-centering and uses root mean square normalization:

$$ \operatorname{RMS}(x) {}={} \sqrt{ \frac{1}{d} \sum_{i=1}^{d}x_i^2 }. $$

Then:

$$ \operatorname{RMSNorm}(x) {}={} \gamma \odot \frac{x}{\operatorname{RMS}(x)+\epsilon}. $$

RMSNorm was proposed as a computationally simpler alternative to LayerNorm that keeps rescaling invariance while avoiding recentering. (arXiv)
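
Both normalizations are a few lines of NumPy. This sketch follows the formulas above, including this document's placement of \(\epsilon\) in RMSNorm:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def rms_norm(x, gamma, eps=1e-5):
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True))
    return gamma * x / (rms + eps)   # no mean-centering, no learned shift

d = 8
x = np.random.randn(d)
print(layer_norm(x, np.ones(d), np.zeros(d)))
print(rms_norm(x, np.ones(d)))
```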

Feed-forward network

After attention, each token passes through a feed-forward network.

A standard feed-forward block is:

$$ \operatorname{FFN}(x) {}={} \phi(xW_1+b_1)W_2+b_2. $$

Here:

  • \(W_1\in\mathbb{R}^{d\times d_{\text{ff}}}\),
  • \(W_2\in\mathbb{R}^{d_{\text{ff}}\times d}\),
  • \(\phi\) is a nonlinear activation.

The feed-forward network is applied independently to each token.

So attention mixes information across positions, while the FFN transforms each token representation.

Gated feed-forward networks

Many modern transformers use gated FFNs.

A common form is:

$$ \operatorname{FFN}(x) {}={} (\phi(xW_g) \odot xW_u)W_d. $$

Here:

  • \(W_g\) produces a gate,
  • \(W_u\) produces candidate features,
  • \(\odot\) is elementwise multiplication,
  • \(W_d\) projects back to model dimension.

This allows the model to control which features pass through.
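
A minimal NumPy sketch of a gated FFN, using a SwiGLU-style SiLU gate as one possible choice of \(\phi\) (the SiLU activation here is an assumption for illustration, not prescribed by the text):

```python
import numpy as np

def gated_ffn(x, W_g, W_u, W_d):
    # SwiGLU-style gate: silu(x W_g) elementwise-multiplies the candidate x W_u.
    def silu(z):
        return z / (1.0 + np.exp(-z))
    return (silu(x @ W_g) * (x @ W_u)) @ W_d

d, d_ff = 16, 64
x = np.random.randn(d)
W_g, W_u = np.random.randn(d, d_ff), np.random.randn(d, d_ff)
W_d = np.random.randn(d_ff, d)
print(gated_ffn(x, W_g, W_u, W_d).shape)   # (16,)
```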

Part VI — The full decoder-only transformer

One transformer block

A simplified decoder block is:

$$ Y = X + \operatorname{Attention}(\operatorname{Norm}(X)), \qquad Z = Y + \operatorname{FFN}(\operatorname{Norm}(Y)). $$

So each block contains:

  1. normalization,
  2. causal self-attention,
  3. residual connection,
  4. normalization,
  5. feed-forward network,
  6. residual connection.

Stack \(L\) blocks:

$$ X^{(0)} \to X^{(1)} \to \cdots \to X^{(L)}. $$

At the end:

$$ H=X^{(L)}. $$

The final logits are:

$$ Z=HW_{\text{out}}. $$

If:

$$ H\in\mathbb{R}^{T\times d}, $$

and:

$$ W_{\text{out}}\in\mathbb{R}^{d\times V}, $$

then:

$$ Z\in\mathbb{R}^{T\times V}. $$

Each row gives logits for the next-token distribution.

Weight tying

Often the output matrix is tied to the input embedding matrix:

$$ W_{\text{out}}=E^\top. $$

If:

$$ E\in\mathbb{R}^{V\times d}, $$

then:

$$ E^\top\in\mathbb{R}^{d\times V}. $$

This reduces parameters and connects input and output token representations.

Part VII — Backpropagation and optimization

Parameters

Let all model parameters be collected into:

$$ \theta. $$

This includes:

  • token embeddings,
  • attention matrices,
  • feed-forward matrices,
  • normalization parameters,
  • output projection,
  • router parameters in MoE models,
  • adapter or LoRA parameters if fine-tuning.

Training minimizes:

$$ \mathcal{L}(\theta). $$

The mathematical goal is:

$$ \theta^\star {}={} \arg\min_\theta \mathcal{L}(\theta). $$

Gradient descent

The gradient is:

$$ \nabla_\theta \mathcal{L}(\theta). $$

Gradient descent updates:

$$ \theta_{k+1} {}={} \theta_k - \eta \nabla_\theta \mathcal{L}(\theta_k). $$

Here:

  • \(\eta\) is learning rate,
  • \(k\) is optimization step.

For LLMs, the loss is computed on minibatches.

So we use stochastic gradients:

$$ g_k {}={} \nabla_\theta \mathcal{L}_{\mathcal{B}_k}(\theta_k). $$

Backpropagation

Backpropagation applies the chain rule through the computational graph.

For a composition:

$$ y=f(g(h(x))), $$

the derivative is:

$$ \frac{dy}{dx} {}={} \frac{dy}{df} \frac{df}{dg} \frac{dg}{dh} \frac{dh}{dx}. $$

In matrix form, backpropagation repeatedly multiplies local Jacobians.

For a linear layer:

$$ Y=XW, $$

if:

$$ G_Y=\frac{\partial \mathcal{L}}{\partial Y}, $$

then:

$$ \frac{\partial \mathcal{L}}{\partial W} {}={} X^\top G_Y, $$

and:

$$ \frac{\partial \mathcal{L}}{\partial X} {}={} G_Y W^\top. $$

Stanford CS231n notes explicitly derive minibatch backpropagation for a linear layer \(Y=XW\). (CS231n)

Adam optimizer

Adam uses moving averages of gradients and squared gradients.

Given gradient:

$$ g_t=\nabla_\theta \mathcal{L}_t(\theta_t), $$

Adam computes:

$$ m_t=\beta_1m_{t-1}+(1-\beta_1)g_t, \qquad v_t=\beta_2v_{t-1}+(1-\beta_2)g_t^2. $$

Bias-corrected estimates are:

$$ \hat{m}_t=\frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t=\frac{v_t}{1-\beta_2^t}. $$

Then:

$$ \theta_{t+1} {}={} \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}. $$

Adam was introduced as a first-order stochastic optimization method using adaptive estimates of lower-order moments. (arXiv)

AdamW

Weight decay penalizes large weights.

A naive \(L^2\) penalty adds:

$$ \lambda \|\theta\|_2^2 $$

to the loss.

AdamW decouples weight decay from the adaptive gradient update.

A simplified AdamW update is:

$$ \theta_{t+1} {}={} \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon} - \eta\lambda\theta_t. $$

The AdamW paper shows that \(L^2\) regularization and weight decay are not equivalent for adaptive optimizers such as Adam, and proposes decoupled weight decay. (arXiv)
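
A minimal NumPy sketch of a single AdamW step, applied to a toy quadratic loss (hyperparameter values are illustrative defaults):

```python
import numpy as np

def adamw_step(theta, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, weight_decay=0.01):
    # One AdamW update: adaptive moments plus decoupled weight decay.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * weight_decay * theta
    return theta, m, v

theta = np.random.randn(5)
m, v = np.zeros(5), np.zeros(5)
for t in range(1, 101):
    g = 2 * theta                      # gradient of the toy loss ||theta||^2
    theta, m, v = adamw_step(theta, g, m, v, t)
print(theta)                           # driven toward zero
```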

Learning-rate schedules

LLM training often uses a schedule:

  1. warmup,
  2. decay,
  3. sometimes a final constant or cosine tail.

During warmup:

$$ \eta_t {}={} \eta_{\max}\frac{t}{T_{\text{warmup}}}. $$

A cosine decay may be:

$$ \eta_t {}={} \eta_{\min} + \frac{1}{2} (\eta_{\max}-\eta_{\min}) \left[ 1+\cos\left(\frac{\pi t}{T}\right) \right]. $$

The purpose is to avoid instability early and reduce noise later.
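
A sketch of one common warmup-plus-cosine schedule; all constants here are hypothetical:

```python
import numpy as np

def lr_at(t, T_warmup=1000, T_total=100_000, lr_max=3e-4, lr_min=3e-5):
    # Linear warmup followed by cosine decay down to lr_min.
    if t < T_warmup:
        return lr_max * t / T_warmup
    progress = (t - T_warmup) / (T_total - T_warmup)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + np.cos(np.pi * progress))

print(lr_at(500), lr_at(1000), lr_at(100_000))   # warming up, peak, near lr_min
```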

Gradient clipping

If gradients become too large, training may diverge.

Gradient norm clipping uses:

$$ g \leftarrow g\cdot \min \left( 1, \frac{c}{\|g\|_2} \right). $$

If:

$$ \|g\|_2\leq c, $$

nothing changes.

If:

$$ \|g\|_2>c, $$

the gradient is rescaled.
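
As a one-function NumPy sketch:

```python
import numpy as np

def clip_grad_norm(g, c):
    # Rescale g so that its L2 norm is at most c.
    norm = np.linalg.norm(g)
    return g * min(1.0, c / norm) if norm > 0 else g

g = np.random.randn(10) * 100.0
g_clipped = clip_grad_norm(g, c=1.0)
print(np.linalg.norm(g_clipped))   # at most 1.0
```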

Part VIII — Scaling laws and compute-optimal training

Why scaling laws matter

LLMs are expensive.

We must decide:

  • how many parameters,
  • how many training tokens,
  • how much compute,
  • how much data quality filtering,
  • how much context length,
  • how many experts,
  • how many training stages.

Scaling laws try to describe how loss changes as model size, data size, and compute change.

Kaplan et al. found empirical power-law relationships between cross-entropy loss and model size, dataset size, and training compute over large ranges. (arXiv)

Power-law form

A simplified scaling law is:

$$ L(N) {}={} L_\infty + aN^{-\alpha}. $$

Here:

  • \(L(N)\) is loss,
  • \(N\) is number of parameters,
  • \(L_\infty\) is irreducible loss,
  • \(a\) is a constant,
  • \(\alpha\) is scaling exponent.

Similarly, for dataset size \(D\):

$$ L(D) {}={} L_\infty + bD^{-\beta}. $$

For compute \(C\):

$$ L(C) {}={} L_\infty + cC^{-\gamma}. $$

The exact exponents are empirical.

The important idea is:

loss tends to improve predictably with scale, but with diminishing returns.

Compute-optimal training

A training run has limited compute.

Roughly:

$$ C \propto ND, $$

where:

  • \(N\) is parameter count,
  • \(D\) is number of training tokens.

If \(N\) is too large and \(D\) too small, the model is undertrained.

If \(D\) is large and \(N\) too small, the model may be capacity-limited.

The compute-optimal training paper found that model size and number of training tokens should be scaled approximately equally for compute-optimal training, challenging earlier trends of training very large models on relatively fewer tokens. (arXiv)

So the training design problem becomes:

$$ \min_{N,D} L(N,D) \quad \text{subject to} \quad C(N,D)\leq C_{\max}. $$

This is an optimization problem over architecture and data.

Part IX — Mixture-of-Experts mathematics

Dense versus sparse models

In a dense transformer, every token uses the same feed-forward parameters.

In a mixture-of-experts model, a token is routed to a subset of experts.

Let there be \(E\) experts:

$$ f_1,\dots,f_E. $$

A router produces scores:

$$ r(x)=W_rx. $$

Then probabilities:

$$ p(x)=\operatorname{softmax}(r(x)). $$

The model selects the top \(K\) experts.

A sparse MoE output may be:

$$ y {}={} \sum_{e\in \operatorname{TopK}(p(x))} p_e(x)f_e(x). $$

Switch Transformer simplified MoE routing by selecting different parameters for different incoming examples, producing sparsely activated models with large parameter counts but roughly controlled per-token compute. (arXiv)
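
A minimal NumPy sketch of top-\(K\) routing; the experts here are hypothetical random linear maps, purely for illustration:

```python
import numpy as np

def moe_layer(x, W_r, experts, K=2):
    # x: (d,). Route to the top-K experts by router probability,
    # then combine their outputs weighted by those probabilities.
    scores = x @ W_r                                  # (E,) router logits
    p = np.exp(scores - scores.max()); p /= p.sum()   # softmax over experts
    top = np.argsort(p)[-K:]                          # indices of top-K experts
    return sum(p[e] * experts[e](x) for e in top)

d, E = 8, 4
W_r = np.random.randn(d, E)
# Hypothetical experts: each a distinct random linear map.
expert_weights = [np.random.randn(d, d) for _ in range(E)]
experts = [lambda x, W=W: x @ W for W in expert_weights]
print(moe_layer(np.random.randn(d), W_r, experts).shape)   # (8,)
```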

Why MoE is mathematical

MoE introduces several mathematical problems:

  • discrete routing,
  • load balancing,
  • sparse computation,
  • expert specialization,
  • communication cost,
  • capacity constraints,
  • routing instability.

A load-balancing loss may encourage tokens to spread across experts.

If:

$$ f_e $$

is the fraction of tokens routed to expert \(e\), and:

$$ p_e $$

is the average router probability for expert \(e\), one balancing objective may penalize unevenness:

$$ \mathcal{L}_{\text{balance}} \propto E\sum_{e=1}^{E} f_ep_e. $$

Recent public MoE reports explore more specialized expert structures, shared experts, fine-grained experts, and efficient inference through attention and expert sparsity. (arXiv)

Part X — Efficient attention and inference mathematics

KV cache

During autoregressive inference, the model generates one token at a time.

At each layer, attention needs keys and values for previous tokens.

For previous sequence length \(t\), store:

$$ K_{1:t},V_{1:t}. $$

This is called the KV cache.

Without KV cache, the model would recompute keys and values for all previous tokens at every step.

With KV cache, each new token only computes its new key and value, then attends to cached previous keys and values.
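
A minimal NumPy sketch of one decoding step with a growing KV cache, for a single attention layer (toy sizes, hypothetical weights):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

d_k = 8
W_Q, W_K, W_V = (np.random.randn(d_k, d_k) for _ in range(3))
K_cache, V_cache = [], []

def decode_step(x_t):
    # Compute k, v only for the new token; reuse the cache for all earlier ones.
    q = x_t @ W_Q
    K_cache.append(x_t @ W_K)
    V_cache.append(x_t @ W_V)
    K, V = np.stack(K_cache), np.stack(V_cache)   # (t, d_k) each
    A = softmax(q @ K.T / np.sqrt(d_k))           # attend over all cached positions
    return A @ V

for _ in range(5):
    out = decode_step(np.random.randn(d_k))
print(out.shape, len(K_cache))   # (8,), 5 cached keys
```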

KV cache size

For one layer, approximate KV cache size is:

$$ 2 \times B \times T \times H_{\text{kv}} \times d_h. $$

Here:

  • factor 2 is for keys and values,
  • \(B\) is batch size,
  • \(T\) is context length,
  • \(H_{\text{kv}}\) is number of KV heads,
  • \(d_h\) is head dimension.

For \(L\) layers:

$$ \text{KV size} \propto 2LBTH_{\text{kv}}d_h. $$

So long-context inference is often memory-bandwidth limited.

Multi-query attention

Multi-query attention shares keys and values across query heads.

Instead of:

$$ H_Q=H_K=H_V, $$

MQA uses:

$$ H_K=H_V=1. $$

This reduces KV cache size.

The multi-query attention paper proposed sharing keys and values across heads to reduce memory bandwidth requirements during incremental decoding. (arXiv)

Grouped-query attention

Grouped-query attention is between MHA and MQA.

If there are \(H_Q\) query heads and \(H_{KV}\) KV heads, then:

$$ 1 < H_{KV} < H_Q. $$

GQA generalizes MQA by using an intermediate number of key-value heads, giving quality closer to multi-head attention and speed closer to MQA. (arXiv)

Low-rank KV compression

Some efficient inference methods compress keys and values through a latent representation.

A simplified low-rank idea is:

$$ C_t = x_t W_C, $$

where:

$$ C_t\in\mathbb{R}^{r}, \qquad r \ll d. $$

Then keys and values are reconstructed or projected from this compressed latent:

$$ K_t=C_tW_K', \qquad V_t=C_tW_V'. $$

This reduces cached state if \(r\) is much smaller than the full KV dimension.

Recent public reports describe multi-head latent attention using low-rank key-value joint compression to reduce KV cache while preserving performance. (arXiv)

FlashAttention mathematics

Standard attention forms:

$$ S=QK^\top, $$

then:

$$ A=\operatorname{softmax}(S), $$

then:

$$ O=AV. $$

The naive implementation materializes \(S\) and \(A\), both of size:

$$ T\times T. $$

FlashAttention computes exact attention using tiling and online softmax, avoiding storage of the full attention matrix.

The key mathematical issue is stable blockwise softmax.

For a row vector \(s\), softmax is:

$$ \operatorname{softmax}(s_i) {}={} \frac{e^{s_i}}{\sum_j e^{s_j}}. $$

For numerical stability:

$$ \operatorname{softmax}(s_i) {}={} \frac{e^{s_i-m}}{\sum_j e^{s_j-m}}, \quad m=\max_j s_j. $$

FlashAttention applies this idea blockwise while tracking running maxima and normalization factors.

The paper emphasizes IO complexity: reducing reads and writes between GPU high-bandwidth memory and on-chip SRAM. (arXiv)

Part XI — Sampling and decoding

Greedy decoding

At generation step \(t\), the model outputs:

$$ p_t=\operatorname{softmax}(z_t). $$

Greedy decoding chooses:

$$ x_{t+1}=\arg\max_i p_{t,i}. $$

This is deterministic.

It often produces stable but sometimes repetitive text.

Temperature sampling

Temperature modifies logits:

$$ p_i {}={} \frac{\exp(z_i/\tau)} {\sum_j \exp(z_j/\tau)}. $$

Here:

  • \(\tau<1\): sharper distribution,
  • \(\tau=1\): unchanged,
  • \(\tau>1\): flatter distribution.

Low temperature makes the model more deterministic.

High temperature makes it more random.

Top-k sampling

Top-k sampling keeps only the \(k\) most likely tokens.

Let:

$$ S_k {}={} \text{indices of top } k \text{ probabilities}. $$

Then:

$$ \tilde{p}_i {}={} \begin{cases} \frac{p_i}{\sum_{j\in S_k}p_j}, & i\in S_k,\\ 0, & i\notin S_k. \end{cases} $$

Then sample from \(\tilde{p}\).

Nucleus sampling

Nucleus sampling, or top-p sampling, chooses the smallest set \(S_p\) such that:

$$ \sum_{i\in S_p}p_i \geq p. $$

Then it samples from the renormalized distribution over \(S_p\).

This adapts the number of candidate tokens based on uncertainty.
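
The three sampling modifications compose naturally. A minimal NumPy sketch (the tie-handling in top-k and the exact nucleus cutoff are implementation choices, not prescribed by the definitions above):

```python
import numpy as np

def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=np.random):
    z = logits / temperature
    p = np.exp(z - z.max()); p /= p.sum()
    if top_k is not None:
        cutoff = np.sort(p)[-top_k]                  # k-th largest probability
        p = np.where(p >= cutoff, p, 0.0)            # ties may keep a few extra
    if top_p is not None:
        order = np.argsort(p)[::-1]                  # sort descending
        cum = np.cumsum(p[order])
        keep = (cum - p[order]) < top_p              # smallest set with mass >= top_p
        mask = np.zeros_like(p, dtype=bool)
        mask[order[keep]] = True
        p = np.where(mask, p, 0.0)
    p /= p.sum()                                     # renormalize the survivors
    return rng.choice(len(p), p=p)

logits = np.random.randn(10)
print(sample(logits, temperature=0.8, top_k=5))
print(sample(logits, top_p=0.9))
```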

Beam search

Beam search keeps \(K\) candidate sequences.

For a sequence:

$$ x_{1:T}, $$

score is usually:

$$ \log p_\theta(x_{1:T}) {}={} \sum_{t=1}^{T} \log p_\theta(x_t\mid x_{<t}). $$

Beam search expands candidates and keeps the best \(K\).

It is common in translation, but less central in open-ended dialogue because it can reduce diversity.

Speculative decoding

Speculative decoding uses a smaller draft model to propose multiple tokens, then uses the larger target model to verify them.

Let the draft distribution be:

$$ q(x) $$

and the target distribution be:

$$ p(x). $$

The draft proposes several tokens cheaply.

The target model evaluates them in parallel.

The algorithm accepts or rejects proposed tokens so that the final distribution matches the target model.

Speculative decoding was introduced as a way to make autoregressive transformer inference faster without changing the output distribution, by computing several tokens in parallel using an approximation model. (arXiv)

Part XII — Quantization mathematics

Why quantization matters

LLM inference is expensive because weights and KV caches are huge.

Quantization stores numbers with fewer bits.

For example:

  • FP16: 16 bits,
  • INT8: 8 bits,
  • INT4: 4 bits.

If a model has \(N\) weights, then storage is roughly:

$$ \text{memory} {}={} N \times \text{bits per weight}. $$

Reducing from 16 bits to 4 bits gives about:

$$ 4\times $$

smaller weight storage.

Uniform quantization

A real value \(w\) can be quantized as:

$$ q {}={} \operatorname{round} \left( \frac{w}{s} \right), $$

then reconstructed as:

$$ \hat{w}=sq. $$

Here \(s\) is a scale.

For a group of weights:

$$ s= \frac{\max |w|}{q_{\max}}. $$

The quantization error is:

$$ e=w-\hat{w}. $$

The objective is to keep this error small.
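
A minimal NumPy sketch of symmetric uniform quantization for one weight group (group size and bit width are illustrative):

```python
import numpy as np

def quantize_group(w, bits=4):
    # Symmetric uniform quantization of one weight group.
    q_max = 2 ** (bits - 1) - 1                # e.g. 7 for INT4
    s = np.abs(w).max() / q_max                # scale
    q = np.round(w / s).astype(np.int8)        # integer codes in [-q_max, q_max]
    return q, s

def dequantize(q, s):
    return s * q.astype(np.float32)

w = np.random.randn(64).astype(np.float32)
q, s = quantize_group(w, bits=4)
w_hat = dequantize(q, s)
print(np.abs(w - w_hat).max())   # quantization error, roughly bounded by s/2
```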

GPTQ-style second-order quantization

Quantization changes weights:

$$ W \to \hat{W}. $$

This changes the layer output:

$$ XW \to X\hat{W}. $$

A second-order approximation uses Hessian information to minimize output error.

GPTQ introduced a one-shot post-training quantization method for generative transformers based on approximate second-order information, reducing weights to 3 or 4 bits with small accuracy degradation in reported experiments. (arXiv)

Activation-aware quantization

Some weights matter more than others because they interact with large activations.

AWQ observes that a small fraction of salient weights can strongly affect quantization error, and uses activation statistics to identify important channels. (arXiv)

The mathematical idea is:

$$ \|XW-X\hat{W}\| $$

matters more than:

$$ \|W-\hat{W}\|. $$

So quantization should care about the input activation distribution.

Part XIII — Parameter-efficient fine-tuning

Full fine-tuning

Full fine-tuning updates all parameters:

$$ \theta \leftarrow \theta+\Delta\theta. $$

For very large models, this is expensive.

It requires optimizer states, gradients, and memory for all parameters.

LoRA

LoRA freezes the original weight matrix and learns a low-rank update.

For a weight matrix:

$$ W\in\mathbb{R}^{d_{\text{in}}\times d_{\text{out}}}, $$

instead of learning a full update:

$$ \Delta W, $$

LoRA learns:

$$ \Delta W=AB, $$

where:

$$ A\in\mathbb{R}^{d_{\text{in}}\times r}, \qquad B\in\mathbb{R}^{r\times d_{\text{out}}}, $$

and:

$$ r\ll \min(d_{\text{in}},d_{\text{out}}). $$

The adapted layer is:

$$ y=x(W+AB). $$

LoRA freezes pretrained weights and injects trainable low-rank matrices, greatly reducing trainable parameters and memory compared with full fine-tuning. (arXiv)
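
A minimal NumPy sketch of a LoRA-adapted linear layer; the zero initialization of \(B\) makes the adapted layer identical to the frozen layer at the start of fine-tuning:

```python
import numpy as np

d_in, d_out, r = 64, 64, 8
W = np.random.randn(d_in, d_out)          # frozen pretrained weight
A = np.random.randn(d_in, r) * 0.01       # trainable, small random init
B = np.zeros((r, d_out))                  # trainable, zero init: AB = 0 at start

def lora_forward(x):
    # Adapted layer: original path plus low-rank update x(AB).
    return x @ W + (x @ A) @ B

x = np.random.randn(d_in)
print(lora_forward(x).shape)              # (64,), equal to x @ W at initialization
```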

Part XIV — Alignment mathematics

Why pretraining is not enough

A pretrained LLM learns to predict text.

But predicting internet text is not the same as following user intent.

Instruction tuning and preference optimization try to modify behavior.

The InstructGPT/RLHF paper showed that fine-tuning with human feedback can make models better aligned with user intent, and that larger pretrained models are not automatically better at following instructions. (arXiv)

Supervised fine-tuning

Given instruction-response pairs:

$$ (x,y), $$

SFT minimizes:

$$ \mathcal{L}_{\text{SFT}} {}={} -\sum_t \log \pi_\theta(y_t\mid x,y_{<t}). $$

This is still cross-entropy.

The difference is the dataset: instead of arbitrary next-token text, the data contains desired responses.

Reward modeling

Suppose humans compare two responses:

$$ y^+ \succ y^-. $$

A reward model assigns scores:

$$ r_\phi(x,y). $$

A common preference loss is Bradley–Terry style:

$$ \mathcal{L}_{\text{RM}} {}={} -\log \sigma \left( r_\phi(x,y^+)-r_\phi(x,y^-) \right). $$

This trains the reward model to score preferred responses higher.

RLHF objective

A common RLHF objective is:

$$ \max_\theta \; \mathbb{E}_{y\sim \pi_\theta(\cdot\mid x)} \left[ r_\phi(x,y) \right] - \beta\, D_{\mathrm{KL}} \left( \pi_\theta(\cdot\mid x) \,\|\, \pi_{\text{ref}}(\cdot\mid x) \right). $$

The first term encourages high reward.

The KL term prevents the policy from drifting too far from a reference model.

PPO

PPO uses a clipped surrogate objective.

Let:

$$ \rho_t(\theta) {}={} \frac{\pi_\theta(a_t\mid s_t)} {\pi_{\theta_{\text{old}}}(a_t\mid s_t)}. $$

The PPO clipped objective is:

$$ \mathcal{L}^{\text{CLIP}}(\theta) {}={} \mathbb{E} \left[ \min \left( \rho_t(\theta)A_t, \operatorname{clip}(\rho_t(\theta),1-\epsilon,1+\epsilon)A_t \right) \right]. $$

PPO was introduced as a policy-gradient method that alternates between sampling and optimizing a clipped surrogate objective. (arXiv)

Direct Preference Optimization

DPO avoids explicitly training a reward model and running RL.

For a preferred response \(y^+\) and rejected response \(y^-\), DPO uses a loss of the form:

$$ \mathcal{L}_{\text{DPO}} {}={} -\log \sigma \left( \beta \left[ \log \frac{\pi_\theta(y^+\mid x)}{\pi_{\text{ref}}(y^+\mid x)} - \log \frac{\pi_\theta(y^-\mid x)}{\pi_{\text{ref}}(y^-\mid x)} \right] \right). $$

DPO shows that a preference optimization problem related to RLHF can be solved with a simple classification-style objective, without explicit reward modeling or reinforcement-learning sampling during fine-tuning. (arXiv)

Part XV — Reasoning-time and multi-token objectives

Test-time computation

Some modern systems improve answers by spending more computation during inference.

Mathematically, instead of producing one short continuation immediately, the model may sample, verify, revise, or search.

A general inference-time objective is:

$$ y^\star {}={} \arg\max_y S(x,y), $$

where \(S\) may combine:

  • model likelihood,
  • reward score,
  • verifier score,
  • constraint satisfaction,
  • tool result correctness,
  • self-consistency.

Self-consistency

Instead of one reasoning path, sample \(M\) paths:

$$ y^{(1)},\dots,y^{(M)}. $$

Then aggregate answers.

For classification-like answers:

$$ \hat{a} {}={} \operatorname{mode}\{a^{(1)},\dots,a^{(M)}\}. $$

This reduces variance when reasoning paths are noisy.
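
A minimal sketch of the majority vote over sampled answers (the answer strings are hypothetical):

```python
from collections import Counter

# Hypothetical final answers extracted from M sampled reasoning paths.
answers = ["42", "42", "41", "42", "40"]
majority = Counter(answers).most_common(1)[0][0]
print(majority)   # "42"
```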

Multi-token prediction

Standard language modeling predicts one next token:

$$ x_{t+1}. $$

Multi-token prediction asks the model to predict several future tokens:

$$ x_{t+1},x_{t+2},\dots,x_{t+n}. $$

A multi-token objective may be:

$$ \mathcal{L} {}={} -\sum_{k=1}^{n} \log p_\theta^{(k)}(x_{t+k}\mid x_{\leq t}). $$

The multi-token prediction paper argues that predicting multiple future tokens can improve sample efficiency and may improve downstream generative performance. (arXiv)

Reinforcement learning for reasoning

Recent public research has explored using reinforcement learning to improve reasoning behavior.

In a simplified setting, suppose the model outputs a solution \(y\) to a problem \(x\), and receives reward:

$$ R(x,y). $$

The objective is:

$$ \max_\theta \mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)} [ R(x,y) ]. $$

For mathematical problems, reward may be rule-based:

$$ R(x,y)= \begin{cases} 1, & \text{answer correct},\\ 0, & \text{answer incorrect}. \end{cases} $$

Recent reasoning-RL work reports that reasoning behaviors can be incentivized through large-scale RL, including self-reflection and verification patterns. (arXiv)

Part XVI — Distributed training mathematics

Why distributed training is needed

Large models do not fit on one GPU.

Training requires distributing:

  • data,
  • parameters,
  • gradients,
  • optimizer states,
  • activations,
  • attention context,
  • experts.

Data parallelism

Each GPU gets a different batch shard.

If the full batch is:

$$ \mathcal{B} {}={} \mathcal{B}_1\cup\cdots\cup\mathcal{B}_G, $$

then GPU \(g\) computes:

$$ g_g {}={} \nabla_\theta \mathcal{L}_{\mathcal{B}_g}(\theta). $$

Then gradients are averaged:

$$ g {}={} \frac{1}{G} \sum_{g=1}^{G}g_g. $$

All GPUs update the same parameters.

Tensor parallelism

A matrix multiplication:

$$ Y=XW $$

can be split across GPUs.

If:

$$ W=[W_1,W_2,\dots,W_G], $$

then:

$$ Y_g=XW_g. $$

Then concatenate:

$$ Y=[Y_1,Y_2,\dots,Y_G]. $$

Megatron-LM introduced efficient intra-layer model parallelism for training multi-billion-parameter transformer language models. (arXiv)

Pipeline parallelism

Layers are split across devices.

If:

$$ f=f_L\circ f_{L-1}\circ\cdots\circ f_1, $$

then different devices hold different blocks of layers.

Microbatches flow through the pipeline.

The mathematical issue is balancing computation and minimizing idle time.

ZeRO-style optimizer sharding

For Adam, memory includes:

  • parameters \(\theta\),
  • gradients \(g\),
  • first moment \(m\),
  • second moment \(v\).

ZeRO partitions optimizer states, gradients, and parameters across devices to reduce memory redundancy.

ZeRO was proposed to remove memory redundancy in data/model parallel training and scale trainable model size with the number of devices. (arXiv)

Part XVII — Alternative sequence models

Why alternatives exist

Standard attention has:

$$ O(T^2) $$

complexity.

For very long sequences, this is expensive.

Alternative sequence models try to reduce complexity to:

$$ O(T) $$

or:

$$ O(T\log T). $$

State-space models

A continuous-time state-space model has:

$$ \frac{dh(t)}{dt} {}={} Ah(t)+Bx(t), \qquad y(t)=Ch(t)+Dx(t). $$

Discretized:

$$ h_t=\bar{A}h_{t-1}+\bar{B}x_t, \qquad y_t=Ch_t+Dx_t. $$

Selective state-space models make some parameters depend on the input.

Mamba identifies input-dependent selective state-space parameters as important for discrete data such as language and proposes hardware-aware algorithms with linear scaling in sequence length. (arXiv)
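
A minimal NumPy sketch of the discretized recurrence with a scalar input stream; \(\bar{A}\), \(\bar{B}\), \(C\), \(D\) here are hypothetical fixed toy values rather than learned, input-dependent parameters:

```python
import numpy as np

def ssm_scan(x, A_bar, B_bar, C, D):
    # Linear recurrence h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t + D x_t.
    n = A_bar.shape[0]
    h = np.zeros(n)
    ys = []
    for x_t in x:                       # sequential form: O(T) in sequence length
        h = A_bar @ h + B_bar * x_t     # x_t is a scalar in this toy version
        ys.append(C @ h + D * x_t)
    return np.array(ys)

n, T = 4, 10
A_bar = 0.9 * np.eye(n)                 # stable toy state matrix
B_bar = np.ones(n); C = np.ones(n) / n; D = 0.5
y = ssm_scan(np.random.randn(T), A_bar, B_bar, C, D)
print(y.shape)                          # (10,)
```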

Recurrent-transformer hybrids

Some architectures try to combine:

  • transformer-like parallel training,
  • recurrent-like efficient inference.

RWKV proposes a model that can be formulated like a transformer during training and like an RNN during inference, with linear complexity. (arXiv)

Part XVIII — The complete mathematical map

The mathematics of LLMs includes:

| LLM component | Mathematics |
|---|---|
| Tokenization | discrete mathematics, compression, subword statistics |
| Embeddings | linear algebra, vector spaces |
| Language modeling | probability, conditional distributions, chain rule |
| Softmax | exponential families, normalization |
| Cross-entropy | information theory, maximum likelihood |
| Attention | matrix multiplication, bilinear forms, kernels |
| Multi-head attention | tensor algebra, block matrix operations |
| Positional encoding | Fourier features, rotations, relative geometry |
| Transformer block | residual dynamical systems, nonlinear maps |
| Normalization | statistics, scale invariance |
| Backpropagation | multivariable calculus, chain rule, Jacobians |
| Optimization | stochastic optimization, adaptive methods |
| Scaling laws | empirical power laws, constrained optimization |
| MoE | sparse routing, discrete optimization, load balancing |
| Long context | quadratic complexity, memory bandwidth, approximation |
| KV cache | tensor storage, incremental computation |
| Quantization | numerical analysis, approximation, information loss |
| LoRA | low-rank matrix factorization |
| RLHF | reinforcement learning, KL regularization |
| DPO | preference learning, logistic loss, implicit rewards |
| Distributed training | parallel algorithms, communication complexity |
| Reasoning-time compute | search, sampling, verification, policy optimization |

Section summary

A large language model is a probability model over token sequences.

The chain rule gives:

$$ p_\theta(x_1,\dots,x_T) {}={} \prod_{t=1}^{T} p_\theta(x_t\mid x_{<t}). $$

Training minimizes negative log-likelihood:

$$ \mathcal{L} {}={} -\sum_t \log p_\theta(x_t\mid x_{<t}). $$

Tokens are mapped into vectors using an embedding matrix:

$$ E\in\mathbb{R}^{V\times d}. $$

A sequence becomes:

$$ X\in\mathbb{R}^{T\times d}. $$

A transformer block applies attention:

$$ \operatorname{Attention}(Q,K,V) {}={} \operatorname{softmax} \left( \frac{QK^\top}{\sqrt{d_k}} \right)V. $$

Causal masking prevents future-token leakage.

Multi-head attention repeats attention in parallel subspaces.

Feed-forward networks add nonlinear feature transformation.

Backpropagation computes gradients.

AdamW updates parameters using adaptive moments and decoupled weight decay.

Scaling laws describe how loss changes with parameters, data, and compute.

MoE models activate only some experts per token.

Inference uses KV caching, sampling, decoding algorithms, quantization, and sometimes speculative decoding.

Alignment uses supervised fine-tuning, reward models, RLHF, PPO, DPO, and other preference-optimization methods.

The core idea is:

an LLM is a trained probabilistic dynamical computation over token tensors, optimized at massive scale to predict, transform, and generate language.

At the deepest level, LLM mathematics combines:

$$ \text{linear algebra} + \text{probability} + \text{information theory} + \text{optimization} + \text{numerical analysis} + \text{parallel computing} + \text{reinforcement learning}. $$

Source anchors used for this section

  • Transformer architecture, scaled dot-product attention, multi-head attention: Vaswani et al., Attention Is All You Need. (arXiv)
  • BPE subword tokenization: Sennrich, Haddow, and Birch. (arXiv)
  • SentencePiece tokenizer: Kudo and Richardson. (arXiv)
  • Cross-entropy and neural-network likelihood: Deep Learning book and Stanford CS224N material. (Deep Learning Book)
  • RoPE positional encoding: RoFormer. (arXiv)
  • RMSNorm: Zhang and Sennrich. (arXiv)
  • Scaling laws: Kaplan et al. (arXiv)
  • Compute-optimal training: Hoffmann et al. (arXiv)
  • FlashAttention and FlashAttention-2: Dao et al. (arXiv)
  • MoE and Switch Transformers: Fedus, Zoph, and Shazeer. (arXiv)
  • Recent MoE, MLA-style KV compression, and multi-token prediction reports: public technical reports and papers. (arXiv)
  • Adam and AdamW: Kingma & Ba; Loshchilov & Hutter. (arXiv)
  • ZeRO and tensor/model parallel training: Rajbhandari et al.; Shoeybi et al. (arXiv)
  • RLHF, PPO, DPO, and reasoning-RL literature: Ouyang et al.; Schulman et al.; Rafailov et al.; recent public RL reasoning papers. (arXiv)
  • Quantization and speculative decoding: GPTQ, AWQ, speculative decoding. (arXiv)
  • LoRA: Hu et al. (arXiv)
  • Efficient long-sequence alternatives: MQA, GQA, Mamba, RWKV. (arXiv)