Modern language models like Phi-3 or LLaMA are often treated as black boxes — you feed them text and get intelligent answers. But beneath that, they are nothing more than massive, structured matrices of numbers (parameters) performing linear algebra operations in sequence. To truly understand how these models think, we must follow the journey of an input token through every stage of computation — from text to logits — and observe how parameters shape meaning.
1. Tokenization: Text to Numbers
Every input string is first broken into tokens — discrete integer IDs mapped by a vocabulary.
Example:
“Hello world” → [15496, 995] (the exact IDs depend on the model’s tokenizer vocabulary)
Each token ID is an index into an embedding matrix E of shape (vocab_size, embedding_dim).
At the parameter level:
x₀ = E[token_id]
This vector x₀ (say 4096-dimensional) is the model’s numeric representation of the word.
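As a rough sketch (a made-up two-entry vocabulary, toy sizes, and random weights standing in for trained parameters), the lookup is nothing more than indexing a matrix:

```python
import numpy as np

# Toy sketch of the lookup step. The two-entry vocabulary and random weights
# are stand-ins; real tokenizers have tens of thousands of entries and trained embeddings.
vocab = {"Hello": 15496, " world": 995}          # a tokenizer maps subwords to integer IDs
vocab_size, embedding_dim = 50257, 64            # real models use embedding_dim ~ 4096

rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, embedding_dim)) # the embedding matrix E

token_ids = [vocab["Hello"], vocab[" world"]]    # "Hello world" -> [15496, 995]
x0 = E[token_ids]                                # x0 = E[token_id], one row per token
print(x0.shape)                                  # (2, 64)
```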
2. Embedding and Positional Encoding
Language models are sequential, so they need to know where each token occurs.
A positional encoding (learned or sinusoidal) is added to each embedding:
x₀ = E[token_id] + P[position]
- E = learned token embeddings
- P = learned positional embeddings
Both are stored in the model’s parameters and updated during training.
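A minimal NumPy sketch of this additive scheme, again with random stand-in parameters and toy sizes. Note that LLaMA and Phi-3 actually use rotary position embeddings (RoPE) applied inside attention; the additive form is shown here because it matches the formula above:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_seq_len, d_model = 50257, 16, 64   # toy sizes

E = rng.normal(size=(vocab_size, d_model))         # learned token embeddings
P = rng.normal(size=(max_seq_len, d_model))        # learned positional embeddings

token_ids = np.array([15496, 995])                 # "Hello world"
positions = np.arange(len(token_ids))              # [0, 1]

x0 = E[token_ids] + P[positions]                   # x0 = E[token_id] + P[position]
print(x0.shape)                                    # (2, 64)
```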
3. Transformer Layers: Parameterized Flow of Information
The input now flows through multiple identically structured Transformer blocks (e.g., 32–80 layers), each with its own parameters.
Each layer has two main parts:
- Multi-Head Self-Attention (MHSA)
- Feedforward Network (FFN)
Let’s zoom in to the parameter level.
(a) Self-Attention: The Dynamic Router
Each token embedding is linearly projected into three spaces using trainable matrices:
Q = xW_Q
K = xW_K
V = xW_V
Here:
- W_Q, W_K, W_V are parameter matrices (each of size d_model × d_head).
- These matrices are what the model learns in order to detect relationships between tokens.
The attention weights are computed as:
A = softmax(QKᵀ / √d_head)
This determines how much each token should attend to the others (in decoder models like Phi-3 and LLaMA, a causal mask ensures tokens attend only to earlier positions).
Then the weighted sum of values gives:
z = A × V
The combined attention output passes through another learned projection:
x₁ = zW_O
where W_O is the output projection matrix.
At this point, each token’s representation has mixed information from all other tokens, guided entirely by learned matrices W_Q, W_K, W_V, W_O.
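The following is a minimal single-head sketch of these equations in NumPy, with random stand-in weights; it omits the causal mask and the multi-head split-and-concatenate that real decoder models add:

```python
import numpy as np

def softmax(scores, axis=-1):
    scores = scores - scores.max(axis=axis, keepdims=True)  # for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 2, 64, 16       # toy sizes (real models: d_model ~ 4096, d_head ~ 128)

x = rng.normal(size=(seq_len, d_model))    # token representations entering the layer

# Trainable projection matrices (random stand-ins here)
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))
W_O = rng.normal(size=(d_head, d_model))

Q, K, V = x @ W_Q, x @ W_K, x @ W_V        # project into query / key / value spaces
A = softmax(Q @ K.T / np.sqrt(d_head))     # attention weights between tokens
z = A @ V                                  # weighted sum of value vectors
x1 = z @ W_O                               # output projection back to d_model
print(x1.shape)                            # (2, 64)
```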
4. Feedforward Network: Nonlinear Transformation
After attention, the model applies a two-layer MLP to each token independently:
h₁ = x₁W₁ + b₁
h₂ = GELU(h₁)
x₂ = h₂W₂ + b₂
- W₁ and W₂ are parameter matrices of large size (e.g., 4096 × 11008).
- This expands and then compresses the token’s hidden representation, enabling nonlinear mixing of semantic features.
Each layer then updates the representation through a residual connection:
x ← x + x₂
with layer normalization applied around each sub-layer (before it, in pre-norm models such as LLaMA and Phi-3). Residual connections ensure stable gradient flow and preserve information from earlier layers.
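A NumPy sketch of this two-layer MLP with random stand-in weights and toy sizes (LLaMA-family models actually use a gated SwiGLU variant; the plain GELU form here matches the equations above):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, common in transformer implementations
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 2, 64, 172        # toy sizes (real models: e.g. 4096 -> 11008 -> 4096)

x1 = rng.normal(size=(seq_len, d_model))   # attention output for each token

W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

h1 = x1 @ W1 + b1                          # h1 = x1 W1 + b1  (expand)
h2 = gelu(h1)                              # nonlinearity
x2 = h2 @ W2 + b2                          # x2 = h2 W2 + b2  (contract)

x = x1 + x2                                # residual connection
print(x.shape)                             # (2, 64)
```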
5. The Final Projection: Turning Thought into Words
After the last transformer block, we obtain a final hidden state h_final for each token.
To predict the next word, we project h_final back to vocabulary space. In weight-tied models this uses the transpose of the embedding matrix, Eᵀ; others (LLaMA among them) learn a separate output matrix of the same shape:
logits = h_final × Eᵀ
This gives one score per vocabulary token — the model’s belief in what comes next.
Applying softmax(logits) yields a probability distribution over all words.
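A sketch of the projection with random stand-in values and a toy hidden size:

```python
import numpy as np

def softmax(scores):
    scores = scores - scores.max()
    exp = np.exp(scores)
    return exp / exp.sum()

rng = np.random.default_rng(0)
vocab_size, d_model = 50257, 64            # toy hidden size

E = rng.normal(size=(vocab_size, d_model)) # embedding (or separate unembedding) matrix
h_final = rng.normal(size=d_model)         # final hidden state of the last position

logits = h_final @ E.T                     # one score per vocabulary token
probs = softmax(logits)                    # probability distribution over the vocabulary
print(logits.shape, probs.sum())           # (50257,) with probabilities summing to ~1.0
```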
6. Sampling: Converting Probabilities to Output
Finally, the model samples (or picks) the next token:
next_token = argmax(softmax(logits))
or via stochastic sampling (temperature, top-k, nucleus sampling).
This new token becomes input for the next iteration — recursively generating text.
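A toy sampler illustrating both modes, greedy argmax and temperature/top-k sampling; the function name and defaults are illustrative, not any library's API:

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=50, rng=None):
    """Greedy decoding when temperature is 0, otherwise temperature + top-k sampling."""
    rng = rng or np.random.default_rng()
    if temperature == 0.0:
        return int(np.argmax(logits))             # greedy: pick the highest-scoring token
    scaled = logits / temperature                 # temperature reshapes the distribution
    top = np.argsort(scaled)[-top_k:]             # keep only the k most likely tokens
    probs = np.exp(scaled[top] - scaled[top].max())
    probs /= probs.sum()                          # renormalize over the truncated set
    return int(rng.choice(top, p=probs))          # stochastic draw

logits = np.random.default_rng(0).normal(size=50257)
print(sample_next_token(logits, temperature=0.0)) # deterministic
print(sample_next_token(logits, temperature=0.8)) # varies from run to run
```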
7. Where the “Intelligence” Lives
Every “understanding” or “reasoning” capability of the model is encoded in the millions or billions of numbers inside:
- W_Q, W_K, W_V, W_O
- W₁, W₂
- E and P
Each parameter fine-tunes how inputs mix, how attention flows, and how representations evolve.
At scale, these matrices form a distributed semantic memory — not rules, but high-dimensional geometry learned from data.
8. Summary of the Flow
| Stage | Operation | Parameters | Output |
|-------|-----------|------------|--------|
| 1 | Tokenization | Vocabulary | Token IDs |
| 2 | Embedding | E, P | Token vectors |
| 3 | Attention | W_Q, W_K, W_V, W_O | Contextual features |
| 4 | FFN | W₁, W₂ | Transformed semantics |
| 5 | Output | Eᵀ | Next-token logits |
Closing Thought
Understanding a model like Phi-3 or LLaMA at the parameter level reveals a simple but profound truth: these “intelligent” systems are deterministic numerical pipelines. The complexity and creativity we perceive are emergent properties of large-scale optimization in these matrices — a symphony of dot products and nonlinearities that together simulate reasoning.
In essence:
A language model doesn’t “know” words — it shapes probability landscapes where meaning naturally emerges through matrix multiplication.