Large Language Models (LLMs) like GPT-3 and GPT-4 have revolutionized how we interact with AI - powering tools that can reason, write, code, and converse fluently. But under the hood, these models are built on simple mathematical foundations: linear algebra, calculus, and probability theory.
In this project, we build GPTOSS entirely from first principles - using only Python, with no deep learning frameworks like PyTorch or TensorFlow. We rely entirely on the CPU, so no special hardware is needed.
By the end of this post, you will understand GPTOSS completely: how to implement every single part of it, and why every architectural component is there. As of this writing, GPTOSS was released two months ago and is very close to state-of-the-art closed-source models in terms of performance.
#Overview
The GPTOSS architecture processes text through a sequence of stages.
Here’s the high-level flow:
Tokenization - Converts raw text into numerical token IDs.
Embedding - Maps each token ID to a continuous vector representation.
Loop N Times:
- Attention Layer
  - RMSNorm
  - QKV Extraction
  - Positional Encodings (RoPE)
  - Attention Mechanism
  - Residual Layer
- MLP
  - RMSNorm
  - Gating Layer (Expert Selection)
  - Mixture of Experts
  - Residual Layer
RMSNorm - Final normalization before output projection.
Unembedding - Projects hidden states back to the vocabulary space for prediction.
In this post, we will walk through each of these components step by step.
To make things concrete, we’ll feed the model a simple sentence:
I am Joe
and trace how the third token, Joe, moves through the different layers of GPTOSS - from raw text to its final output.
#Tokenizer
When running an LLM, the first step is to convert text into numbers that can be processed.
We aim to create a mapping between words and numbers. One simple way to do this is by looping through the English dictionary and assigning each word a unique number from 0 up to the dictionary size. We call that number a token. However, this approach has several problems:
- Rigidity: One word always corresponds to one number. Consider the words "try" and "trying." The "ing" suffix can be added to almost any verb in English, so it is inefficient to assign a separate token to every verb's "-ing" form rather than simply giving "ing" its own token.
- Out-of-dictionary words: What if there are words not in the dictionary - such as made-up terms or words from other languages? We could fall back to encoding each character as a token, but this would be inefficient for frequently repeating words that lie outside the official dictionary.
Solution: A tokenizer.
The most common tokenizer algorithm in LLMs is Byte Pair Encoding (BPE), which takes text as input and converts it into a list of tokens.
The BPE algorithm optimizes for frequency-if a sequence of characters or a word appears frequently, the tokenizer assigns a token for it.
At a basic level, every character could be assigned a token, and we could encode all English text using just 26 tokens. However, this would defeat the purpose of tokenization. Instead, we aim to create tokens for as many commonly used words as possible without exceeding the vocabulary size.
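To get a feel for how such a vocabulary is grown, here is a minimal sketch of a single BPE training step (a toy illustration, not the actual GPTOSS tokenizer): count adjacent symbol pairs and merge the most frequent one into a new token.

```python
from collections import Counter

def bpe_merge_step(corpus):
    """One toy BPE training step on a corpus given as lists of symbols."""
    # Count every adjacent pair of symbols
    pair_counts = Counter()
    for seq in corpus:
        for a, b in zip(seq, seq[1:]):
            pair_counts[(a, b)] += 1
    if not pair_counts:
        return corpus, None
    best = max(pair_counts, key=pair_counts.get)  # most frequent pair
    # Merge that pair everywhere into a single new symbol
    merged_corpus = []
    for seq in corpus:
        merged, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                merged.append(seq[i] + seq[i + 1])
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        merged_corpus.append(merged)
    return merged_corpus, best

corpus = [list("trying"), list("crying"), list("try")]
corpus, merge = bpe_merge_step(corpus)
print(merge)  # ('r', 'y') - "ry" is the most frequent adjacent pair in this tiny corpus
```

Repeating this step until the vocabulary reaches the target size is, at its core, how BPE builds its token dictionary.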
The tokenizer performs three main functions: train, encode, and decode.
- Train: Processes the entire training corpus to determine the optimal way to divide text into tokens, then forms a dictionary that maps text fragments to numbers. The tokenizer dictionary for GPTOSS (and other open models) is included as part of their released weights.
- Encode: Takes a string as input and returns a list of tokens (numbers) based on the mapping formed during training.
- Decode: The reverse of encoding-takes a list of tokens and reconstructs the original text.
pythontext = "I am Joe"
tokens = tokenizer.encode(text) # [40, 939, 177941]
assert text == tokenizer.decode(tokens)
The exact implementation of tokenizers is beyond the scope of this tutorial, but you can learn more from this excellent video:
Let's Build the GPT Tokenizer - YouTube
Try it here: Tokenizer - OpenAI API
Pro Tip
The tokenizer may include spaces as part of words if it determines that doing so is more efficient.
Limitations of the BPE Tokenizer
- Lacks character-level awareness, causing LLMs to struggle with counting characters.
- The same word can be tokenized differently based on context (capitalization or surrounding spaces).
- Numbers are tokenized unpredictably (no consistent digit-by-digit pattern).
- Language bias: English and other common languages receive efficient tokenization (fewer tokens per word), while non-Latin scripts or rare languages are heavily fragmented (many tokens per word).
#Embedding Layer
The embedding layer is a lookup table - a matrix of size vocabulary_size x hidden_size, where every row corresponds to a token in the tokenizer mapping (hidden_size = 2880 in GPTOSS 20B). It takes each token ID (index) output by the tokenizer and assigns it a vector.
The output vector, or embedding vector, represents the semantic meaning of the input token (in the eyes of the LLM, of course). The numbers inside that vector may appear completely random - and they are, in an absolute sense. However, there are two very important geometric properties of that vector:
- Semantic proximity: Similar words are close in distance to each other within the vector space.
- Directional meaning: Vector arithmetic captures conceptual relationships.
The classic example: king - man + woman ≈ queen.
So, even though these numbers look random, when comparing two embedding vectors of different words, you can determine how conceptually close they are.
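As a rough illustration of "conceptually close", here is how such a comparison is usually done - cosine similarity between two vectors (the 4-dimensional vectors below are made up for the example, not real GPTOSS embeddings):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings" (illustrative only)
cat = [0.9, 0.1, 0.3, 0.0]
dog = [0.8, 0.2, 0.4, 0.1]
car = [0.0, 0.9, 0.1, 0.8]

print(cosine_similarity(cat, dog))  # ~0.98: close in the vector space
print(cosine_similarity(cat, car))  # ~0.10: far apart
```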
You can think of having a vector of dimension N, where each number describes a particular concept. In the case of embeddings, meanings are encoded in that vector - such as whether something is a living thing or an object, its gender, its color, or its odor, among others. However, each number does not directly correspond to a single concept. Instead, the combination of numbers collectively encodes all of these concepts. (This is a feature, not a bug, since it enables the vector to encode far more concepts than it could with a one-to-one encoding. A side effect is that these vectors are difficult to interpret; research is being conducted to address interpretability, such as Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet.)
This concept of seemingly random numbers that actually encode meaning is the core of how LLMs-and deep learning in general-work.
Pro Tip:
The embedding weights are trainable parameters: during training, the lookup-table values update like any other weights in the neural network. Although LLMs are trained on a next-token prediction objective, the embedding weights adjust so that similar words end up closer in the embedding space and dissimilar words farther apart.
In our example, when we feed the output from the tokenizer to the embedding layer
```python
# tokens = [40, 939, 177941]  (output from the tokenizer)
joe_embedding = embedding(tokens[2])  # embed the "Joe" token
# joe_embedding shape = (2880,)
```
We obtain the embedding of that token - a vector of shape (2880,) in the case of GPTOSS 20B. This embedding then serves as the input to the next layer, the attention layer. However, before we dive into the attention mechanism, let's first discuss some important building blocks that lead up to it.
Pro Tip: This vector will move through a series of attention and MoE layers throughout the network, being enriched or transformed at each step, until it is unembedded back into the vocabulary, as we will see. We will therefore call it the Hidden State Vector; it has size hidden_size = 2880 in GPTOSS 20B (papers also call this the model dimension, d_model, but we will stick with hidden_size in this blog). Attention layers enrich the vector with context from previous tokens, while MoE layers enrich it with information learned during the training stage, outside the context.
#Linear Layer (Matmul and Bias)
Now we get to implement the first layer! This is the fundamental building block of a neural network. At its core, a neural network is simply a series of linear layers (interleaved with nonlinearities).
A linear layer consists of two basic operations: matrix multiplication (between a matrix and a vector) and element-wise bias addition.
Let's go through each part.
Have you seen this drawing before?
This is what a layer of a neural network looks like. The two blue circles on the left are the input neurons and represent the input vector; the three neurons on the right represent the output.
In that case, we can write it in the form:

$y = Wx + b$

The weight matrix $W$ must be of shape (out, in), where out is the dimension/length of the output vector and in is the dimension/length of the input vector. (For example, if we have a neural network that determines whether an animal in a picture is a cat, dog, or mouse, the network could output three values corresponding to each animal - these are the logits.)
Note that, in this notation, all arrows emerging from a single input neuron correspond to one column in the weight matrix.
Let's implement it:
```python
def linear(x, W):
    """
    W of shape (m, n)
    x of shape (n,)
    out of shape (m,)
    """
    m = len(W)
    if m == 0:
        return []
    n = len(W[0])
    out = []
    for i in range(m):          # for every row in W
        acc = 0
        for k in range(n):      # for every element in that row
            a = W[i][k]
            b = x[k]
            acc += a * b        # multiply by the corresponding element in x and accumulate
        out.append(acc)         # the final accumulation is the output element y[i]
    return out
```
Additionally, some LLMs like GPTOSS include a bias term, which simply adds a bias vector - of shape (3,) in our three-output example - element-wise to the output of the above operation to produce the final output.
Let's add the bias:
```python
def linear(x, W, bias):
    """
    W of shape (m, n)
    x of shape (n,)
    bias of shape (m,)
    out of shape (m,)
    """
    m = len(W)
    if m == 0:
        return []
    n = len(W[0])
    out = []
    for i in range(m):          # for every row in W
        acc = 0
        for k in range(n):      # for every element in that row
            a = W[i][k]
            b = x[k]
            acc += a * b        # multiply by the corresponding element in x and accumulate
        acc += bias[i]          # add the bias once per output element
        out.append(acc)
    return out
```
One last detail: Python floats are 64-bit (float64). However, in this context, we are dealing with BF16 weights, biases, and inputs.
We first need to explain the BFloat16 format.
#BFloat16
The first thing you will notice after downloading the weights is that they are stored in a format called BFloat16, or BF16 for short. This format is designed specifically for deep neural networks: 1 bit is for the sign, 8 bits are reserved for the exponent, and the remaining 7 bits are for the mantissa (fraction).
- Exponent bits affect the dynamic range - the more bits we allocate for the exponent, the larger the dynamic range (the maximum and minimum representable numbers).
- A larger mantissa allows for greater precision in representing numbers.
Note: the standard float16 format has 5 bits for the exponent and 10 bits for the mantissa.
For example, to represent a number like 12.345 in scientific notation, we can write:

$12.345 = 12345 \times 10^{-3}$

Therefore, we have 12345 as the mantissa and $-3$ as the exponent. Binary floating-point formats apply the same idea in base 2:
Sign: 0
Mantissa: 100010110111
Exponent: 011
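To make the layout concrete, here is a small sketch that pulls the sign, exponent, and mantissa fields out of a BF16 value by reinterpreting its float32 bit pattern (BF16 is simply the top 16 bits of float32); the helper name is mine, not part of the GPTOSS code:

```python
import struct

def bf16_fields(x):
    # Reinterpret the float32 bits of x and keep the top 16 (the BF16 pattern)
    bits = struct.unpack('<I', struct.pack('<f', float(x)))[0] >> 16
    sign     = (bits >> 15) & 0x1
    exponent = (bits >> 7) & 0xFF   # 8 exponent bits (biased by 127)
    mantissa = bits & 0x7F          # 7 mantissa bits
    return sign, exponent, mantissa

print(bf16_fields(12.345))  # (0, 130, 69): exponent 130 means 2**(130 - 127) = 8
```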
Why BF16?
Why do we use BF16 instead of Float32 or Float64?
The answer is simple: BF16 requires half the memory of Float32 and a quarter of the memory of Float64.
It’s important to note that in neural networks, we could use BF16 to store the weights (in memory and on disk). However, when performing a series of multiplications and additions, we should store intermediate results in Float32 (at least for this implementation). This is because:
- Performing calculations in BF16 results in a catastrophic loss of precision, causing the output to deviate significantly from the ground truth.
- We can safely store intermediate results in Float32 without a major memory cost. (Intermediate activations typically consume less memory than weights when performing inference, though during training they can be significant.)
- Storing the final result in BF16 does not significantly degrade performance compared to Float32, as the additional precision provided by Float32 is often unnecessary for neural networks.
As a rule of thumb, we typically load weights in BF16, perform multiply and add operations using Float32 for intermediate results, and then store the final output back in BF16.
To achieve this, we implemented the following functions to simulate rounding between BF16 and F32, since Python defaults to Float64 precision. These functions mimic the precision loss that occurs in real computations:
```python
import struct

def f32(x):
    # Round a Python float (float64) to float32 precision
    return struct.unpack('<f', struct.pack('<f', float(x)))[0]

def bf16(x):
    # Round a Python float to bfloat16 precision (round-to-nearest-even on the top 16 bits)
    f32_bits = struct.unpack('<I', struct.pack('<f', float(x)))[0]
    lsb = (f32_bits >> 16) & 1
    rounding_bias = 0x7FFF + lsb
    bf16_bits = (f32_bits + rounding_bias) & 0xFFFF0000
    return struct.unpack('<f', struct.pack('<I', bf16_bits))[0]
```
It’s important to note that even after applying these functions, the number will remain in Float64 (8 bytes in memory). These functions simply truncate the extra precision that arises from performing operations in Float64.
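A quick check of what these helpers do (the values in the comments are what Python prints on a typical machine; BF16 keeps only about 2-3 significant decimal digits):

```python
print(f32(3.14159265358979))   # 3.1415927410125732  (float32 rounding)
print(bf16(3.14159265358979))  # 3.140625            (only 7 mantissa bits survive)
print(bf16(1.0 + 1e-3))        # 1.0                 (the small increment is lost entirely)
```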
Now, returning to our Linear layer, we need to do two things:
1. Perform multiplications and additions in Float32, storing intermediate results in Float32, as in the official GPTOSS PyTorch implementation.
2. Store the final result in BF16.
Here’s the updated implementation:
```python
def linear(x, W, bias):
    """
    W of shape (m, n)
    x of shape (n,)
    bias of shape (m,)
    out of shape (m,)
    """
    m = len(W)
    if m == 0:
        return []
    n = len(W[0])
    out = []
    for i in range(m):
        acc = f32(0.0)                  # ---> F32
        for k in range(n):
            a = W[i][k]
            b = x[k]
            acc += f32(a * b)           # ---> F32
        acc = f32(acc + f32(bias[i]))   # ---> F32
        out.append(bf16(acc))           # ---> BF16
    return out
```
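As a quick sanity check with made-up numbers (2 inputs, 3 outputs, matching the drawing above), note how the BF16 rounding slightly perturbs the exact results:

```python
W = [[1.0, 2.0],    # 3 rows (outputs) x 2 columns (inputs)
     [0.5, -1.0],
     [0.0, 3.0]]
bias = [0.1, 0.2, 0.3]
x = [2.0, 1.0]

print(linear(x, W, bias))
# [4.09375, 0.2001953125, 3.296875] after BF16 rounding (exact math: [4.1, 0.2, 3.3])
```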
It’s important to note that PyTorch uses libraries to perform matrix multiplication. On CPUs, it uses oneDNN. Since these library kernels are optimized, we cannot guarantee that the order of multiplication and addition will match exactly with our implementation.
Pro Tip: Computers violate the associative property of addition - changing the order in which floating-point numbers are added can change the output. (LLMs suffer from non-deterministic output precisely because of this issue; it has recently been addressed: https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/)

```python
(0.1 + 1e20) - 1e20   # 0.0
0.1 + (1e20 - 1e20)   # 0.1
```

To address this, I also provide Python bindings for the Torch C++ API in the GitHub repository, to reproduce the precise numerical values of the official implementation.
#Softmax
Next, let's implement the Softmax function, which is used in both the attention mechanism and the mixture of experts.
In deep learning, Softmax is typically used when we have a set of values that we want to convert into a probability distribution - essentially, to make a choice among them.
The Softmax function is defined as follows:

$$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$

To understand this equation, imagine having a sequence of numbers and dividing each number by the sum of all numbers to determine how much each one "weighs" relative to the total.
For example, for $[1, 2, 3]$, the number 3 accounts for half of the total, since $1 + 2 + 3 = 6$ and $3/6 = 1/2$.
Softmax does the same but applies this weighting to the exponentials of the numbers.
A naïve implementation of Softmax might look like this:
```python
import math

def sum_of_exp(arr):
    # accumulate the exponential of every element
    accumulation = 0
    for item in arr:
        accumulation += math.exp(item)
    return accumulation

def softmax(arr):
    out = []
    for item in arr:
        out.append(math.exp(item) / sum_of_exp(arr))
    return out
```
However, this implementation suffers from numerical instability.
When the input values are very large, $e^{x}$ can easily exceed the maximum representable floating-point number (around $1.8 \times 10^{308}$ for float64 - $e^{x}$ already overflows once $x$ exceeds roughly 709), resulting in inf or NaN values.
Conversely, when the input values are very negative, $e^{x}$ can underflow to zero. If all exponentials underflow, this leads to a division by zero ($0/0$).
The solution is to subtract the maximum element from all elements before exponentiation. This works because:
- The largest exponent becomes $e^{0} = 1$, avoiding overflow.
- At least one term in the denominator is ≥ 1, preventing division by extremely small numbers.
- The result is mathematically equivalent due to the property:

$$\frac{e^{x_i - m}}{\sum_j e^{x_j - m}} = \frac{e^{x_i}}{\sum_j e^{x_j}}, \quad m = \max_j x_j$$

In other words, since the relative distances between elements remain unchanged, the absolute values don't matter - the output of Softmax stays the same.
Here’s a numerically stable and more Pythonic implementation:
```python
import math

def softmax(arr):
    max_val = max(arr)
    exps = [math.exp(val - max_val) for val in arr]
    sum_exps = sum(exps)
    return [val / sum_exps for val in exps]
```
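Comparing the two versions on large inputs makes the difference obvious (the naive one would overflow, the stable one does not):

```python
print(softmax([1000.0, 1001.0, 1002.0]))
# ≈ [0.0900, 0.2447, 0.6652]
# math.exp(1000) alone raises OverflowError, so the naive version cannot handle these inputs
```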
There’s one remaining piece to complete our implementation - it needs to mimic Torch’s BF16 behavior. We can achieve this using the rounding functions we wrote earlier:
```python
import math

def softmax(arr):
    max_val = f32(max(arr))                               # --> f32
    exps = [math.exp(f32(val) - max_val) for val in arr]  # --> f32
    sum_exps = sum(exps)
    return [bf16(val / sum_exps) for val in exps]         # --> bf16
```
Why Softmax?
Softmax is used to assign weights to elements in an array, producing a probability distribution that represents how likely each element is to be chosen.
The Softmax function has several desirable properties:
- Normalization: All output elements sum to one, giving a valid probability distribution.
- Differentiability: Unlike hard selections (e.g., argmax), Softmax is smooth and differentiable everywhere. This is crucial for gradient-based learning, allowing the model to learn from all classes rather than only the top prediction.
- Mathematical convenience: In classification tasks using cross-entropy loss, the combination of Softmax + log + cross-entropy simplifies mathematically, resulting in stable and efficient gradients during training.
Why not just use linear normalization?
Linear normalization divides each element by the sum of all elements, but it has limitations:
- It cannot handle negative numbers.
- It doesn’t scale differences exponentially.
```python
def norm(arr):
    return [elem / sum(arr) for elem in arr]

norm([2, 1])
# [0.66, 0.33]
norm([20, 10])
# [0.66, 0.33]

softmax([2, 1])
# [0.73, 0.26]
softmax([20, 10])
# [0.9999, 0.00004]
```
This property of Softmax - amplifying differences - is desirable.
To illustrate, imagine we’re playing a football match. If I score 2 goals and you score 1, the game was close; if we play again, you might win or draw (though I’m slightly more likely to win).
However, if I score 20 goals and you score 10, the gap is so large that I’m overwhelmingly more likely to keep winning future matches.
Pro Tip:
The Softmax function amplifies differences between values - a property that’s often exactly what we want.
#RMSNorm
Now we will discuss the Normalization layer. This layer operates by taking the hidden state vector and normalizing it, giving us a vector of the same length but with normalized values.
Normalization, in general, is the process of taking a set of numbers and adjusting them to a common scale or distribution. For instance, we could normalize a vector with values ranging from 0 to 100 into the range 0 to 1 by dividing every element by 100 (the range).
Another way to perform normalization is LayerNorm, which is done by calculating the mean and standard deviation across the vector and then subtracting the mean and dividing by the standard deviation, so that all values are centered around 0 and mostly between -1 and 1.
Why do we have a normalization layer?
A major issue that layer norms address is vanishing and exploding gradients, where learning signals during training sometimes compound exponentially as they propagate backward through the network weights. Other times, they may vanish if the multiplication involves numbers less than 1. Normalization helps keep activation values stable. However, layer normalization does not completely solve this problem - residual layers, which we will discuss later, are more crucial for that.
Also, normalization reduces dependence on initialization. Without layer normalization, poor weight initialization can cause immediate gradient problems. Layer normalization makes the network less sensitive to the scale of initialization.
Lastly, Normalization smooths the loss landscape; small changes in weights cause more predictable changes in loss. This often allows higher learning rates and faster convergence during training.
In this architecture (and in most LLMs), RMSNorm is used as the normalization layer. RMSNorm provides some benefits over LayerNorm:
- Calculates the root mean square only, rather than both mean and variance, reducing computational overhead by roughly 10-30%, depending on hardware and implementation, while achieving similar performance in practice.
- RMSNorm’s approach shows that normalizing by the RMS provides sufficient regularization for training stability.
- Fewer parameters (no bias term).
For an input vector $x$ of dimension $n$, RMSNorm computes:

$$\text{RMSNorm}(x) = \frac{x}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} x_i^2 + \epsilon}} \odot \gamma$$

Where:
- $\gamma$ is a learnable scale parameter (gain) for each dimension
- $\odot$ is element-wise multiplication
- $\epsilon$ is a small constant (e.g., $10^{-5}$) for numerical stability (to prevent division by 0)

Unlike LayerNorm, RMSNorm does not subtract the mean - it only normalizes by magnitude. This means the normalized outputs are not zero-centered, though this generally does not degrade performance in LLMs.
Code Implementation:
```python
import math

def RMSNorm(x, weights, epsilon=1e-5):
    avg = sum([f32(item) ** 2 for item in x]) / len(x)
    inv_sqrt = 1.0 / math.sqrt(avg + epsilon)
    out = []
    for i in range(len(x)):
        normalized_val = x[i] * inv_sqrt
        weighted_val = f32(normalized_val) * f32(weights[i])
        out.append(weighted_val)
    return [bf16(val) for val in out]
```
Or a more pythonic way to do it:
```python
import math

def rmsnorm(x, weights, epsilon=1e-5):
    avg = sum(f32(item) ** 2 for item in x) / len(x)
    inv_sqrt = 1.0 / math.sqrt(avg + epsilon)
    out = [f32(x[i] * inv_sqrt) * f32(weights[i]) for i in range(len(x))]
    return [bf16(val) for val in out]
```
Why not just use the mean? Why use the root mean square?
For example, for the vector [-5, -3, 3, 5], the mean is 0 - and that is clearly not what we want as a measure of scale. We need a measure that reflects the magnitude of the values, which RMS provides (see the quick check below).
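A quick numeric check of that example:

```python
import math

x = [-5, -3, 3, 5]
mean = sum(x) / len(x)
rms = math.sqrt(sum(v * v for v in x) / len(x))
print(mean)  # 0.0    - says nothing about how large the values are
print(rms)   # ~4.123 - reflects the typical magnitude of the values
```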
Why do we have a learnable parameter $\gamma$?
RMS normalization forces every layer's output to have unit RMS, but this might not be optimal for the network. The learnable $\gamma$ allows the network to undo or adjust the normalization.
TL;DR: The normalization layer adjusts the values of the hidden state vector to stabilize training in deep neural networks.
#Attention
Now, we arrive at one of the two main components in the architecture - the attention layer, which serves as the mechanism for transferring information between tokens.
The attention layer involves the following steps:
- RMSNorm
- Extract QKV
- Positional Encodings (RoPE), discussed later
- Attention mechanism
- Residual Layer
Pro Tip:
Note that in the first layer, the input to the attention - the Hidden State Vector - is only the embedding vector. Thus, it contains information about the current token only.
In later layers, the Hidden State Vector becomes a combination of multiple tokens and processing from the MLP layer. As we progress through the layers, this vector encodes increasingly abstract and high-level concepts.
How does it work?
Let’s go back to our example where we input I am Joe
, and we are currently processing the Joe
token (assuming the first two tokens have already been processed).
We want this token to “see” the previous tokens, extract information from them, and combine it into itself.
First, we extract three vectors for the current token - Query, Key, and Value - using a linear layer.
In other words, we apply a linear transformation, similar to the one discussed before. The weights of this layer are trainable (updated through backpropagation), and their role is to learn how to extract the best QKV representations given a hidden vector.
Pro Tip:
Each attention layer has its own QKV weights and biases. The first layer operates at the token level (lowest level), while later layers work with higher-level abstract concepts, as words are combined across layers.
For example, GPTOSS 20B has 24 layers.
Next, the output vector from the linear layer is divided into multiple smaller vectors, each forming a separate QKV triplet fed into one attention head.
The Key-Value pairs calculated are stored in the KV-cache for reuse in later tokens (at the corresponding attention layers). Key and Value pairs from previous tokens are retrieved to perform the attention calculation for the current token.
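A minimal sketch of the KV-cache idea (the names and structure here are illustrative, not the exact layout used in this post's code later on):

```python
num_layers = 24  # GPTOSS 20B

# One list of (key, value) entries per layer, appended to as tokens are processed
kv_cache = [[] for _ in range(num_layers)]

def cache_and_fetch(k_vec, v_vec, layer_idx):
    """Store the current token's K/V for this layer and return every cached pair so far."""
    kv_cache[layer_idx].append((k_vec, v_vec))
    return kv_cache[layer_idx]   # attention for the current token loops over these
```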
Pro Tip:
The reason we divide QKV into multiple heads is to allow each head to specialize in a particular aspect.
One head might focus on grammar, another on semantic meaning, and so on.
Heads do not communicate with each other; their calculations happen independently.
A single head would force all these different aspects to be compressed into one attention distribution.
Next, we perform a dot product between the query of the current token and the keys of all previous tokens (including the current one).
Q, K pairs that are similar produce a higher dot product.
The output of each dot product is divided by the square root of head_dim (this scaling prevents the dot products from becoming too large, which would cause the softmax function to produce extremely sharp probability distributions - near 0 or 1 - making gradients vanishingly small and hindering training) and then passed through a Softmax function to produce a probability distribution.
The value vectors of all tokens are weighted by this probability distribution and summed together to form the output of that attention head.
Pro Tip:
The precise reason we divide by √head_dim is as follows: when you compute a dot product between two vectors of dimension d, where each element is drawn from a distribution with variance 1, the resulting dot product is the sum of d multiplied terms. By properties of variance:
- The variance of a sum of independent random variables is the sum of their variances, so the dot product has variance proportional to d.
- The standard deviation is therefore proportional to √d.
Dividing by √head_dim scales the scores back to a constant range regardless of the head dimension (see the quick numerical check below).
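You can verify this numerically with random vectors (a rough experiment, not the model's actual activations):

```python
import random, math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

random.seed(0)
for d in [16, 64, 256]:
    samples = [dot([random.gauss(0, 1) for _ in range(d)],
                   [random.gauss(0, 1) for _ in range(d)]) for _ in range(2000)]
    mean = sum(samples) / len(samples)
    std = math.sqrt(sum((s - mean) ** 2 for s in samples) / len(samples))
    print(d, round(std, 1), round(math.sqrt(d), 1))  # the std of the dot product tracks sqrt(d)
```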
The outputs from all heads are then concatenated into one large vector.
This concatenated vector is projected back to size hidden_size, then added (element-wise) to the original input of the attention layer - the residual connection, which we will discuss shortly.
Pro Tip:
The kind of attention we explained above - where tokens can only attend to previous tokens - is called Causal Attention (also called Masked Attention or Autoregressive Attention). There are other types of attention that are beyond the scope of this blog post:
- Bidirectional attention (used in BERT): tokens can attend to all other tokens in both directions.
- Prefix attention: some tokens can attend bidirectionally while others use causal attention.
- Local attention: tokens only attend to a fixed window of nearby tokens.
- Cross-attention: tokens in one sequence attend to tokens in another sequence.
Why does this work?
You can think of the Query vector as representing what the current token is looking for in other tokens - it encodes the type of information the token seeks.
The Key vector represents what each token offers or markets to other tokens about itself.
If a Key vector is similar to a Query vector, their dot product (and thus the attention score) will be high, meaning the querying token will “attend” to that token.
Finally, the attention scores are used to weight the Value vectors of all tokens, which are summed together to form the output of that token’s attention.
So again, you can think of the attention output for the current token as the weighted sum of the value vectors of all previous tokens, weighted by the attention scores.
It's important to note that information flows forward only in LLMs - previous tokens cannot attend to future tokens and therefore cannot change their attention weights or values. This is what makes it causal attention.
Example Attention Implementation
```python
def attention_head(q, k_cache, v_cache, sinks, head_idx, layer_idx, pos):
    activations = [0] * (pos + 1)
    # QK dot product
    for token_idx in range(pos + 1):
        for j in range(config.head_dim):
            qvk_indx = (head_idx * config.head_dim) + j
            activations[token_idx] += k_cache[token_idx][layer_idx][qvk_indx] * q[qvk_indx]
        # Divide by sqrt(head_dim)
        activations[token_idx] /= math.sqrt(config.head_dim)
    # Softmax
    activations = softmax(activations)
    # Weighted sum of the values
    out = bf.Array([0] * config.head_dim)
    for token_idx in range(pos + 1):
        for j in range(config.head_dim):
            qvk_indx = (head_idx * config.head_dim) + j
            out[j] += activations[token_idx] * v_cache[token_idx][layer_idx][qvk_indx]
    return out
```
#Grouped Attention
Instead of having a Query vector for every Key and Value vector, GPTOSS uses grouped attention, where there are eight times as many Query vectors as there are Key and Value vectors.
For example, in the 20B model, after applying the QKV linear layer we get a vector of length 5120: the first 4096 values are the Queries for 64 heads (each of length 64), the next 512 are the Keys for 8 heads (also of length 64 each), and the final 512 are the Values for those same 8 heads.
The first Key/Value head serves the first 8 Query heads, the second serves the next 8, and so on.
Why?
This design reduces the parameter count and memory footprint (of K and V) with minimal impact on performance, as long as the ratio of K & V heads to Q heads is not extreme (it's 1:8 in this case).
Implementation-wise, we simply ensure that the first Key and Value vectors correspond to the first 8 Query vectors, the second to the next 8, and so forth - so that 8 Key-Value pairs match 64 Query vectors.
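The query-head to KV-head mapping reduces to simple integer arithmetic; with GPTOSS 20B's numbers (64 query heads, 8 KV heads):

```python
num_attention_heads = 64
num_key_value_heads = 8

for head_idx in [0, 7, 8, 15, 63]:
    kv_head = head_idx * num_key_value_heads // num_attention_heads
    print(head_idx, "->", kv_head)   # heads 0-7 -> 0, heads 8-15 -> 1, ..., heads 56-63 -> 7
```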
Updated implementation:
```python
def grouped_attention_head(q, k_cache, v_cache, sinks, head_idx, layer_idx, pos):
    activations = [0] * (pos + 1)
    for token_idx in range(pos + 1):
        for j in range(config.head_dim):
            q_indx = (head_idx * config.head_dim) + j
            # grouped attention: map this query head to its key/value head
            current_kv_head = head_idx * config.num_key_value_heads // config.num_attention_heads
            kv_indx = (config.head_dim * current_kv_head) + j
            activations[token_idx] += k_cache[token_idx][layer_idx][kv_indx] * q[q_indx]
        activations[token_idx] /= math.sqrt(config.head_dim)
    activations = softmax(activations)
    out = bf.Array([0] * config.head_dim)
    for token_idx in range(pos + 1):
        for j in range(config.head_dim):
            current_kv_head = head_idx * config.num_key_value_heads // config.num_attention_heads
            kv_indx = (config.head_dim * current_kv_head) + j
            out[j] += activations[token_idx] * v_cache[token_idx][layer_idx][kv_indx]
    return out
```
#Sliding Window Attention
GPTOSS also implements a sliding window mechanism.
For even-numbered layers, the current token attends to at most the previous 128 tokens, which improves efficiency with minimal performance loss.
Implementation:
```python
# inside the attention head function...
sliding_window = layer_idx % 2 == 0
start = max(0, pos - config.sliding_window) if sliding_window else 0
for i in range(start, pos + 1):
    # rest of the code doesn't change...
```
#Attention Sink
An additional important component is the attention sink, a trained value appended to the activations before Softmax.
It provides a stable location for excess attention probability that doesn’t need to be assigned to any real token.
Here’s why it’s useful:
- In transformer architectures, the attention weights over all tokens must sum to 1 after Softmax. Without an attention sink, the model is forced to distribute attention across actual tokens, even if none are relevant.
- The attention sink acts as a "dummy token" or "garbage collector" that absorbs leftover attention mass, allowing the model to ignore irrelevant inputs without distorting the attention distribution.
- It improves training stability, especially early on, by preventing attention scores from collapsing or over-focusing on random tokens.
- Over time, the model learns to assign attention to the sink when no useful context exists, giving it a “none of the above” option.
- This mechanism also helps regularize the model and improve gradient flow, since it provides a consistent, low-variance target during optimization.
```python
# same code...
activations.append(sinks[layer_idx][head_idx])
activations = softmax(activations)
# same code...
```
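To see the effect, compare the attention weights with and without a sink logit appended (toy numbers; the sink value here is hypothetical, and its share is simply discarded when the value vectors are summed, as in the final implementation below):

```python
scores = [0.2, 0.1, 0.3]          # QK scores for three real tokens, all weakly relevant
print(softmax(scores))            # ≈ [0.332, 0.301, 0.367] - all the mass is forced onto real tokens

sink = 4.0                        # a (hypothetical) learned sink value
print(softmax(scores + [sink]))   # ≈ [0.021, 0.019, 0.023, 0.937] - the sink absorbs most of the mass
```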
Final Implementation with f32 and bf16 Conversions
Finally, we add the f32 and bf16 conversions in the appropriate places.
Here’s what the final implementation looks like:
```python
def attention_head(q, k_cache, v_cache, layer_sink, head_idx, layer_idx, pos, config):
    sliding_window = layer_idx % 2 == 0
    start = max(0, pos - config.sliding_window) if sliding_window else 0
    current_kv_head = head_idx * config.num_key_value_heads // config.num_attention_heads

    # Positions outside the sliding window stay at -inf so they get zero weight after softmax
    activations = [float('-inf')] * (pos + 1)
    for i in range(start, pos + 1):
        score = 0.0
        for j in range(config.head_dim):
            q_idx = (head_idx * config.head_dim) + j
            kv_idx = (config.head_dim * current_kv_head) + j
            score += f32(q[q_idx]) * f32(k_cache[i][layer_idx][kv_idx])
        score *= (1.0 / math.sqrt(config.head_dim))
        activations[i] = score

    # Append the attention sink logit before softmax (pure list ops, no bf.Array)
    activations.append(layer_sink[head_idx])
    activations = softmax(activations)

    out = [0] * config.head_dim
    for i in range(start, pos + 1):
        attention_weight = activations[i]
        for j in range(config.head_dim):
            kv_idx = (config.head_dim * current_kv_head) + j
            v_val = f32(v_cache[i][layer_idx][kv_idx])
            out[j] += f32(attention_weight * v_val)
    return [val for val in out]
```
#Why Attention Works
Before transformers became the dominant architecture, Recurrent Neural Networks (RNNs) were the standard for modeling sequential data such as text, audio, and time series. Both RNNs and Transformers aim to capture dependencies between tokens, but they do so in fundamentally different ways.
#RNNs vs Attention
An RNN processes input sequentially, one token at a time.
At each time step $t$, it takes the current token embedding $x_t$ and the latent (hidden) state from the previous step $h_{t-1}$, and produces a new latent state $h_t$:

$$h_t = \phi(W_x x_t + W_h h_{t-1} + b)$$

Here, $\phi$ is typically a nonlinearity such as tanh or ReLU.
This can be viewed as a simple linear layer that takes both the current input and the previous latent state as input, and outputs the next latent state.
This design means that information flows step by step through the sequence, with the latent vector acting as a memory that carries accumulated context.
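A minimal sketch of one recurrent step with toy weights (all names and sizes below are made up for illustration):

```python
import math

def matvec(W, x):
    # Plain matrix-vector product (no BF16 bookkeeping, just for illustration)
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rnn_step(x_t, h_prev, Wx, Wh, b):
    """One recurrent update: h_t = tanh(Wx x_t + Wh h_prev + b)."""
    pre = [a + c + d for a, c, d in zip(matvec(Wx, x_t), matvec(Wh, h_prev), b)]
    return [math.tanh(v) for v in pre]

# Toy dimensions: input size 2, hidden size 3
Wx = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.1]]
Wh = [[0.5, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.5]]
b  = [0.0, 0.0, 0.0]

h = [0.0, 0.0, 0.0]
for x_t in ([1.0, 0.0], [0.0, 1.0]):   # the sequence must be processed one token at a time
    h = rnn_step(x_t, h, Wx, Wh, b)
print(h)
```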
Pros:
- Naturally models sequential dependencies - each token depends on all previous ones.
- Compact architecture with relatively few parameters.
Cons:
- Sequential bottleneck: Tokens must be processed one at a time - no parallelization across sequence length during training or inference.
- Vanishing/exploding gradients: Information from distant tokens often fades or explodes, making long-range context difficult to learn.
- Limited scalability: Increasing model depth or sequence length greatly slows computation.
RNNs rely on sequential information flow - each step depends on the last - making them inherently slow and limited in long-term memory.
Additionally, a token cannot easily access information from tokens far in the past, since all previous tokens are compressed into a single latent state, making it difficult to retain or extract detailed information about earlier context.
RNNs also suffer from the vanishing and exploding gradient problem during training, which further limits their ability to learn long-range dependencies.
Transformers, on the other hand, allow every token to directly perform calculations with any previous token through the attention mechanism.
Even though transformers require more computation - with $O(n^2)$ complexity in the sequence length (because each token attends to all tokens before it) compared to an RNN's $O(n)$ - their parallelism and scalability make them vastly more efficient in practice.
As a result, attention-based architectures can process much larger datasets, learn richer and more hierarchical representations, and scale up to billions of parameters - capabilities that traditional RNNs were never designed to handle.
#Residual Layer
A residual layer is simply implemented by storing the input vector (for example, to the attention component) in a temporary variable and then adding it element-wise to the output of that component.
It is implemented as follows:
```python
h_x = list(x)        # keep a copy of the input; x: (hidden_size,)
f_x = Attention(x)

# Residual
for i in range(len(f_x)):
    h_x[i] += f_x[i]

x = h_x
f_x = MLP(x)

# Residual
for i in range(len(f_x)):
    h_x[i] += f_x[i]
```
Why is it needed?
Without a residual connection, the layer must explicitly learn to preserve the useful parts of $x$.
If the ideal output is close to the input (for example, the identity mapping $f(x) \approx x$), then the model must learn weights $W$ and bias $b$ such that:

$$Wx + b \approx x$$

For that to happen, $W$ must end up close to the identity matrix and $b \approx 0$. That's not easy for gradient descent to discover - especially deep inside a long stack of nonlinear transformations.
So the layer must encode "copy the input" as part of its parameters, which takes effort.
With a residual layer, we instead compute:

$$y = x + f(x)$$

Now the function $f$ only needs to learn the difference - the small shift, rotation, or enhancement - not the full mapping.
If no change is needed, $f$ can simply learn to output zeros, and the input will automatically flow through. That's an easy solution: $W \approx 0,\ b \approx 0$.
So the skip connection gives the network a direct path for preserving $x$.
This is also related to a more important reason the residual layer is crucial: it allows gradients to flow without vanishing or exploding. During backpropagation, the gradient reaching early layers is a product of many per-layer terms:
- If those terms are less than 1, the gradients get smaller and smaller → vanishing gradient.
- If they're greater than 1, the gradients get larger and larger → exploding gradient.
As a result:
- Early layers in deep networks receive almost no gradient signal (they stop learning).
- Or the gradients blow up and destabilize training.
This is why older networks without residuals couldn’t go very deep - training would break down beyond a few dozen layers.
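A tiny numerical illustration of this, assuming (hypothetically) that each layer on its own only passes a per-layer gradient factor of 0.01:

```python
depth = 30
f_prime = 0.01                          # each layer's own contribution to the gradient

grad_plain = f_prime ** depth           # stacked y = f(x):     product of f'(x) terms
grad_residual = (1 + f_prime) ** depth  # stacked y = x + f(x): product of (1 + f') terms

print(grad_plain)      # 1e-60  - the signal has vanished
print(grad_residual)   # ~1.35  - the identity path keeps the gradient near 1
```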
#MLP
This is the other crucial component of the architecture, where most of the weights exist and most computations happen. This layer carries most of the knowledge and skill of the LLM. At the base level, this layer is a series of linear layers, but two critical ideas were introduced:
- Gated Linear Unit
- Mixture of Experts
We will discuss both in the upcoming sections, but first, let’s review how a typical MLP works.
First, the MLP layer contains an initial RMSNorm step - nothing unfamiliar here. (This pre-norm arrangement is used in GPTOSS and modern LLMs; it's worth noting that the original Transformer architecture used post-norm, with the norm applied after the residual.) Next, our hidden state vector is processed by two Linear layers: the first layer contains a weight matrix that is taller than it is wide, so that the output vector is larger. This is called the Upward Projection. The second layer contains a weight matrix that is wider than it is tall, performing the Downward Projection.
The upward projection allows the MLP to project our vector into a higher-dimensional space, enabling it to extract more features. The downward projection back to the original dimension acts as a form of regularization - the network must compress the transformed information, keeping only the most important features. (A very relevant 3Blue1Brown video that explains this idea: How might LLMs store facts.)
#Mixture of Experts
A common trick in modern MLPs is to not have just a single set of linear layers, but multiple sets of weights to choose from, based on how the hidden state vector looks. These sets of weights are called experts, and they are selected based on another linear layer called the gating layer.
After the gating layer gives us a score for how relevant each expert is, we choose the top k (4 in GPTOSS) and compute those k linear layers (instead of just one) in parallel, then take a weighted sum of their outputs (weighted by the Softmax of the gating layer's output) to get the final vector.
Let’s break down that process further.
MLP1 Layer
- MLP1_weight: shape = (# of experts, intermediate_size, hidden_size). In GPTOSS 20B, it is (32, 5760, 2880).
- MLP1_bias: shape = (# of experts, intermediate_size). In GPTOSS 20B, it is (32, 5760).

After selecting the experts (4 in GPTOSS), we get four weight matrices of shape (5760, 2880) and four (5760,) bias vectors. For each expert, we apply a linear layer to our initial hidden state vector of shape hidden_size = (2880,) to get a vector of length 5760 (upward projection).
Then we apply the SwiGLU layer, which we will discuss shortly.
After that, we take the output from SwiGLU (size (2880,)) and perform a similar process with the second MLP layer (MLP2), whose output size is hidden_size, returning us to the hidden state dimension (downward projection).
The MoE idea is useful because we can have many experts (32 in this example) but only a few are activated, which means:
- MoEs are computationally efficient.
- Experts specialize in tasks, so for example:
- In coding, expert 0 may be chosen more often.
- In writing, expert 1 might be chosen.
This specialization increases the network’s capacity for diverse tasks without significantly increasing computation during training or inference.
Here’s the implementation:
```python
# Gating
h = RMSNorm(x, self.mlp_norm[l])  # (hidden_size,)
gate = linear(h.to_list(), self.mlp_gate_weight[l], self.mlp_gate_bias[l], out_in_f32=True)  # (32,) = (32, 2880) x (2880,)
largest_4_experts = sorted(enumerate(gate), key=lambda x: x[1], reverse=True)[:4]
# the softmaxed gate logits become the percentages/weights of the experts to activate
largest_4_experts_weights = softmax([expert_logit for _, expert_logit in largest_4_experts])

# MoE
moe_out = [0] * self.config.hidden_size  # (hidden_size,)
expert_outputs = []
for j, (expert_indx, _) in enumerate(largest_4_experts):
    w1_out = linear(h, self.mlp1_weight[l][expert_indx]) + self.mlp1_bias[l][expert_indx]
    SwiGLU_out = SwiGLU(w1_out, self.config)
    expert_output = linear(SwiGLU_out, self.mlp2_weight[l][expert_indx]) + self.mlp2_bias[l][expert_indx]
    expert_outputs.append(expert_output.to_list())

# Weighted sum of the selected experts' outputs (weights from the softmaxed gate scores)
for j, expert_output in enumerate(expert_outputs):
    for i in range(self.config.hidden_size):
        moe_out[i] += largest_4_experts_weights[j] * expert_output[i]
```
Pro Tip
Perform the top-k selection in high precision to ensure accurate selections, since reduced precision may cause incorrect expert selection if values become equal.
Pro Tip
During the MLP layer, tokens do not communicate with each other - computation happens independently.
#Gated Linear Unit
SwiGLU
There are several types of GLUs, but here we focus on the most common variant - SwiGLU. It works by splitting the input vector into two parts:
- The first part acts as a gate: it is passed through a sigmoid and multiplied element-wise with itself (the SiLU/Swish activation).
- The second part is multiplied element-wise by the gate produced in the first step.
Instead of just having feed-forward networks, Gated Linear Units (GLUs) are used for two reasons:
- Fewer weights.
- Higher performance.
Why higher performance? Because of the extra multiplicative interaction, which increases expressiveness by adding more interaction between the weights. (Similar to the way a deeper neural network outperforms a flat one with the same number of weights, due to richer multiplicative interactions.)
```python
def swiglu(x, config, alpha=1.702):
    result = bf.Array([0] * config.hidden_size)
    for i in range(config.intermediate_size):
        x_glu = x[i * 2]
        x_linear = x[i * 2 + 1]
        x_glu = min(config.swiglu_limit, x_glu)
        x_linear = min(config.swiglu_limit, max(-config.swiglu_limit, x_linear))
        gate = x_glu * bf16(1 / (1 + math.exp(bf16(-f32(alpha) * x_glu))))  # SiLU
        result[i] = bf16(gate) * bf16(x_linear + 1)
    return result
```
#Unembedding
The Attention + MoE stack repeats for layers (24 in GPTOSS 20B). Afterward, the output hidden state enters a final linear layer called the unembedding layer (or output projection layer). Its job is to map the hidden representation back into the vocabulary space, effectively reversing the token embedding process from the input.
$$\text{logits} = W_{\text{unembed}} \, h$$

where:
- $h$ = final hidden state for the token being processed, of shape (hidden_size,).
- $W_{\text{unembed}}$ = unembedding weight matrix of shape (vocab_size, hidden_size).
The result is a set of logits - raw scores for each token in the vocabulary - representing how likely each token is to come next.
These logits are then passed through a softmax function to get a probability distribution over all possible next tokens.
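A sketch of this final step, reusing the linear() and softmax() helpers defined earlier (toy sizes, made-up weights; zeros are passed for the bias argument purely because our linear() helper expects one):

```python
# Toy numbers: hidden_size = 4, vocab_size = 5
hidden_state = [0.5, -1.0, 0.25, 2.0]
W_unembed = [[0.1, 0.0, 0.2, -0.1],   # one row per vocabulary token
             [0.3, 0.1, 0.0,  0.4],
             [-0.2, 0.5, 0.1, 0.0],
             [0.0, 0.0, 0.0,  1.0],
             [0.2, -0.3, 0.1, 0.1]]
zero_bias = [0.0] * 5

logits = linear(hidden_state, W_unembed, zero_bias)  # (vocab_size,) raw scores
probs = softmax(logits)                              # probability of each candidate next token
next_token_id = max(range(len(probs)), key=probs.__getitem__)  # greedy pick of the most likely token
```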
Pro Tip
The next token is predicted by unembedding the last token only and computing the probability distribution over all tokens.
Pro Tip
Processing occurs in one forward direction - previous tokens are not re-updated based on future tokens. This enables adding tokens without recalculating everything.
#Sampling
Sampling is the process of turning these logits - the raw scores for each token - into an actual token choice.
Step 1: Convert logits to probabilities
The logits are first normalized using the softmax function, which turns them into a probability distribution over the entire vocabulary:

$$p_i = \frac{e^{z_i / T}}{\sum_j e^{z_j / T}}$$

where:
- $p_i$ is the probability of token $i$ (with logit $z_i$).
- $T$ is the temperature parameter.
The temperature controls how deterministic or random the sampling is:
- Low T (< 1.0): Makes the distribution sharper → the model becomes more confident and deterministic.
- High T (> 1.0): Makes the distribution flatter → introduces more randomness.
- T = 1.0: Default, leaves the logits unchanged.
| Temperature | Effect | Example Behavior |
|---|---|---|
| 0.7 | Balanced | Creative but coherent |
| 1.0 | Neutral | Natural sampling |
| 1.5 | Chaotic | More diverse outputs |
Step 2: Filtering methods
Before sampling, the model can optionally filter unlikely tokens to avoid nonsense or extremely rare words.
Common strategies:
- Top-k sampling: Keep only the top k highest-probability tokens and renormalize.
- Top-p (nucleus) sampling: Keep the smallest set of tokens whose cumulative probability ≥ p (e.g., 0.9), then renormalize.
- Typical sampling: Select tokens based on how “typical” they are relative to the distribution’s entropy.
| Method | Parameter | Description |
|---|---|---|
| Top-k | k = 50 | Keeps only the top 50 tokens |
| Top-p | p = 0.9 | Keeps tokens with 90% cumulative probability |
| Typical | τ | Keeps tokens near average surprise (entropy) |
Step 3: Token selection
After filtering, the model samples one token based on the resulting probability distribution - usually using random selection weighted by probability.
Alternatively, for greedy decoding, the model simply picks the token with the highest probability (argmax), but this often leads to repetitive or dull text.
The chosen token is then:
- Appended to the current sequence.
- Fed back into the model as the next input.
- The process repeats - generating one token at a time until an end-of-sequence (EOS) token or a max length is reached.
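Here is a compact sketch of the whole sampling step (temperature, then top-k filtering, then a weighted random draw); the function name and default parameters are illustrative, not GPTOSS defaults:

```python
import math, random

def sample_next_token(logits, temperature=1.0, top_k=50):
    # 1) Temperature: divide the logits by T before softmax
    scaled = [l / temperature for l in logits]
    # 2) Softmax (numerically stable)
    m = max(scaled)
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 3) Top-k filtering: zero out everything outside the k most likely tokens, renormalize
    top = set(sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k])
    kept = [p if i in top else 0.0 for i, p in enumerate(probs)]
    norm = sum(kept)
    kept = [p / norm for p in kept]
    # 4) Weighted random draw
    return random.choices(range(len(kept)), weights=kept, k=1)[0]

random.seed(0)
print(sample_next_token([2.0, 1.0, 0.1, -1.0], temperature=0.7, top_k=3))
```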
#RoPE
- Transformers on their own are unaware of token position, so we need a way to encode where each token sits in the sequence.
- Positional encodings solve this by adding a vector that represents the position to the token's embedding vector.
- Encoding positions with sine/cosine functions has the benefit of giving every position a unique, smooth signature.
- Cons:
  - The network has to do the heavy lifting to recover relative positioning from absolute positions.
  - It does not extrapolate: the context length cannot be increased beyond what was seen in training.
- Relative encodings address these cons by encoding the distance between tokens instead of their absolute positions.

How does RoPE work? Some background first:
Absolute Encoding:
- Adds a vector of sinusoidal (or trained - similar performance) embeddings to the token's embedding vector.
- Good because it is efficient.
- Cons:
  - If the sequence is longer than what we trained on, we cannot extrapolate.
  - Rigid structure: cannot easily handle variable-length sequences or dynamic positioning.

Relative Encoding:
- Modifies every query and key together so that the attention score depends on the distance between tokens.
- Good because you can extrapolate, and far-away words naturally attend less.
- Bad because it is slow: every query and key has to change as positions shift, which also complicates caching.
RoPE:
- Instead of adding a separate embedding, rotate the vector itself, so that position becomes a property of the vector.
- Efficient.
- Allows extrapolation.
- Words that are far from each other get lower attention values (the contribution decays with distance).
- Early dimensions oscillate at low frequency (large wavelengths) and capture global features, since the phase shift between nearby tokens is minimal.
- The frequency (theta) of a dimension does not change with position; only the (m - n) phase difference does. Phase wrap-around may happen, but only across some dimensions, not all - since every dimension has a different frequency, they won't all wrap at once.
- In the 2D case, each pair of vector components is rotated by an angle proportional to the token's position (see the numerical check after the equations below).
- When query and key are multiplied, the rotations cancel symmetrically in such a way that the score depends only on the relative position.
- At low frequencies the vector rotates more slowly, so it is harder for the model to detect differences between nearby rotations.
- The rotation is applied to both query and key, so if two tokens are close, their rotated vectors remain similar in embedding space; if they are far apart, even parts that were close in embedding space shift away from each other.
- Low and high frequencies stay in phase for close tokens, while only the low frequencies stay (roughly) in phase for farther tokens.
- Long-distance comparisons do degrade the fidelity of semantic meaning.
For two vectors, $q \cdot k = \|q\|\,\|k\|\cos(\theta)$, where $\theta$ is the angle between them.
For a single frequency component rotating at rate $\theta_i$ per position:
- Token at position $m$: rotated by angle $m\,\theta_i$
- Token at position $n$: rotated by angle $n\,\theta_i$
- Angular difference: $(m - n)\,\theta_i$

When you compute $q^{\top} k$, the dot product in those rotated dimensions becomes a function of $\cos\big((m - n)\,\theta_i\big)$.
This directly encodes the relative distance in the magnitude of the dot product contribution.
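The 2D case is easy to check numerically: rotate a query by its position m and a key by its position n, and the dot product depends only on m - n (a standalone check with made-up vectors, not GPTOSS code):

```python
import math

def rotate(vec, angle):
    x, y = vec
    return [x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle)]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

q, k = [1.0, 0.3], [0.8, -0.5]
theta = 0.1

# Same relative distance (m - n = 4) at two different absolute positions:
print(dot(rotate(q, 10 * theta), rotate(k, 6 * theta)))     # ≈ 0.3105
print(dot(rotate(q, 104 * theta), rotate(k, 100 * theta)))  # same value: only m - n matters
```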
```python
def ROPE(q, k, pos):
    for h in range(config.num_attention_heads):
        for i in range(0, config.head_dim, 2):
            theta = math.pow(config.rope_theta, -2 * i / config.head_dim)
            co = math.cos(pos * theta)
            si = math.sin(pos * theta)
            j = h * config.head_dim + i  # index to help us access the flat array
            q_real = q[j]
            q_imag = q[j + 1]
            q[j] = q_real * co - q_imag * si
            q[j + 1] = q_real * si + q_imag * co
            if h < config.num_key_value_heads:
                k_real = k[j]
                k_imag = k[j + 1]
                k[j] = k_real * co - k_imag * si
                k[j + 1] = k_real * si + k_imag * co
```
YaRN:
- (NTK scaling) To extrapolate to longer sequences, people scale all frequencies (they all become slower, since every theta is divided by the scale factor s). This hurts the high frequencies: when they become lower (longer wavelengths due to scaling), the model becomes less sensitive to fine-grained details (positions become blurry).
- (NTK-aware, 3.1) Scale the frequencies by different amounts. (NTK-by-parts, 3.2) Scale the long wavelengths more, and don't scale the shortest wavelengths at all.
- (Dynamic, 3.3) The sequence length isn't fixed: if you use a fixed scale factor, you may over- or under-scale depending on the actual sequence length. Solution: compute the scaling on each forward pass from the current length.
RoPE frequencies (base): $\theta_d = b^{-2d/|D|}$

Ratio function: $r(d) = \dfrac{L}{\lambda_d}$, with wavelength $\lambda_d = \dfrac{2\pi}{\theta_d}$

Ramp function:

$$\gamma(r) = \begin{cases} 0 & \text{if } r < \alpha \\ 1 & \text{if } r > \beta \\ \dfrac{r - \alpha}{\beta - \alpha} & \text{otherwise} \end{cases}$$

Interpolation rule (NTK-by-parts):

$$\hat{\theta}_d = \big(1 - \gamma(r(d))\big)\,\frac{\theta_d}{s} + \gamma(r(d))\,\theta_d$$

where $s$ is the scaling factor (new context length / pretraining context length).

Modified attention (temperature scaling):

$$\text{softmax}\!\left(\frac{q_m^{\top} k_n}{t\,\sqrt{|D|}}\right)$$

Temperature value: $\sqrt{1/t} = 0.1\,\ln(s) + 1$
#Optimization
Our implementation was completely sequential for educational purposes; in practice:
- Attention heads are parallelized, since they don't interact with each other.
- Input tokens are parallelized. (Note that later tokens depend on earlier tokens, so the computation of earlier tokens through every layer must finish before later tokens can use their K, V; tokens generated by the LLM during output are still produced sequentially.)
- Experts are parallelized as well; different experts compute independently of each other.
Other optimizations are also done during inference, Quantization techniques are often used to reduce memory and computational requirements by representing weights and activations with lower precision (e.g., INT8 or FP8). Operator fusion combines multiple small operations (like matrix multiplications and layer normalizations) into single, more efficient kernels. Pipeline and tensor parallelism are employed across multiple GPUs or nodes to further distribute the computational load. Finally, speculative decoding and prefill-decode separation can speed up autoregressive generation by parallelizing token prediction in early stages and caching intermediate results.
We also did not go through how these models are trained, but that could be the subject of another blog post.
We also did not discuss how the GPTOSS MLP weights are stored, which is done through MXFP4 quantization, but I may add that to this blog post later on.
#Further Reading
Here are some excellent supplementary resources
Andrej Karpathy's Zero to Hero: Neural Networks: Zero to Hero
GPTOSS design choices and comparison with other models: From GPT-2 to gpt-oss: Analyzing the Architectural Advances
Neural Networks by 3Blue1Brown: 3Blue1Brown Neural Network Course