What changed in the Transformer architecture
Since the introduction of the Transformer architecture in 2017 (“Attention Is All You Need”), we have witnessed an explosion in the use of Transformers for natural language processing, computer vision, speech recognition, and more. Over the years, a series of refinements have incrementally improved the Transformer’s stability, efficiency, and ability to handle ever-larger context windows.
One of the main drivers for these refinements is the rapid growth of LLMs. Models like GPT, LLaMA, and others have scaled to billions or even trillions of parameters. Training such models effectively demands:
- Efficient Use of Data: Minimizing unnecessary padding in batches and leveraging techniques such as dynamic sequence packing.
- Longer Context Windows: Modern LLMs often require context lengths in the thousands or tens of thousands of tokens, placing heavy demands on the model’s ability to encode positional information accurately.
- Training Stability: As models grow in depth and size, ensuring stable gradient flow becomes more challenging, making robust normalization strategies essential.
- Computational Efficiency: Reducing memory footprint and compute time while maintaining or improving performance is a constant engineering challenge.
Data Packing and Positional Information
A concrete example of efficiency improvements is how training data is organized. Previously, each document in a batch might have been padded to a fixed maximum length, leading to wasted space if the document was short. This often looked like:
[ Doc 1 ] [ lots of padding ]
[ Doc 2 ] [ lots of padding ]
...
Now, an increasingly common strategy is to pack multiple documents into a single sequence or batch more efficiently, for example:
[ Doc 1 ] [Sep] [ Doc 2 ] [Sep] [ Doc 3 ] [Sep] [ less padding ]
[ Doc 4 ] [Sep] [ Doc 5 ] [Sep] [less padding]
...
Such “dynamic packing” or “efficient training data packing” reduces padding and makes better use of the model’s context window. However, it also requires the model to be very precise about positional information: it needs to know that the token right after a separator token [Sep] is the first token of the next document, not a continuation of the previous one. This is one of the key reasons for integrating positional embeddings at the self-attention layer, specifically with techniques like Rotary Positional Embeddings (RoPE), so that each new document’s token indices can be handled properly even if the document starts mid-sequence.
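To make this concrete, here is a minimal Python sketch of greedy packing and of per-document position indices. It assumes documents are already tokenized into lists of token IDs; sep_id and pad_id are hypothetical special-token IDs, and restarting positions at each document boundary is just one convention (other pipelines keep a running index and instead mask attention across documents).

```python
def pack_documents(docs, max_len, sep_id, pad_id):
    """Greedily pack tokenized documents into fixed-length sequences,
    separating documents with sep_id and padding only the tail."""
    sequences, current = [], []
    for doc in docs:
        doc = doc[: max_len - 1]  # leave room for the separator (sketch-level truncation)
        if current and len(current) + len(doc) + 1 > max_len:
            sequences.append(current + [pad_id] * (max_len - len(current)))
            current = []
        current.extend(doc + [sep_id])
    if current:
        sequences.append(current + [pad_id] * (max_len - len(current)))
    return sequences


def per_document_positions(sequence, sep_id):
    """Position indices that restart at 0 right after every [Sep], so the
    attention layer can treat each packed document as starting fresh."""
    positions, pos = [], 0
    for tok in sequence:
        positions.append(pos)
        pos = 0 if tok == sep_id else pos + 1
    return positions
```

These per-document positions are exactly the kind of index a rotary embedding can consume, as sketched in the next section.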
In short, while large context windows and minimal padding improve throughput, they also place higher demands on the model’s architecture to accurately track positions. Modern Transformer designs have adapted to meet these challenges, as we will now explore in detail.
1. Positional Encoding vs. Rotary Embeddings
Original Approach (2017)
- Sinusoidal Positional Encoding: The original Transformer design injects positional information at the input embedding level using fixed sinusoidal patterns. Each token’s embedding is added to a corresponding sinusoidal vector that encodes absolute position (e.g., token #1, token #2, etc.).
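For reference, here is a minimal PyTorch sketch of that fixed encoding; tensor shapes and names are illustrative.

```python
import torch

def sinusoidal_positional_encoding(seq_len, d_model, base=10000.0):
    """Classic fixed encoding: even dimensions get sin, odd dimensions get cos,
    with wavelengths forming a geometric progression; the result is added
    once to the token embeddings at the bottom of the stack."""
    positions = torch.arange(seq_len).float()[:, None]   # (seq_len, 1)
    dims = torch.arange(0, d_model, 2).float()           # (d_model/2,)
    angles = positions / base ** (dims / d_model)        # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe
```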
Modern Approach (2025)
- Rotary Positional Embeddings (RoPE): Instead of adding positional encodings at the input stage, many new architectures incorporate position information directly into the attention mechanism.
- Key Properties:
  - Token Indices in Packed Sequences: RoPE can more flexibly represent position when multiple documents are concatenated within a single batch. Each attention head can interpret that “token 1 of Doc 2” follows a [Sep] token, rather than blindly applying a single continuous absolute index across all tokens.
  - Longer Context Windows: By embedding positional information through rotation in the query-key space, the model can better extrapolate to longer sequences without retraining or redesigning the embedding layer.
  - Smooth Integration: RoPE is applied to the attention’s Q/K vectors, so positional information enters exactly where it is needed: when computing attention scores. A minimal sketch of this rotation follows the list.
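Here is a minimal PyTorch sketch of that rotation, applied identically to queries and keys before the attention scores are computed. The interleaved pairing of dimensions is one common convention, and the explicit positions argument is where packed sequences can restart each document at position 0.

```python
import torch

def rope_rotate(x, positions, base=10000.0):
    """Rotary position embedding.
    x: (batch, seq_len, n_heads, head_dim) query or key tensor.
    positions: (batch, seq_len) integer positions (may restart per document)."""
    head_dim = x.shape[-1]
    # One rotation frequency per pair of dimensions.
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, device=x.device).float() / head_dim))
    angles = positions[..., None].float() * inv_freq   # (batch, seq_len, head_dim/2)
    cos = angles.cos()[:, :, None, :]                   # broadcast over heads
    sin = angles.sin()[:, :, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]                 # split dims into (even, odd) pairs
    rotated = torch.stack([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], dim=-1)
    return rotated.flatten(-2)                          # back to (..., head_dim)
```

Because the rotation depends only on the angle between a query position and a key position, the dot product between rotated queries and keys naturally encodes relative offsets, which is what makes long-context extrapolation and per-document restarts workable.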
2. Pre-Layer Normalization
Original Approach (2017)
- Post-Layer Normalization: In the original Transformer, the sub-layers (Self-Attention or Feed-Forward) are followed by an “Add & Norm” step. So the flow is:
- Sub-layer (e.g., Attention)
- Add residual connection
- Apply LayerNorm
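In code, the 2017 post-norm wiring looks roughly like this (a minimal sketch; sublayer and norm stand in for the attention or feed-forward module and LayerNorm):

```python
def post_norm_step(x, sublayer, norm):
    # 2017 wiring: run the sub-layer, add the residual, then normalize.
    return norm(x + sublayer(x))
```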
Modern Approach (2025)
- Pre-Layer Normalization: Many newer Transformers move the normalization step to before the sub-layer. The flow becomes:
- Apply LayerNorm (or RMSNorm)
- Sub-layer
- Add residual connection
- Advantages:
- Training Stability: Empirical evidence shows that pre-normalization can significantly reduce training instabilities, especially in deeper architectures.
- Gradient Flow: By normalizing inputs to each sub-layer, gradients pass more consistently back through the network, mitigating issues with exploding or vanishing gradients.
- RMSNorm: Some variants replace LayerNorm with RMSNorm, which normalizes activations by their root mean square (no mean subtraction, no bias term) and can be more efficient, and in some cases more stable. A minimal sketch of RMSNorm and the pre-norm wiring follows below.
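Here is a minimal PyTorch sketch of RMSNorm and of the pre-norm residual wiring; module and argument names are illustrative.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Rescale activations by their root mean square (no mean subtraction,
    no bias), with a learned per-dimension gain."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight

class PreNormResidual(nn.Module):
    """Pre-norm wiring: normalize first, run the sub-layer, then add the residual."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.norm = RMSNorm(dim)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(self.norm(x))
```

Compare this with the post-norm sketch above: here the residual path itself is never normalized, which is a large part of why gradients flow more predictably through very deep stacks.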
3. Grouped-Query Attention
Original Self-Attention (2017)
- Multi-Head Attention (one K/V head per query head): The original Transformer splits queries, keys, and values into the same number of heads, so every query head has its own dedicated key and value projections. Nothing is shared across heads, and the key/value cache therefore grows with the full head count.
Modern Self-Attention (2025)
- Grouped-Query Attention (GQA): The query heads are divided into groups, and all query heads within a group share a single key/value head. This sits between full multi-head attention (one K/V head per query head) and multi-query attention (one K/V head shared by every query head), and it can:
  - Enhance Efficiency: Fewer key/value heads mean a smaller K/V cache and less memory traffic, which matters most during inference with long sequences and large batches.
  - Preserve Quality: With a moderate number of K/V heads, models typically retain most of the quality of full multi-head attention while keeping much of the speed of multi-query attention.
  - Improve Scaling: As context windows grow, the reduced key/value memory helps keep computational and memory costs under control, although the exact savings vary across implementations.
Concretely, with 8 query heads and 2 key/value heads, each key/value head serves a group of 4 query heads; the keys and values are simply repeated (or broadcast) across their group before the usual scaled dot-product attention, as in the minimal sketch below.
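A minimal PyTorch sketch, assuming the query, key, and value projections have already been computed and split into heads; the causal mask and head counts are illustrative.

```python
import torch

def grouped_query_attention(q, k, v):
    """q: (batch, n_q_heads, seq, head_dim); k, v: (batch, n_kv_heads, seq, head_dim),
    where n_q_heads is a multiple of n_kv_heads."""
    n_q_heads, n_kv_heads = q.shape[1], k.shape[1]
    group_size = n_q_heads // n_kv_heads
    # Broadcast each key/value head across its group of query heads.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    # Causal mask: each position attends only to itself and earlier positions.
    seq_len = q.shape[-2]
    causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device), diagonal=1)
    scores = scores.masked_fill(causal, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

Setting the number of key/value heads equal to the number of query heads recovers standard multi-head attention, and setting it to 1 recovers multi-query attention. Efficient kernels typically avoid materializing the repeated heads and index the shared K/V directly, but the arithmetic is the same.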
4. Putting It All Together
Bringing these ideas together, a modern Transformer block might look like this (a minimal end-to-end sketch in code follows the list):
- Input: Hidden representation from the previous layer (or from the embedding layer for the first block).
- Pre-Normalization: Apply LayerNorm or RMSNorm to the input.
- Self-Attention (with Rotary Embeddings + Grouped-Query):
- Each query-key pair is rotated according to its position via RoPE, ensuring that the model handles positional information dynamically.
- Queries might be grouped or partitioned to improve efficiency and representation power.
- Residual Connection: Add the attention output to the input of this sub-layer.
- Pre-Normalization for Feed-Forward: Normalize again before the feed-forward sub-layer.
- Feed-Forward: A multi-layer perceptron (often with a GELU activation or a gated variant such as SwiGLU) processes the normalized representations.
- Residual Connection: Add the output of the feed-forward sub-layer to its input.
- Output: Pass this to the next block (or to the final prediction layer if it’s the last block).
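Here is a minimal PyTorch sketch of such a block, reusing the RMSNorm, rope_rotate, and grouped_query_attention helpers sketched above; the dimensions, head counts, and the plain GELU feed-forward are illustrative choices rather than any particular model’s exact configuration.

```python
import torch
import torch.nn as nn

class ModernTransformerBlock(nn.Module):
    """Pre-norm block: RMSNorm -> attention (RoPE + grouped queries) -> residual,
    then RMSNorm -> feed-forward -> residual."""
    def __init__(self, dim=512, n_q_heads=8, n_kv_heads=2, ffn_mult=4):
        super().__init__()
        self.n_q_heads, self.n_kv_heads = n_q_heads, n_kv_heads
        self.head_dim = dim // n_q_heads
        self.attn_norm, self.ffn_norm = RMSNorm(dim), RMSNorm(dim)
        self.wq = nn.Linear(dim, n_q_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_q_heads * self.head_dim, dim, bias=False)
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_mult * dim), nn.GELU(), nn.Linear(ffn_mult * dim, dim))

    def forward(self, x, positions):
        # x: (batch, seq, dim); positions: (batch, seq) per-document token indices.
        b, s, _ = x.shape
        h = self.attn_norm(x)
        q = self.wq(h).view(b, s, self.n_q_heads, self.head_dim)
        k = self.wk(h).view(b, s, self.n_kv_heads, self.head_dim)
        v = self.wv(h).view(b, s, self.n_kv_heads, self.head_dim)
        q, k = rope_rotate(q, positions), rope_rotate(k, positions)   # position enters here
        attn = grouped_query_attention(
            q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2))
        x = x + self.wo(attn.transpose(1, 2).reshape(b, s, -1))       # residual 1
        return x + self.ffn(self.ffn_norm(x))                          # residual 2
```

For a quick check, calling ModernTransformerBlock() on an input of shape (2, 16, 512) with positions torch.arange(16).repeat(2, 1) should return a tensor of the same shape.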
5. Why These Changes Matter
Handling Longer Contexts & Efficient Data Packing
- Modern language models often process thousands of tokens per sequence. The shift from sinusoidal to rotary positional embeddings allows the model to better encode position information within these large contexts and across multiple documents in a single batch.
- Efficient data packing (e.g., [Doc 1] [Sep] [Doc 2] [Sep] [Doc 3] ...) becomes more feasible, as the model can handle each document’s token indices flexibly without confusion from a single, continuous positional encoding.
Stability in Large-Scale Training
- Pre-layer normalization is known to reduce the risk of exploding/vanishing gradients. This is crucial when training models with billions or trillions of parameters, where even small instabilities can derail the entire training process.
Improved Performance & Efficiency
- Empirical results suggest that these design choices pay off in practice: pre-normalization makes deep models easier to train to lower perplexities, and grouped-query attention preserves most of the quality of full multi-head attention while cutting inference cost, so performance on language modeling and downstream benchmarks is maintained or improved.
- They can also reduce memory usage and speed up attention computation, which is vital for large-batch training and inference on specialized hardware (GPUs, TPUs).
Alignment with Future Trends
- As models continue to expand in size and context window, techniques that integrate positional information at the attention level and streamline computations will likely remain at the forefront of Transformer research.
6. Conclusion
The evolution from the original 2017 Transformer block to a Llama 3-style block exemplifies how small but crucial architectural refinements can significantly improve large-scale language models. By:
- Embedding positional information in the self-attention mechanism (Rotary Positional Embeddings),
- Moving normalization to a pre-layer position (pre-LayerNorm or RMSNorm),
- Introducing grouped-query attention for efficiency and better representational capacity,
- And optimizing data packing to minimize padding,
modern Transformers achieve better training stability, handle longer sequences, and make more efficient use of hardware resources.
As research continues, we can expect even more specialized attention mechanisms and normalization strategies, driven by the ongoing demand for larger models and longer contexts. Nonetheless, these modern refinements are a powerful step forward, ensuring that Transformers remain the go-to architecture for a broad range of complex tasks in natural language processing. These are just some of the changes I have come across as I keep learning; if you know of other refinements, please comment below.
Further Reading
- “Attention Is All You Need” (Vaswani et al., 2017): The original Transformer paper. https://arxiv.org/abs/1706.03762
- Rotary Position Embeddings (Su et al., 2021, “RoFormer”): Integrate position directly into the Q/K vectors and help with long context windows. E.g.: https://blog.eleuther.ai/rotary-embeddings
- Pre-Layer Normalization (Xiong et al., 2020, “On Layer Normalization in the Transformer Architecture”): Shows improved stability in very deep Transformer networks. E.g.: https://sh-tsang.medium.com/review-pre-ln-transformer-on-layer-normalization-in-the-transformer-architecture-b6c91a89e9ab
- Grouped-Query Attention (Ainslie et al., 2023): Implemented in various open-source libraries (PyTorch, JAX, etc.) for scaling attention efficiently. E.g.: https://www.ibm.com/think/topics/grouped-query-attention
- Data Packing Strategies: Tutorials and blogs on dynamic sequence packing, minimizing padding, and maximizing GPU utilization for large-batch training. E.g.: https://www.carted.com/blog/variable-length-sequences-in-tensorflow-part-1