Understanding Mixtral-8x7b
This blog post is adapted from an X thread I posted. It's garnered significant interest, so I decided to post it here as well!
Mixtral-8x7b by MistralAI is an LLM that outperforms all but OpenAI's and Anthropic's most powerful models. And it is open source. In this blog post, I will explain its architecture design using my Neural Circuit Diagrams. Let's dive in and see how cutting-edge transformers work!
From LMSys' Chatbot Arena. Mixtral-8x7b is very, very good. You can try it in the arena for yourself!
The overall structure of the architecture is shockingly simple: it is a decoder-only transformer. The model's input is a series of tokens, which are embedded into vectors and then processed by a stack of decoder layers. The output is, for every position, a probability distribution over which word occupies it, allowing for text prediction and infill.
The overall model converts tokens to vectors, processes them, and converts them back to word probabilities.
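To make that flow concrete, here is a minimal PyTorch sketch of the token-to-probability pipeline. The sizes are illustrative rather than Mixtral's real hyperparameters, and a stock `nn.TransformerEncoderLayer` with a causal mask stands in for Mixtral's decoder layers (a decoder-only layer is essentially an encoder-style layer restricted to attending to the past).

```python
import torch
import torch.nn as nn

# Illustrative sizes only -- not Mixtral's real hyperparameters.
vocab_size, d_model, n_layers, seq_len = 32000, 512, 4, 16

embed = nn.Embedding(vocab_size, d_model)                 # tokens -> vectors
decoder_layers = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True) for _ in range(n_layers)]
)                                                         # stand-ins for the real decoder layers
lm_head = nn.Linear(d_model, vocab_size, bias=False)      # vectors -> vocabulary logits

tokens = torch.randint(0, vocab_size, (1, seq_len))       # a batch of token IDs
causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len)

x = embed(tokens)                                         # (1, seq_len, d_model)
for layer in decoder_layers:
    x = layer(x, src_mask=causal_mask)                    # contextual + per-token processing
probs = lm_head(x).softmax(dim=-1)                        # probability of every word at every position
```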
Every decoder layer has two key sections: an attention mechanism, which incorporates contextual information, and a multi-layer perceptron, which processes every word vector individually.
These are encapsulated in residual connections, which allow for training at depth. The combination of contextual and individual processing allows sophisticated patterns to be discovered.
The decoder layers are akin to the original transformer's, but exclusively use self-attention.
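Here is a minimal sketch of one such layer. PyTorch's built-in multi-head attention is used as a placeholder for Mixtral's attention, `LayerNorm` stands in for its normalization, and a plain dense MLP sits where Mixtral actually uses the sparse mixture-of-experts layer described later in this post.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    """One decoder layer: self-attention (contextual) and an MLP (per-token),
    each wrapped in a residual connection around a normalization."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn_norm = nn.LayerNorm(d_model)             # stand-in for the model's normalization
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp_norm = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(                          # processes every position independently;
            nn.Linear(d_model, d_ff),                      # in Mixtral this block is the SMoE layer
            nn.SiLU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x):                                  # x: (batch, seq_len, d_model)
        causal = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.attn_norm(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal) # self-attention only
        x = x + attn_out                                   # residual connection around attention
        x = x + self.mlp(self.mlp_norm(x))                 # residual connection around the MLP
        return x
```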
The attention mechanism used is similar to the original transformer's, which I cover in detail in my paper and briefly in a YouTube video. I list additional key features in the diagram; they are also covered in the original GitHub repository and the Hugging Face docs.
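For reference, the core computation is plain scaled dot-product self-attention with a causal mask. The sketch below strips away the production details (multiple heads, positional information, key-value caching), so treat it as the textbook kernel rather than Mixtral's exact implementation.

```python
import math
import torch

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention with a causal mask.
    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_head) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                    # project tokens to queries, keys, values
    scores = q @ k.T / math.sqrt(q.shape[-1])              # similarity of every query with every key
    mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))       # each position attends only to the past
    weights = scores.softmax(dim=-1)                       # attention weights sum to 1 per query
    return weights @ v                                     # weighted sum of value vectors

# Toy usage
seq_len, d_model, d_head = 6, 16, 8
x = torch.randn(seq_len, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d_head) for _ in range(3))
out = causal_self_attention(x, w_q, w_k, w_v)              # (seq_len, d_head)
```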
A key feature not explicitly shown in the diagram below is FlashAttention by Hazy Research, which accelerates attention by decomposing it into blocks that fit within GPU kernels with access to high-speed memory. I've been making progress using Neural Circuit Diagrams to derive such techniques: they naturally display which variables sit in memory, along with linearity and broadcasting, offering the formal tools needed to accelerate algorithms.
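The essence of that decomposition can be written in a few lines: process keys and values block by block while keeping a running maximum and running normalizer, so the softmax over the full sequence is never materialized at once. This is the online-softmax recurrence behind FlashAttention, sketched in plain PyTorch; the real speed-up comes from fusing it into a single GPU kernel, which this sketch does not attempt.

```python
import torch

def blockwise_attention(q, k, v, block_size=4):
    """Attention computed one key/value block at a time with an online softmax.
    Mathematically equal to softmax(q k^T / sqrt(d)) v, but never forms the full score matrix."""
    scale = q.shape[-1] ** -0.5
    n = q.shape[0]
    acc = torch.zeros(n, v.shape[-1])                  # running (unnormalized) output
    denom = torch.zeros(n)                             # running softmax denominator
    running_max = torch.full((n,), float("-inf"))      # running row-wise maximum, for stability

    for start in range(0, k.shape[0], block_size):
        k_blk, v_blk = k[start:start + block_size], v[start:start + block_size]
        scores = (q @ k_blk.T) * scale                 # scores against this block only
        new_max = torch.maximum(running_max, scores.max(dim=-1).values)
        correction = torch.exp(running_max - new_max)  # rescale earlier contributions
        p = torch.exp(scores - new_max[:, None])
        denom = denom * correction + p.sum(dim=-1)
        acc = acc * correction[:, None] + p @ v_blk
        running_max = new_max
    return acc / denom[:, None]

# Matches ordinary (unmasked) attention up to floating-point error.
q, k, v = torch.randn(8, 16), torch.randn(8, 16), torch.randn(8, 16)
reference = torch.softmax(q @ k.T / 16 ** 0.5, dim=-1) @ v
assert torch.allclose(blockwise_attention(q, k, v), reference, atol=1e-5)
```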
Attention mechanisms have gradually evolved since being popularized by 2017's Attention Is All You Need.
Finally, we get to the key feature of Mixtral: the Sparse Mixture of Experts (SMoE). MLP layers are immense consumers of computational resources. An SMoE layer has multiple MLPs ("experts") available, and for every input a weighted sum is taken over the outputs of only the most relevant experts. SMoE layers can therefore learn sophisticated patterns while keeping compute costs relatively low.
A gating mechanism decides which layers to execute, leading to a computationally efficient algorithm. See also Switch Transformers and MegaBlocks.
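Here is a minimal sketch of that routing logic, with plain two-layer MLPs standing in for Mixtral's gated experts (Mixtral routes each token to 2 of its 8 experts, which the defaults below mirror). A linear gate scores the experts for each token, the top-k scores are softmaxed into mixing weights, and only the selected experts are actually run.

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Sparse mixture of experts: each token is processed by only its top-k experts,
    and their outputs are combined using weights from the gating network."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts, bias=False)   # scores every expert for every token
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]                          # simplified stand-in experts
        )

    def forward(self, x):                                # x: (n_tokens, d_model)
        scores = self.gate(x)                            # (n_tokens, n_experts)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        weights = top_scores.softmax(dim=-1)             # mixing weights over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                   # each token's 1st, 2nd, ... choice
            for e, expert in enumerate(self.experts):
                chosen = top_idx[:, slot] == e           # tokens whose slot-th choice is expert e
                if chosen.any():
                    out[chosen] += weights[chosen, slot].unsqueeze(-1) * expert(x[chosen])
        return out

# Usage: 10 token vectors routed through 8 experts, 2 per token.
tokens = torch.randn(10, 512)
mixed = SparseMoE()(tokens)                              # (10, 512)
```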
Conclusion. Mixtral is an immense achievement for the open-source AI community. The model is surprisingly simple. Compared to the original transformer architecture, encoders have been removed, and attention mechanisms carry seven years of gradual innovation. The biggest change is the use of SMoEs in place of plain MLPs. Mixtral has proven that open-source designs and SMoEs are at the frontier of ML development, and I suspect both will attract far more attention as a result.
The overall attention architecture, expressed using Neural Circuit Diagrams.