---
library_name: deepseek-mla
tags:
- attention-mechanism
- transformers
- pytorch
- mla
- efficient-attention
pipeline_tag: text-generation
language: en
license: mit
---

# DeepSeek Multi-Head Latent Attention

This repository provides a PyTorch implementation of the Multi-Head Latent Attention (MLA) mechanism introduced in the DeepSeek-V2 paper. **This is not a trained model, but a modular attention implementation** that significantly reduces the KV cache for efficient inference while maintaining model performance through low-rank key-value compression. It can be used as a drop-in attention module in transformer architectures.

This repository is part of a series implementing the key architectural innovations from the DeepSeek paper. See the **Related Implementations** section for the complete series.

## Key Features

- **Low-Rank Key-Value Joint Compression**: Reduces memory footprint during inference
- **Decoupled Rotary Position Embedding**: Enables efficient position-aware attention
- **Optimized Cache Management**: Handles both compressed KV states and rotary embeddings
- **Cross-Attention Support**: Works for both self-attention and cross-attention scenarios

## Installation

Clone this repository:

```bash
git clone https://huggingface.co./bird-of-paradise/deepseek-mla
```

Or download directly from the Hugging Face repository page.

## Quick Start

```python
import torch
from src.mla import MultiHeadLatentAttention

# Initialize MLA
mla = MultiHeadLatentAttention(
    d_model=512,    # Model dimension
    num_head=8,     # Number of attention heads
    d_embed=512,    # Embedding dimension
    d_c=64,         # KV compression dimension
    d_c1=64,        # Query compression dimension
    d_rotate=32,    # Rotary embedding dimension
)

# Input sequence
x = torch.randn(2, 10, 512)  # [batch_size, seq_len, d_model]

# Forward pass
output = mla(x)
```

## Testing

To run the test suite, execute the following command from the project root directory:

```bash
python -m src.tests.test_mla
```

## Architecture Details

![MLA Architecture](assets/mla_architecture.png)

MLA combines two key innovations:

1. A low-rank compression pathway for efficient KV caching
2. A decoupled position-aware pathway using RoPE

For detailed architectural insights, see [insights/architecture.md](insights/architecture.md).

## Caching Behavior

During inference, MLA maintains two caches:

```python
cache_kv: [batch, max_len, d_c]  # Compressed KV states
cache_rk: [batch, max_len, d_r]  # Shared rotary key
```

For detailed insights on attention masking and caching, see [insights/attention_mask.md](insights/attention_mask.md).

## Usage Examples

### Basic Attention

```python
# Standard self-attention
output = mla(sequence)

# Cross-attention
output = mla(query, key_value_states=context)
```

### Cached Generation

```python
# Initial forward pass
output = mla(prompt, use_cache=True, start_pos=0)

# Generate tokens using the cache
for i in range(max_new_tokens):
    output = mla(next_token, use_cache=True, start_pos=prompt_len + i)
```

## Implementation Details

The implementation closely follows the formulation in the DeepSeek-V2 paper:

![MLA Formulas](assets/mla_formulas.png)

Key aspects:

- Separate compression pathways for queries and key-values
- Position encoding through a decoupled RoPE pathway
- Efficient cache management for both pathways
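To make the memory saving concrete, the sketch below shows the low-rank key-value joint compression idea in isolation: hidden states are projected down to a small latent of size `d_c` (that latent is all that needs to be cached), then projected back up to per-head keys and values at attention time. This is a minimal illustration under assumed names, not the repository's `MultiHeadLatentAttention` API; `LowRankKVCompression`, `W_down_kv`, `W_up_k`, and `W_up_v` are hypothetical identifiers introduced only for the example.

```python
# Minimal sketch of low-rank KV joint compression (illustration only; see
# src/mla.py for the actual MultiHeadLatentAttention implementation).
import torch
import torch.nn as nn


class LowRankKVCompression(nn.Module):
    """Compress hidden states into a small latent that can be cached,
    then expand it back into per-head keys and values at attention time."""

    def __init__(self, d_model=512, num_head=8, d_c=64):
        super().__init__()
        self.num_head = num_head
        self.d_head = d_model // num_head
        self.W_down_kv = nn.Linear(d_model, d_c, bias=False)  # d_model -> d_c; this output is cached
        self.W_up_k = nn.Linear(d_c, d_model, bias=False)      # d_c -> d_model, expanded into keys
        self.W_up_v = nn.Linear(d_c, d_model, bias=False)      # d_c -> d_model, expanded into values

    def forward(self, h):
        # h: [batch, seq_len, d_model]
        batch, seq_len, _ = h.shape
        c_kv = self.W_down_kv(h)  # [batch, seq_len, d_c] -- only this goes in the cache
        k = self.W_up_k(c_kv).view(batch, seq_len, self.num_head, self.d_head).transpose(1, 2)
        v = self.W_up_v(c_kv).view(batch, seq_len, self.num_head, self.d_head).transpose(1, 2)
        return c_kv, k, v  # k, v: [batch, num_head, seq_len, d_head]


compress = LowRankKVCompression()
x = torch.randn(2, 10, 512)
c_kv, k, v = compress(x)
print(c_kv.shape)        # torch.Size([2, 10, 64])  -- cached latent
print(k.shape, v.shape)  # torch.Size([2, 8, 10, 64]) each -- rebuilt from the latent on the fly
```

In this toy setup, caching `c_kv` stores 64 values per token instead of the 1024 needed for full keys and values, which is where the KV-cache reduction comes from. The paper further notes that the up-projections can be absorbed into the query and output projections, so the expanded keys and values never need to be materialized; the decoupled RoPE pathway adds only a small shared rotary key on top of the compressed cache.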
## Related Implementations

This repository is part of a series implementing the key architectural innovations from the DeepSeek paper:

1. **[DeepSeek Multi-Head Latent Attention](https://huggingface.co./bird-of-paradise/deepseek-mla)** (this repository): Implementation of DeepSeek's MLA mechanism for efficient KV cache usage during inference.
2. **[DeepSeek MoE](https://huggingface.co./bird-of-paradise/deepseek-moe)**: Implementation of DeepSeek's Mixture of Experts architecture, which enables efficient scaling of model parameters.
3. **[Transformer Implementation Tutorial](https://huggingface.co./datasets/bird-of-paradise/transformer-from-scratch-tutorial)**: A detailed tutorial on implementing the transformer architecture, with explanations of key components.

Together, these implementations cover the core innovations that power DeepSeek's state-of-the-art performance. By combining the MoE architecture with Multi-Head Latent Attention, you can build a complete DeepSeek-style model with improved training efficiency and inference performance.

## Contributing

Contributions are welcome! Feel free to:

- Report bugs and issues
- Submit pull requests for improvements
- Add additional test cases
- Provide documentation clarifications

Please ensure all tests pass before submitting pull requests.

## Citation

```bibtex
@misc{deepseek2024,
  title={DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model},
  author={DeepSeek-AI},
  year={2024},
  eprint={2405.04434},
  archivePrefix={arXiv}
}
```

## License

[MIT License](LICENSE)