inference optimization
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Paper • 2205.14135 • Published • 12
Note: https://spaces.ac.cn/archives/10091/comment-page-1 MHA -> MQA (Multi-Query Attention) -> GQA (Grouped-Query Attention) -> MLA (Multi-Head Latent Attention)
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Paper • 2307.08691 • Published • 8
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision
Paper • 2407.08608 • Published • 1
Note: A guide to LLM inference and performance: https://www.baseten.co/blog/llm-transformer-inference-guide/
LLM inference speed of light: https://zeux.io/2024/03/15/llm-inference-sol/
Fast Transformer Decoding: One Write-Head is All You Need
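Paper • 1911.02150 • Published • 6
Note: MQA (Multi-Query Attention) is a very simple first attempt at shrinking the KV cache. The idea is straightforward: all attention heads share a single K and V. In terms of the formulas, this means dropping the head superscript (s) from every k and v in MHA. Models that use MQA include PaLM, StarCoder, and Gemini. MQA cuts the KV cache to 1/h of its original size, which is a substantial saving; judged purely by memory savings, it is already the ceiling. As for quality, the loss on most tasks appears to be fairly limited, and MQA's proponents believe it can be recovered with further training. Note also that sharing K and V reduces the attention parameter count by nearly half. To keep the model's total parameter count unchanged, the FFN/GLU is usually enlarged accordingly, which also makes up for part of the quality loss.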
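The 1/h figure in the note can be checked with a little arithmetic. The sketch below is only an illustration: the helper `kv_cache_bytes` and the model shape (32 layers, 32 heads, head dimension 128, 4096 tokens, 2 bytes per element) are assumptions made up for this example and are not taken from any of the papers listed here.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache K and V for every layer and every token."""
    # The factor 2 accounts for storing one tensor for K and one for V.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape, chosen only for illustration.
num_layers, num_heads, head_dim, seq_len = 32, 32, 128, 4096

# MHA caches K/V for all h heads; MQA caches a single shared K/V head.
mha_bytes = kv_cache_bytes(num_layers, num_heads, head_dim, seq_len)
mqa_bytes = kv_cache_bytes(num_layers, 1, head_dim, seq_len)

print(f"MHA: {mha_bytes / 2**20:.0f} MiB")
print(f"MQA: {mqa_bytes / 2**20:.0f} MiB")
print(f"MQA / MHA = 1/{mha_bytes // mqa_bytes}")
```

Only the number of cached K/V heads changes between the two calls, so the ratio is exactly 1/h regardless of the other assumed dimensions.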
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
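Paper • 2305.13245 • Published • 5
Note: GQA (Grouped-Query Attention). Some worry that MQA compresses the KV cache too aggressively, hurting both learning efficiency and final quality. GQA emerged as an intermediate design between MHA and MQA. Its idea is equally simple: split the heads into g groups, where g divides h, and let each group share one pair of K and V. When g = h it is MHA; when g = 1 it is MQA; when 1 ...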
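The grouping can be sketched as a mapping from query-head index to shared K/V index. This is a minimal illustration, not the paper's code: the helper name `kv_group_of` is hypothetical, and assigning contiguous heads to the same group is an assumption made here for simplicity.

```python
def kv_group_of(head, num_heads, num_groups):
    """Index of the shared K/V pair that query head `head` reads from."""
    if num_heads % num_groups != 0:
        raise ValueError("num_groups must divide num_heads")
    # Assumption: heads are assigned to groups in contiguous blocks.
    heads_per_group = num_heads // num_groups
    return head // heads_per_group


h = 8  # hypothetical number of query heads
for g in (h, 4, 1):
    mapping = [kv_group_of(i, h, g) for i in range(h)]
    # Only g K/V pairs are cached, so the cache is g/h the size of MHA's.
    print(f"g={g}: head -> K/V index {mapping}, KV cache = {g}/{h} of MHA")
```

With g = h every head keeps its own K/V (MHA), with g = 1 every head maps to index 0 (MQA), and intermediate values of g trade cache size against the number of distinct K/V pairs.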
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Paper • 2405.04434 • Published • 14
Note: MLA (Multi-head Latent Attention)