MInference

Weclome to use MInference, which leverages the dynamic sparse nature of LLMs' attention, which exhibits some static patterns, to speed up the pre-filling for million tokens LLMs. It first determines offline which sparse pattern each head belongs to, then approximates the sparse index online and dynamically computes attention with the optimal custom kernels. This approach achieves up to a 10x speedup for pre-filling on an A100 while maintaining accuracy with 1M tokens.

For more detail please check,
project page: https://aka.ms/MInference
code: https://github.com/microsoft/MInference
paper: MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention (2407.02490)
hf demo: microsoft/MInference

liyucheng

authored a paper 7 months ago

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

Paper • 2407.02490 • Published Jul 2, 2024 • 23

iofu728

authored a paper 7 months ago

MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention

Paper • 2407.02490 • Published Jul 2, 2024 • 23

iofu728

posted an update 11 months ago

Post

1519

Welcome to LLMLingua-2, a small-size yet powerful prompt compression method trained via data distillation from GPT-4 for token classification with a BERT-level encoder, excels in task-agnostic compression. It surpasses LLMLingua in handling out-of-domain data, offering 3x-6x faster performance. @qianhuiwu

website: https://llmlingua.com/llmlingua2.html
code: https://github.com/microsoft/LLMLingua
demo: microsoft/llmlingua-2

2 replies

iofu728

authored a paper 11 months ago

LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

Paper • 2403.12968 • Published Mar 19, 2024 • 25

iofu728

authored a paper over 1 year ago

LongLLMLingua: Accelerating and Enhancing LLMs in Long Context Scenarios via Prompt Compression

Paper • 2310.06839 • Published Oct 10, 2023 • 3

AI & ML interests

Recent Activity

Team members 2

MInference's activity