Mastering Long Contexts in LLMs with KVPress

Community Article · Published January 23, 2025

TL;DR: KVPress packs the latest KV cache compression techniques, enabling memory-efficient long-context LLMs. 🚀

One of the key features of Large Language Models (LLMs) is their context window: the maximum number of tokens they can process in a single request. As LLMs evolve, their context windows keep growing.

Larger context windows unlock incredible possibilities:

  • In-context retrieval: Seamlessly referencing large amounts of text within a single query.
  • In-context learning: Adapting behavior to specific examples within the same session.
  • Extended reasoning: Handling very long chains of thought without breaking context.

However, these extended windows come at a cost: the memory consumed by the KV Cache for such long contexts becomes hard to manage. For instance, handling 1M tokens with Llama 3-70B in bfloat16 demands roughly 330 GB for the KV Cache alone, rendering it infeasible for many applications.

In this blog post, we'll address one solution for this problem: compressing the KV Cache for more efficient generation. To achieve this, we'll explore:

  • What the KV Cache is and why it matters.
  • KVPress, a powerful toolkit from NVIDIA designed to compress KV Cache effectively.
  • The inner workings of KVPress and how it achieves compression.

Before getting started, you can explore KVPress in this Space (you'll find examples at the end if needed).

What is the KV Cache and why does it matter?

Figure 1: Key-value cache inside the attention module (Source: NVIDIA)

In autoregressive models, text generation happens token by token, with each prediction relying on all preceding tokens for context. For example:

  • To generate token 1000, the model must consider the representations of tokens 1 to 999.
  • To generate token 1001, the same information (tokens 1 to 999) must be processed again, along with token 1000.

This repetitive computation becomes inefficient as the sequence grows, especially for large models. KV Cache optimizes this process by storing the intermediate results—keys (K) and values (V)—from the attention layers, so the model can reuse them for future tokens instead of recalculating them.
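
To make this concrete, here is a minimal, illustrative single-head decode loop in PyTorch. The dimensions and random projection matrices below are placeholders rather than anything from a real model; the point is that each step only computes the query, key, and value for the newest token and reuses the cached keys and values for everything before it.

import torch

d = 64                                               # toy head dimension
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))   # placeholder projections

k_cache, v_cache = [], []                            # the "KV Cache": one entry per past token

def decode_step(x_new):
    """x_new: embedding of the newest token, shape (1, d)."""
    q = x_new @ Wq
    # Only the new token's key and value are computed; past ones are reused.
    k_cache.append(x_new @ Wk)
    v_cache.append(x_new @ Wv)
    K, V = torch.cat(k_cache), torch.cat(v_cache)    # (seq_len, d)
    attn = torch.softmax(q @ K.T / d**0.5, dim=-1)   # attend over all cached tokens
    return attn @ V                                  # context vector for the new token

for _ in range(5):                                   # pretend we decode 5 tokens
    out = decode_step(torch.randn(1, d))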

The Problem: KV Cache and its Linearly Scaling Burden

As powerful as the KV Cache is, it comes with a major drawback: it scales linearly with the size of the context window. While this might not sound alarming at first, let's break it down to see why it becomes a serious bottleneck.

The Size of the KV Cache

The values stored in the KV Cache come from every attention block in the model. Its size therefore depends on the model architecture: the number of layers, the number of key-value heads (with grouped-query attention, this can be smaller than the number of query heads), and the head dimension. More concretely, the memory consumed by the KV Cache is determined by the following equation:

$$\text{Size}(\text{KV}) = 2 \times \text{precision} \times n_{layers} \times n_{heads} \times d \times n_{tokens}$$

Each of these factors contributes to the explosion in memory usage. To make this more tangible, let's consider a concrete example: Llama 3-70B (80 layers, 8 key-value heads, head dimension 128) running in bfloat16 precision (as recommended by the model authors) with a context size of 1M tokens:

$$\text{Size}(\text{KV}) = 2 \times 2 \times 80 \times 8 \times 128 \times 1\text{M} = 327.6\ \text{GB}$$

Since bfloat16 uses 2 bytes per parameter, the model weights alone require 140 GB (70B parameters × 2 bytes). Running the model with a 1M-token context therefore demands approximately 470 GB of memory, with the KV Cache alone accounting for a staggering 70% of the total.
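
As a quick sanity check, the same computation fits in a few lines of Python (a small helper written for this post, plugging in the Llama 3-70B figures above):

def kv_cache_bytes(precision_bytes, n_layers, n_kv_heads, head_dim, n_tokens):
    # the factor 2 accounts for storing both keys and values
    return 2 * precision_bytes * n_layers * n_kv_heads * head_dim * n_tokens

# Llama 3-70B: 80 layers, 8 key-value heads (GQA), head dimension 128, bfloat16 (2 bytes)
size = kv_cache_bytes(2, 80, 8, 128, 1_000_000)
print(f"{size / 1e9:.0f} GB")  # -> 328 GB, in line with the ~327.6 GB above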

KVPress: A toolkit for KV Cache Compression

As we've seen, the KV Cache is both a critical enabler and a significant bottleneck for deploying large language models (LLMs) with long context windows. Addressing the linearly scaling memory problem requires innovative compression techniques, and that's exactly where KVPress steps in.

KVPress, developed by NVIDIA, is a Python toolkit designed to address the memory challenges of large KV Caches by providing a suite of state-of-the-art compression techniques. It also integrates with other approaches, such as KV Cache Quantization, a method built into the transformers library to reduce memory usage (the precision term in the equation above), further expanding its utility (details here).
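
As a rough illustration of the quantization side on its own, transformers exposes quantized KV caches directly through generate. The sketch below is independent of KVPress, assumes the optimum-quanto backend is installed, and uses a placeholder prompt; see the KVPress documentation for how to combine a quantized cache with a press.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("A very long context ...", return_tensors="pt").to(model.device)
# Store the KV Cache in 4-bit instead of bfloat16, shrinking the "precision" term above
output = model.generate(
    **inputs,
    max_new_tokens=32,
    cache_implementation="quantized",
    cache_config={"backend": "quanto", "nbits": 4},
)
print(tokenizer.decode(output[0], skip_special_tokens=True))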

For researchers specializing in compression, KVPress offers a flexible and modular framework, making it easy to understand and extend with new methods. For developers, KVPress simplifies the process of deploying these cutting-edge techniques, enabling quick and efficient integration into real-world applications.

KVPress in Action

At its core, KVPress leverages presses, which are advanced compression algorithms specifically designed to reduce the memory footprint of the KV Cache.

Many of these presses rely on a per-head score used to prune the least important KV pairs. For instance, KnormPress prunes the KV pairs whose keys have the highest L2 norm (paper), while SnapKVPress prunes the KV pairs that receive low attention weights from the latest queries (paper).

These presses are seamlessly integrated into the attention layers of the model using forward hooks.
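
To give a flavor of what such a hook computes, here is a minimal, standalone sketch of score-based pruning in the spirit of KnormPress. It is an illustration only, not the actual KVPress implementation, which runs per layer inside the attention forward pass and exposes the score through a dedicated press class.

import torch

def prune_kv(keys, values, compression_ratio=0.5):
    """Keep the highest-scoring KV pairs in each head.

    keys, values: (batch, n_heads, seq_len, head_dim)
    The score here is the negative L2 norm of each key (in the spirit of
    KnormPress); every press defines its own scoring rule.
    """
    scores = -keys.norm(dim=-1)                        # (batch, n_heads, seq_len)
    n_keep = int(keys.shape[2] * (1 - compression_ratio))
    idx = scores.topk(n_keep, dim=-1).indices          # indices of the tokens to keep, per head
    idx = idx.unsqueeze(-1).expand(-1, -1, -1, keys.shape[-1])
    return keys.gather(2, idx), values.gather(2, idx)

k = torch.randn(1, 8, 1024, 128)
v = torch.randn(1, 8, 1024, 128)
k_small, v_small = prune_kv(k, v)                      # 512 of the 1024 KV pairs kept per head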

Figure 2: KV Cache compression visualized (Source: NVIDIA)

During text generation, they dynamically compress the KV Cache, reducing memory usage without compromising the model's ability to generate coherent and accurate outputs. Each press is characterized by a compression_ratio attribute, which determines the degree of compression applied to the KV Cache.

These presses integrate seamlessly with a custom transformers pipeline, enabling easy application and experimentation.

Here's how you can use KVPress with one of its many presses, ExpectedAttentionPress, which prunes the KV pairs with the lowest expected attention weight for future queries.

from transformers import pipeline
from kvpress import ExpectedAttentionPress

pipe = pipeline(
    "kv-press-text-generation",
    model="meta-llama/Llama-3.1-8B-Instruct",
    device="cuda",
    model_kwargs={"attn_implementation": "sdpa"},
)

context = "A very long text you want to compress once and for all"
question = "\nA question about the compressed context"  # optional

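# compression_ratio=0.5 prunes roughly half of the KV pairs during pre-filling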
press = ExpectedAttentionPress(compression_ratio=0.5)
answer = pipe(context, question=question, press=press)["answer"]

Try it directly in this Hugging Face Space or in this Google Colab notebook!

By targeting the pre-filling phase, KVPress ensures that the cache is compressed when it's largest—helping reduce memory overhead for sequences with tens of thousands or even millions of tokens.

The plot below shows the GPU memory savings achieved with KVPress compression as prompt length increases. For shorter prompts, most of the memory is allocated to the model weights (approximately 15 GB for Llama 3.1 8B in bfloat16). However, as prompts grow, the KV Cache becomes a major contributor to memory consumption. At a 128k-token context length, applying KVPress with a 50% compression ratio reduces peak memory usage from 45 GB to 37 GB. The smaller KV Cache also improves decoding speed, from 11 to 17 tokens per second on an A100 GPU (source).

Figure 3: Memory usage vs. context length (Source: NVIDIA)

Benchmarks

The research community has been actively developing various techniques for KV cache compression. KVPress encourages researchers to contribute their methods and already provides more than a dozen presses.

To evaluate the performance of these presses, KVPress includes a simple CLI for benchmarking them on standard long-context datasets such as RULER, InfiniteBench, and Loogle. The plot below benchmarks 9 different presses on the RULER dataset with a 4k context length and different compression ratios. The best-performing press on this dataset is a combination of the AdaKVPress (paper) and ExpectedAttentionPress, a new, unpublished pruning technique created by the authors of KVPress (more information here).

Figure 4: Average score vs. compression ratio (Source: NVIDIA)

Conclusion

The growing context windows of LLMs unlock new possibilities but pose significant memory challenges with the linearly scaling KV Cache. KVPress addresses this by compressing the cache during the critical pre-filling phase.

While KVPress improves memory efficiency, higher compression ratios can impact model accuracy, as shown in the benchmark plot. Further research is needed to develop more effective compression algorithms that minimize trade-offs.

With its seamless integration into the transformers library and modular design, KVPress empowers researchers and developers to handle long-context LLMs efficiently and design new compression techniques. It's a practical solution for scaling LLMs without overwhelming memory resources—ensuring innovation stays accessible as models grow.
