jeffra committed on
Commit ab3ca7c · verified · 1 Parent(s): 8b82e1e

Update README.md

Files changed (1)
  1. README.md +8 -5
README.md CHANGED
@@ -8,11 +8,14 @@ base_model:
 
  The Snowflake AI Research team is releasing a series of SwiftKV optimized Llama-3.1 models. [SwiftKV](https://arxiv.org/abs/2410.03960) is a series of inference optimizations that goes beyond traditional key-value (KV) cache compression. This method reduces computational overhead during prompt processing by combining model rewiring and knowledge-preserving self-distillation, allowing prefill tokens to skip up to half the model's layers. SwiftKV achieves up to 2x improvements in throughput, latency, and cost efficiency with minimal accuracy loss, making LLM deployments more performant and economically viable.
 
- For more details about the technique
- * Blog: https://www.snowflake.com/engineering-blog/swiftkv-llm-compute-reduction/
- * arXiv paper: https://arxiv.org/abs/2410.03960
 
- ## Eval metrics
 
  | Llama-3.1-405B-Instruct-FP8 | Arc Challenge | Winogrande | HellaSwag | TruthfulQA | MMLU | MMLU cot | GSM8K | Avg |
  |-----------|---------------|------------|-----------|------------|------|----------|-------|-----|
@@ -24,7 +27,7 @@ For more details about the technique
  | Baseline | 82.00 | 77.90 | 80.40 | 54.56 | 67.90 | 70.63 | 82.56 | **73.71** |
  | 50% SingleInputKV | 80.38 | 78.22 | 79.30 | 54.54 | 67.30 | 69.73 | 79.45 | **72.70** |
 
- ## How to use the models
 
  Instructions on how to use vLLM for both evaluation and performance benchmarks:
  https://github.com/Snowflake-Labs/vllm/tree/swiftkv/examples/swiftkv
 
 
  The Snowflake AI Research team is releasing a series of SwiftKV optimized Llama-3.1 models. [SwiftKV](https://arxiv.org/abs/2410.03960) is a series of inference optimizations that goes beyond traditional key-value (KV) cache compression. This method reduces computational overhead during prompt processing by combining model rewiring and knowledge-preserving self-distillation, allowing prefill tokens to skip up to half the model's layers. SwiftKV achieves up to 2x improvements in throughput, latency, and cost efficiency with minimal accuracy loss, making LLM deployments more performant and economically viable.
 
+ For more details about SwiftKV and how to use it:
+ * ❄️ [SwiftKV: Accelerating Enterprise LLM Workloads with Knowledge Preserving Compute Reduction (blog)](https://www.snowflake.com/engineering-blog/swiftkv-llm-compute-reduction/)
+ * 📝 [SwiftKV: Fast Prefill-Optimized Inference with Knowledge-Preserving Model Transformation (arXiv)](https://arxiv.org/abs/2410.03960)
+ * 🚀 [Getting started guide](https://github.com/Snowflake-Labs/vllm/tree/swiftkv/examples/swiftkv)
 
+ ## Eval Metrics
+
+ For a full breakdown of evaluation metrics and performance impact, please refer to our [blog](https://www.snowflake.com/engineering-blog/swiftkv-llm-compute-reduction/) and [arXiv paper](https://arxiv.org/abs/2410.03960). Some relevant evaluation metrics are outlined below.
 
  | Llama-3.1-405B-Instruct-FP8 | Arc Challenge | Winogrande | HellaSwag | TruthfulQA | MMLU | MMLU cot | GSM8K | Avg |
  |-----------|---------------|------------|-----------|------------|------|----------|-------|-----|
 
  | Baseline | 82.00 | 77.90 | 80.40 | 54.56 | 67.90 | 70.63 | 82.56 | **73.71** |
  | 50% SingleInputKV | 80.38 | 78.22 | 79.30 | 54.54 | 67.30 | 69.73 | 79.45 | **72.70** |
 
+ ## Getting Started
 
  Instructions on how to use vLLM for both evaluation and performance benchmarks:
  https://github.com/Snowflake-Labs/vllm/tree/swiftkv/examples/swiftkv
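
Reviewer note: the `Avg` column in the `Baseline` and `50% SingleInputKV` rows shown in this diff can be sanity-checked with a few lines of Python. This is only a sketch. It assumes `Avg` is the unweighted mean of the seven benchmark columns, which the README does not state explicitly, and the variable names are illustrative.

```python
# Scores copied from the two table rows visible in the diff, in column order.
BENCHMARKS = ["Arc Challenge", "Winogrande", "HellaSwag", "TruthfulQA",
              "MMLU", "MMLU cot", "GSM8K"]
baseline = [82.00, 77.90, 80.40, 54.56, 67.90, 70.63, 82.56]
swiftkv = [80.38, 78.22, 79.30, 54.54, 67.30, 69.73, 79.45]


def mean(scores):
    """Unweighted arithmetic mean (assumed to be how Avg was computed)."""
    return sum(scores) / len(scores)


# Round to two decimals to match the precision used in the table.
baseline_avg = round(mean(baseline), 2)
swiftkv_avg = round(mean(swiftkv), 2)

# Per-benchmark change of the SwiftKV row relative to the baseline row.
deltas = {name: round(s - b, 2)
          for name, b, s in zip(BENCHMARKS, baseline, swiftkv)}

print("Baseline Avg:", baseline_avg)
print("50% SingleInputKV Avg:", swiftkv_avg)
for name, delta in deltas.items():
    print(f"{name}: {delta:+.2f}")
```

Under that assumption the recomputed means agree with the bolded `Avg` values in the table (73.71 and 72.70), and the per-benchmark deltas show where the accuracy difference is concentrated.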