The Snowflake AI Research team is releasing a series of SwiftKV-optimized Llama-3.1 models. [SwiftKV](https://arxiv.org/abs/2410.03960) is a series of inference optimizations that go beyond traditional key-value (KV) cache compression. The method reduces computational overhead during prompt processing by combining model rewiring with knowledge-preserving self-distillation, allowing prefill tokens to skip up to half of the model's layers. SwiftKV achieves up to 2x improvements in throughput, latency, and cost efficiency with minimal accuracy loss, making LLM deployments more performant and economically viable.

For more details about the technique, see:

* Blog: https://www.snowflake.com/engineering-blog/swiftkv-llm-compute-reduction/
* arXiv paper: https://arxiv.org/abs/2410.03960
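To build intuition for the layer-skipping idea, here is a minimal conceptual sketch (not the actual SwiftKV implementation; all names, dimensions, and the toy `layer_forward` are hypothetical): during prefill, the hidden state at a midpoint layer is reused to project KV-cache entries for every remaining layer, so prompt tokens skip the full attention/FFN compute of the model's upper half.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, d_model, n_prompt = 8, 16, 4  # toy sizes, chosen for illustration

# Stand-ins for each layer's real K/V projection weights.
W_k = [rng.standard_normal((d_model, d_model)) for _ in range(n_layers)]
W_v = [rng.standard_normal((d_model, d_model)) for _ in range(n_layers)]

def layer_forward(h):
    # Placeholder for a full transformer layer (attention + FFN).
    return np.tanh(h)

def swiftkv_prefill(x, skip_from):
    """Run prompt tokens through the first `skip_from` layers only, then
    project the midpoint hidden state into KV caches for the remaining
    layers instead of running those layers in full (conceptual sketch)."""
    h = x
    kv_cache = []
    for i in range(skip_from):
        # Normal path: each of the lower layers computes its own K/V.
        kv_cache.append((h @ W_k[i], h @ W_v[i]))
        h = layer_forward(h)
    for i in range(skip_from, n_layers):
        # Skipped path: reuse the midpoint hidden state for all upper layers.
        kv_cache.append((h @ W_k[i], h @ W_v[i]))
    return kv_cache

prompt = rng.standard_normal((n_prompt, d_model))
cache = swiftkv_prefill(prompt, skip_from=n_layers // 2)
print(len(cache))  # one (K, V) pair per layer
```

The upper-half KV projections are still per-layer, but the expensive attention and FFN passes for those layers are skipped during prefill; in the real method, knowledge-preserving self-distillation trains the model so these reused-state caches stay accurate.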
## Eval metrics