Commit 95afeae (verified) by alancucki · 1 Parent(s): 62cf9e3

Update README.md

Files changed (1)
  1. README.md +149 -6
README.md CHANGED
@@ -1,6 +1,149 @@
- ---
- license: other
- license_name: nvidia-open-model-license
- license_link: >-
-   https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
- ---
+ ---
+ license: other
+ license_name: nvidia-open-model-license
+ license_link: >-
+   https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf
+ base_model:
+ - meta-llama/Llama-2-7b
+ tags:
+ - nvidia
+ - llama 2
+ - pytorch
+ - kvcache
+ library_name: megatron-lm
+ ---
+ # Llama-2-7B-DMC-4x
+
+ ## Description
+
+ Llama-2-7B-DMC-4x is a version of [Llama 2 7B](https://www.llama.com/llama2/) that has been trained to apply the Dynamic Memory Compression (DMC) algorithm ([https://arxiv.org/abs/2403.09636](https://arxiv.org/abs/2403.09636)). With DMC, the model compresses its key–value cache on-line at inference time, which substantially improves throughput and/or latency compared with the uncompressed model. Most importantly, it learns to apply different compression ratios in different heads and layers. The source code for training and inference is provided in the `dmc` branch of the [Megatron-LM](https://github.com/NVIDIA/Megatron-LM/tree/dmc) repository.
+
+ This model is for research and development only.
+
+ ### License
+
+ GOVERNING TERMS: This model is governed by the NVIDIA Open Model License Agreement (found at https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf). <br>
+ Additional Information: LLAMA 2 COMMUNITY LICENSE AGREEMENT (found at https://huggingface.co/meta-llama/Llama-2-7b/blob/main/LICENSE.txt). <br>
+
+ ## Reference
+ [Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference](https://arxiv.org/abs/2403.09636)
+
+ ## Model Architecture
+
+ Llama-2-7B-DMC-4x uses an embedding size of 4096, 32 attention heads, an MLP intermediate dimension of 11008, and 32 layers in total. Additionally, it uses Rotary Position Embeddings (RoPE).
+
+ **Architecture Type:** Transformer Decoder (Auto-regressive Language Model)
+
+ **Network Architecture:** Llama 2 7B
+
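To illustrate why key–value cache compression matters for an architecture of this size, the following back-of-the-envelope sketch estimates the cache footprint per token from the figures above. It is only an approximation under stated assumptions: the cache is stored in bfloat16 (2 bytes per value), every layer keeps one key and one value vector of the full embedding size, and the 4x compression ratio is treated as an exact average. The function name `kv_cache_bytes_per_token` is ours and is not part of Megatron-LM.

```python
def kv_cache_bytes_per_token(num_layers: int,
                             hidden_size: int,
                             bytes_per_value: int = 2,
                             compression_ratio: float = 1.0) -> float:
    """Approximate number of KV-cache bytes stored per token.

    Assumes one key and one value vector of size `hidden_size` per layer
    (hence the factor 2) and `bytes_per_value` bytes per element
    (2 for bfloat16, which is an assumption about the cache dtype).
    """
    return 2 * num_layers * hidden_size * bytes_per_value / compression_ratio


# Figures taken from the architecture description: 32 layers, embedding size 4096.
baseline = kv_cache_bytes_per_token(num_layers=32, hidden_size=4096)
dmc_4x = kv_cache_bytes_per_token(num_layers=32, hidden_size=4096,
                                  compression_ratio=4.0)

print(f"baseline: {baseline / 1024:.0f} KiB per token")
print(f"DMC 4x:   {dmc_4x / 1024:.0f} KiB per token")
```

Under these assumptions the uncompressed cache takes 512 KiB per token, so a full 4096-token sequence occupies about 2 GiB, while an average 4x compression brings this down to roughly 0.5 GiB per sequence.
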
+ ## Input
+ **Input Type:** Text <br>
+ **Input Format:** String <br>
+ **Input Parameters:** One Dimensional (1D), Temperature <br>
+ **Other Properties Related to Input:** Max Input Tokens: 4096 <br>
+
+ ## Output
+ **Output Type:** Text <br>
+ **Output Format:** String <br>
+ **Output Parameters:** One Dimensional (1D) <br>
+ **Other Properties Related to Output:** Max Output Tokens: 4096 <br>
+
+ ## Software Integration
+ **Runtime Engine(s):**
+ * Not Applicable (N/A)
+
+ The model weights are distributed in bfloat16 format. However, they can be converted to other formats in order to run on other hardware microarchitectures.
+
+ **Supported Hardware Microarchitecture Compatibility:** <br>
+ * NVIDIA Ampere <br>
+ * NVIDIA Hopper <br>
+ * NVIDIA Blackwell <br>
+
+ **Preferred/Supported Operating System(s):** <br>
+ * Linux <br>
+
+ ## Model Version(s)
+ Llama 2 7B DMC 4x v1.0
+
+ # Training and Evaluation Datasets
+
+ ## Training Dataset
+
+ The model was trained for 18,000 steps with a batch size of 1024, a sequence length of 4096, and a learning rate of 3e-5, with a compression objective that increased over the course of training. Afterwards, it was trained for an additional 2,000 steps with a fixed compression rate of 4x and a smaller learning rate of 3e-6.
+
+ NVIDIA models are trained on a diverse set of public and proprietary datasets. This particular model was trained on a dataset containing a mixture of texts in English and 37 programming languages.
+
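As a rough sanity check of the training budget described above, the sketch below converts steps into tokens. It assumes that the batch size counts sequences and that every sequence is filled to the full length of 4096 tokens; neither is stated explicitly in this card, so the totals are upper-bound estimates. The helper name `tokens_seen` is ours.

```python
def tokens_seen(steps: int, batch_size: int, seq_len: int) -> int:
    """Number of tokens processed, assuming fully packed sequences."""
    return steps * batch_size * seq_len


# Phase 1: increasing compression objective (18,000 steps).
phase_1 = tokens_seen(steps=18_000, batch_size=1024, seq_len=4096)
# Phase 2: fixed 4x compression rate (2,000 steps).
phase_2 = tokens_seen(steps=2_000, batch_size=1024, seq_len=4096)

print(f"phase 1: {phase_1 / 1e9:.1f}B tokens")
print(f"phase 2: {phase_2 / 1e9:.1f}B tokens")
print(f"total:   {(phase_1 + phase_2) / 1e9:.1f}B tokens")
```

Under these assumptions the two phases cover about 75.5B and 8.4B tokens respectively, or roughly 83.9B tokens in total.
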
+ ## Evaluation
+
+ | Category    | Benchmark                                   | # Shots | Llama 2 7B | Llama 2 7B DMC 4x |
+ |:------------|:--------------------------------------------|--------:|-----------:|------------------:|
+ | General     | [MMLU](https://openreview.net/forum?id=d7KBjmI3GmQ) | 5 | 46.7 | 44.2 |
+ | Math        | [GSM8K](https://arxiv.org/abs/2110.14168) | 5 | 11.9 | 12.6 |
+ | Commonsense | [HellaSwag](https://aclanthology.org/P19-1472) | 10 | 78.8 | 78.9 |
+ | Commonsense | [ARC-Easy](https://arxiv.org/abs/1803.05457) | 0 | 73.1 | 71.8 |
+ | Commonsense | [ARC-Challenge](https://arxiv.org/abs/1803.05457) | 25 | 53.1 | 52.5 |
+ | Commonsense | [PIQA](https://ojs.aaai.org/index.php/AAAI/article/view/6239) | 0 | 78.2 | 79.5 |
+ | Commonsense | [WinoGrande](https://ojs.aaai.org/index.php/AAAI/article/view/6399) | 5 | 74.0 | 73.2 |
+
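The table can be summarized with a short script that computes the per-benchmark difference between the two models and the unweighted mean difference. This is only a convenience sketch: the scores are copied from the table above, and the unweighted mean is our own choice of summary, not a metric reported by the authors.

```python
# (benchmark, Llama 2 7B score, Llama 2 7B DMC 4x score), copied from the table.
scores = [
    ("MMLU", 46.7, 44.2),
    ("GSM8K", 11.9, 12.6),
    ("HellaSwag", 78.8, 78.9),
    ("ARC-Easy", 73.1, 71.8),
    ("ARC-Challenge", 53.1, 52.5),
    ("PIQA", 78.2, 79.5),
    ("WinoGrande", 74.0, 73.2),
]

# Positive delta means the DMC model scores higher than the base model.
deltas = {name: dmc - base for name, base, dmc in scores}
mean_delta = sum(deltas.values()) / len(deltas)
largest_drop = min(deltas, key=deltas.get)

for name, delta in deltas.items():
    print(f"{name:<14} {delta:+.1f}")
print(f"mean delta: {mean_delta:+.2f}")
print(f"largest drop: {largest_drop}")
```

On these numbers the DMC model is within about half a point of the base model on average, with the largest single drop on MMLU.
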
+ ## AI Safety Efforts
+
+ The Llama-2-7B-DMC-4x model underwent AI safety evaluation, including adversarial testing via three distinct methods:
+ - [Garak](https://github.com/leondz/garak) is an automated LLM vulnerability scanner that probes for common weaknesses, including prompt injection and data leakage.
+ - [AEGIS](https://huggingface.co/datasets/nvidia/Aegis-AI-Content-Safety-Dataset-1.0) is a content safety evaluation dataset and LLM-based content safety classifier model that adheres to a broad taxonomy of 13 categories of critical risks in human-LLM interactions.
+ - Human Content Red Teaming, which leverages human interaction and evaluation of the model's responses.
+
+ ## Inference
+ **Engine:** Megatron-LM <br>
+ **Test Hardware:** H100-80GB <br>
+
+ We recommend running the provided code inside a [PyTorch NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch).
+
+ 1. First, download a [PyTorch NGC Container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch) using Docker.
+ The code below has been tested with the `24.04-py3` version of the container.
+
+ 2. After setting up the container, clone the repository and install the dependencies:
+ ```bash
+ git clone -b dmc https://github.com/NVIDIA/Megatron-LM
+ cd Megatron-LM
+ pip install -r requirements.txt
+ ```
+ 3. Download the [Llama 2 tokenizer](https://huggingface.co/meta-llama/Llama-2-7b/blob/main/tokenizer.model) and save it to a location of your choice, `<TOKENIZER_MODEL>`.
+
+ 4. Download a selected checkpoint and save it to a location of your choice, `<DMC_MODEL>`.
+
+ 5. We provide code to run and benchmark simple auto-regressive inference. Save a single prompt in a text file and run:
+ ```bash
+ ./examples/dmc/inference.sh 7B <DMC_MODEL> <TOKENIZER_MODEL> <PROMPT_TXT_FILE>
+ ```
+
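The last step can be prepared with a small helper like the one below, which writes a prompt to a text file and assembles the command line for `inference.sh` without executing it. This is a minimal sketch, not part of the repository: the checkpoint and tokenizer paths, the prompt text, and the helper name `build_inference_command` are hypothetical placeholders, and the argument order simply mirrors the command shown in step 5.

```python
import shlex
from pathlib import Path


def build_inference_command(dmc_model, tokenizer_model, prompt_file,
                            model_size="7B"):
    """Assemble the shell command from step 5 as a single, safely quoted string."""
    return shlex.join([
        "./examples/dmc/inference.sh",
        model_size,
        str(dmc_model),
        str(tokenizer_model),
        str(prompt_file),
    ])


# Hypothetical prompt and locations; replace them with your own.
prompt_file = Path("prompt.txt")
prompt_file.write_text("The capital of France is", encoding="utf-8")

command = build_inference_command(
    dmc_model="checkpoints/llama2-7b-dmc-4x",
    tokenizer_model="tokenizers/tokenizer.model",
    prompt_file=prompt_file,
)
print(command)
```

The printed command can then be run from the root of the cloned `Megatron-LM` checkout inside the container.
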
+ ## Ethical Considerations
+
+ NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
+
+ Please report security vulnerabilities or NVIDIA AI Concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).
+
+ ## Limitations
+
+ The model was trained on data that contains toxic language and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses, especially when prompted with toxic prompts. The model may generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, producing socially unacceptable or undesirable text even if the prompt itself does not include anything explicitly offensive. This issue could be exacerbated without the use of the recommended prompt template. If you are going to use this model in an agentic workflow, validate that the imported packages are from a trusted source to ensure end-to-end security.
+
+ ## Citation
+
+ If you find this model useful, please cite the following work:
+
+ ```bibtex
+ @InProceedings{pmlr-v235-nawrot24a,
+ title = {Dynamic Memory Compression: Retrofitting {LLM}s for Accelerated Inference},
+ author = {Nawrot, Piotr and {\L}a\'{n}cucki, Adrian and Chochowski, Marcin and Tarjan, David and Ponti, Edoardo},
+ booktitle = {Proceedings of the 41st International Conference on Machine Learning},
+ pages = {37396--37412},
+ year = {2024},
+ editor = {Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix},
+ volume = {235},
+ series = {Proceedings of Machine Learning Research},
+ month = {21--27 Jul},
+ publisher = {PMLR},
+ pdf = {https://raw.githubusercontent.com/mlresearch/v235/main/assets/nawrot24a/nawrot24a.pdf},
+ url = {https://proceedings.mlr.press/v235/nawrot24a.html},
+ abstract = {Transformers have emerged as the backbone of large language models (LLMs). However, generation remains inefficient due to the need to store in memory a cache of key–value representations for past tokens, whose size scales linearly with the input sequence length and batch size. As a solution, we propose Dynamic Memory Compression (DMC), a method for on-line key–value cache compression at inference time. Most importantly, the model learns to apply different compression ratios in different heads and layers. We retrofit pre-trained LLMs such as Llama 2 (7B, 13B and 70B) into DMC Transformers, achieving up to $\sim 3.7 \times$ throughput increase during auto-regressive inference on an NVIDIA H100 GPU. DMC is applied via continued pre-training on a negligible percentage of the original data without adding any extra parameters. We find that DMC preserves the original downstream performance with up to 4$\times$ cache compression, outperforming up-trained grouped-query attention (GQA) and key–value eviction policies (H$_2$O, TOVA). GQA and DMC can be even combined to obtain compounded gains. As a result DMC fits longer contexts and larger batches within any given memory budget. We release the DMC code and models at https://github.com/NVIDIA/Megatron-LM/tree/DMC.}
+ }
+ ```