ekurtic committed on
Commit f4dbba5 · verified · 1 Parent(s): 81758d1

Update README

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -42,7 +42,7 @@ This model was obtained by quantizing the weights and activations of [Meta-Llama
  This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%. In particular, this model can now be loaded and evaluated with a single node of 8xH100 GPUs, as opposed to multiple nodes.
 
  Only the weights and activations of the linear operators within transformers blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps the FP8 representations of the quantized weights and activations. Activations are also quantized on a per-token dynamic basis.
- [LLM Compressor](https://github.com/vllm-project/llm-compressor) is used for quantization with 512 sequences of UltraChat.
+ [LLM Compressor](https://github.com/vllm-project/llm-compressor) is used for quantization.
 
  ## Deployment
 
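
The scheme described in the README context lines (symmetric per-channel scales for weights, per-token dynamic scales for activations) can be illustrated with a minimal pure-Python sketch. This is not LLM Compressor's implementation. The function names `symmetric_scales` and `scale_and_clamp` are hypothetical, the E4M3 variant of FP8 (largest finite value 448) is an assumption since the README only says "FP8", and rounding to the actual FP8 grid is omitted.

```python
# Illustrative sketch only; NOT LLM Compressor's code.
# Assumption: FP8 means the E4M3 format, whose largest finite value is 448.
FP8_E4M3_MAX = 448.0


def symmetric_scales(rows):
    """Return one scale per row so that max |x| in the row maps to FP8_E4M3_MAX.

    Symmetric means there is no zero point: the range is [-max, +max].
    """
    scales = []
    for row in rows:
        amax = max(abs(v) for v in row)
        # Guard against an all-zero row to avoid dividing by zero later.
        scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
    return scales


def scale_and_clamp(rows, scales):
    """Divide each row by its scale and clamp into the FP8 range.

    A real kernel would also round to the nearest representable FP8 value;
    that step is left out of this sketch.
    """
    return [
        [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / s)) for v in row]
        for row, s in zip(rows, scales)
    ]


# Weights: each row is one output channel; scales are computed once, offline.
weights = [[0.5, -2.0, 1.0], [0.1, 0.05, -0.2]]
w_scales = symmetric_scales(weights)
w_scaled = scale_and_clamp(weights, w_scales)

# Activations: each row is one token; scales are recomputed for every token
# at inference time, which is what "dynamic" refers to.
activations = [[3.0, -1.5, 0.0], [0.0, 0.0, 0.0]]
a_scales = symmetric_scales(activations)
a_scaled = scale_and_clamp(activations, a_scales)
```

Because activation scales are derived from each token at runtime in this sketch, it needs no calibration statistics for activations; whether that is the reason the calibration-data mention was dropped from the README is not stated in the commit.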