mgoin committed (verified)
Commit ce092ac · 1 parent: c6bc9c9

Update README.md

Files changed (1): README.md (+1, −2)
README.md CHANGED

@@ -44,12 +44,11 @@ Only weights and activations of the linear operators within transformers blocks
  Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between FP8 and BF16 representations for each output channel dimension.
  Linear scaling factors are computed by minimizing the mean squared error (MSE).
  Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between FP8 and BF16 representations.
- The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
 
 
  ## Deployment with vLLM
 
- This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend. vLLM aslo supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend. vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
 
 
  ## Evaluation
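As context for the quantization scheme described in the diff, here is a rough, illustrative PyTorch sketch of choosing a symmetric static per-channel FP8 weight scale by minimizing MSE. The grid search over shrink ratios, the helper names (`fp8_qdq`, `mse_per_channel_scale`), and the E4M3 maximum of 448 are assumptions standing in for llm-compressor's actual MSE observer; they are not taken from this commit.

```python
# Illustrative only: symmetric per-output-channel FP8 (E4M3) weight scales,
# chosen by minimizing the MSE of the quantize/dequantize reconstruction.
import torch

FP8_MAX = 448.0  # largest magnitude representable in float8_e4m3fn (assumes E4M3)

def fp8_qdq(w: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Quantize to FP8 with a per-channel scale, then multiply the scale back."""
    q = (w / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q.to(w.dtype) * scale

def mse_per_channel_scale(w: torch.Tensor, steps: int = 20) -> torch.Tensor:
    """w: [out_channels, in_channels] BF16 weights; returns one scale per output channel."""
    amax = w.abs().amax(dim=1, keepdim=True).float().clamp(min=1e-12)
    best_scale = amax / FP8_MAX                    # max-abs scale as the starting candidate
    best_err = (fp8_qdq(w, best_scale) - w).float().pow(2).mean(dim=1, keepdim=True)
    for ratio in torch.linspace(0.5, 1.0, steps):  # shrink the clipping range, keep the best MSE
        scale = amax * ratio / FP8_MAX
        err = (fp8_qdq(w, scale) - w).float().pow(2).mean(dim=1, keepdim=True)
        better = err < best_err
        best_scale = torch.where(better, scale, best_scale)
        best_err = torch.where(better, err, best_err)
    return best_scale

weights = torch.randn(4096, 4096, dtype=torch.bfloat16)
scales = mse_per_channel_scale(weights)            # shape [4096, 1]: one static scale per channel
```

Activations take the complementary path the diff describes: their per-token scale is computed dynamically at runtime by the inference engine rather than stored with the checkpoint.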
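The "Deployment with vLLM" line touched by this commit refers to vLLM's offline and OpenAI-compatible serving paths. A minimal offline-inference sketch follows; the model id, prompt, and sampling settings are placeholders rather than values from the model card.

```python
# Minimal vLLM offline-inference sketch; "<this-repo-model-id>" is a placeholder
# for the actual Hugging Face repository id of this checkpoint.
from vllm import LLM, SamplingParams

model_id = "<this-repo-model-id>"  # substitute the real repo id
llm = LLM(model=model_id)          # vLLM picks up the quantization settings from the checkpoint config
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Summarize FP8 weight quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```

For OpenAI-compatible serving, the same checkpoint can be launched with `vllm serve <model-id>`; see the vLLM documentation linked in the README for details.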