mgoin committed (verified)
Commit ce092ac · 1 parent: c6bc9c9

Update README.md

Files changed (1): README.md (+1, −2)
README.md CHANGED

@@ -44,12 +44,11 @@ Only weights and activations of the linear operators within transformers blocks
  Weights are quantized with a symmetric static per-channel scheme, where a fixed linear scaling factor is applied between FP8 and BF16 representations for each output channel dimension.
  Linear scaling factors are computed by minimizing the mean squared error (MSE).
  Activations are quantized with a symmetric dynamic per-token scheme, computing a linear scaling factor at runtime for each token between FP8 and BF16 representations.
- The [GPTQ](https://arxiv.org/abs/2210.17323) algorithm is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.
 
 
  ## Deployment with vLLM
 
- This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend. vLLM aslo supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+ This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend. vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
 
 
  ## Evaluation
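As context for the quantization scheme described in the diff, here is a rough, illustrative PyTorch sketch of choosing a symmetric static per-channel FP8 weight scale by minimizing MSE. The grid search over shrink ratios, the helper names (`fp8_qdq`, `mse_per_channel_scale`), and the E4M3 maximum of 448 are assumptions standing in for llm-compressor's actual MSE observer; they are not taken from this commit.

```python
# Illustrative only: symmetric per-output-channel FP8 (E4M3) weight scales,
# chosen by minimizing the MSE of the quantize/dequantize reconstruction.
import torch

FP8_MAX = 448.0  # largest magnitude representable in float8_e4m3fn (assumes E4M3)

def fp8_qdq(w: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Quantize to FP8 with a per-channel scale, then multiply the scale back."""
    q = (w / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    return q.to(w.dtype) * scale

def mse_per_channel_scale(w: torch.Tensor, steps: int = 20) -> torch.Tensor:
    """w: [out_channels, in_channels] BF16 weights; returns one scale per output channel."""
    amax = w.abs().amax(dim=1, keepdim=True).float().clamp(min=1e-12)
    best_scale = amax / FP8_MAX                    # max-abs scale as the starting candidate
    best_err = (fp8_qdq(w, best_scale) - w).float().pow(2).mean(dim=1, keepdim=True)
    for ratio in torch.linspace(0.5, 1.0, steps):  # shrink the clipping range, keep the best MSE
        scale = amax * ratio / FP8_MAX
        err = (fp8_qdq(w, scale) - w).float().pow(2).mean(dim=1, keepdim=True)
        better = err < best_err
        best_scale = torch.where(better, scale, best_scale)
        best_err = torch.where(better, err, best_err)
    return best_scale

weights = torch.randn(4096, 4096, dtype=torch.bfloat16)
scales = mse_per_channel_scale(weights)            # shape [4096, 1]: one static scale per channel
```

Activations take the complementary path the diff describes: their per-token scale is computed dynamically at runtime by the inference engine rather than stored with the checkpoint.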
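The "Deployment with vLLM" line touched by this commit refers to vLLM's offline and OpenAI-compatible serving paths. A minimal offline-inference sketch follows; the model id, prompt, and sampling settings are placeholders rather than values from the model card.

```python
# Minimal vLLM offline-inference sketch; "<this-repo-model-id>" is a placeholder
# for the actual Hugging Face repository id of this checkpoint.
from vllm import LLM, SamplingParams

model_id = "<this-repo-model-id>"  # substitute the real repo id
llm = LLM(model=model_id)          # vLLM picks up the quantization settings from the checkpoint config
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Summarize FP8 weight quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```

For OpenAI-compatible serving, the same checkpoint can be launched with `vllm serve <model-id>`; see the vLLM documentation linked in the README for details.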