ibnzterrell
committed on
Update compatibility in README.md
README.md CHANGED
@@ -28,7 +28,7 @@ base_model:
 
 This model was quantized using [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) from FP16 down to INT4 using GEMM kernels, with zero-point quantization and a group size of 128.
 
-Hardware: Intel Xeon CPU E5-2699A v4 @ 2.40GHz, 256GB of RAM, and 2x NVIDIA RTX 3090.
+Hardware: Intel Xeon CPU E5-2699A v4 @ 2.40GHz, 256GB of RAM, and 2x NVIDIA RTX 3090. This should work on any platform that supports Llama 3.1 70B Instruct AWQ INT4.
 
 Model usage (inference) information for Transformers, AutoAWQ, Text Generation Inference (TGI), and vLLM, as well as quantization reproduction details, are below.
 
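The settings named in the diff (INT4 weights, GEMM kernels, zero-point quantization, group size 128) map directly onto AutoAWQ's `quant_config` dictionary. A minimal reproduction sketch under that assumption; the model and output paths are illustrative, since the commit itself does not include this script:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Illustrative paths (assumptions, not taken from the commit).
model_path = "meta-llama/Meta-Llama-3.1-70B-Instruct"
quant_path = "Meta-Llama-3.1-70B-Instruct-AWQ-INT4"

# Matches the settings described in the README: 4-bit weights,
# zero-point quantization, group size 128, GEMM kernels.
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}

# Load the FP16 base model and its tokenizer.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Quantize, then persist the INT4 model and tokenizer.
model.quantize(tokenizer, quant_config=quant_config)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

Per the AutoAWQ documentation, the GEMM kernel variant used here generally performs better at larger batch sizes, while the alternative GEMV variant can be faster at batch size 1.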