neuralmagic/Meta-Llama-3-8B-Instruct-FP8

---
tags:
- fp8
---


Meta-Llama-3-8B-Instruct quantized to FP8 weights and activations using per-tensor quantization, ready for inference with vLLM >= 0.5.0.
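
A minimal sketch of offline inference with vLLM's Python API. The repository id is an assumption taken from this page's path (the evaluation runs in this card load `nm-testing/Meta-Llama-3-8B-Instruct-FP8`), `quantization="fp8"` mirrors the flag used in those runs, and the `generate` helper and sampling settings are illustrative only. Running it requires vLLM >= 0.5.0 and a supported GPU.

```python
# Assumed repository id; adjust it if the checkpoint lives elsewhere.
MODEL_ID = "neuralmagic/Meta-Llama-3-8B-Instruct-FP8"


def generate(prompts, max_tokens=64):
    """Hypothetical helper: greedy completions for a list of raw prompts."""
    # Imported lazily so this file can be imported on a machine without vLLM.
    from vllm import LLM, SamplingParams

    # quantization="fp8" mirrors the setting used in the evaluation runs.
    llm = LLM(model=MODEL_ID, quantization="fp8")
    params = SamplingParams(temperature=0.0, max_tokens=max_tokens)
    outputs = llm.generate(prompts, params)
    # Each request returns one or more candidates; keep the first.
    return [out.outputs[0].text for out in outputs]
```

Calling `generate(["The capital of France is"])` would return one completion string per prompt. This sketch passes raw prompts; for chat-style use, apply the Llama 3 chat template to the prompt first.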

Produced using [AutoFP8 with calibration samples from ultrachat](https://github.com/neuralmagic/AutoFP8/blob/147fa4d9e1a90ef8a93f96fc7d9c33056ddc017a/example_dataset.py).
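
To illustrate what per-tensor quantization means, here is a pure-Python sketch that is not AutoFP8's code. It assumes the E4M3 FP8 format (maximum finite value 448) and takes the scale from the largest absolute value of the tensor itself, whereas a calibration-based flow would derive activation scales from sample data. All function names are hypothetical.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the assumed E4M3 format


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3 value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), FP8_E4M3_MAX)
    # frexp gives a = m * 2**e with 0.5 <= m < 1, so floor(log2(a)) == e - 1.
    exponent = max(math.frexp(a)[1] - 1, -6)  # -6: smallest normal exponent
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits: 8 steps per power of two
    q = round(a / step) * step  # Python's round() rounds half to even
    return math.copysign(min(q, FP8_E4M3_MAX), x)


def quantize_per_tensor(values):
    """Quantize a flat list of floats with one shared scale for the tensor."""
    amax = max(abs(v) for v in values)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def dequantize(quantized, scale):
    """Map FP8 values back to the original range."""
    return [q * scale for q in quantized]


weights = [0.02, -0.5, 0.13, 1.75, -0.004]
quantized, scale = quantize_per_tensor(weights)
restored = dequantize(quantized, scale)
# quantized -> [5.0, -128.0, 32.0, 448.0, -1.0], scale -> 0.00390625
```

Because a single scale covers the whole tensor, the largest element maps exactly to 448 and smaller elements share the remaining resolution, which is why `0.13` comes back as `0.125` here.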

Accuracy on MMLU (5-shot), measured with vLLM for the original model and then for the FP8 checkpoint:
```
vllm (pretrained=meta-llama/Meta-Llama-3-8B-Instruct,gpu_memory_utilization=0.4), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 16
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.6569|±  |0.0038|
| - humanities     |N/A    |none  |     5|acc   |0.6049|±  |0.0068|
| - other          |N/A    |none  |     5|acc   |0.7203|±  |0.0078|
| - social_sciences|N/A    |none  |     5|acc   |0.7663|±  |0.0075|
| - stem           |N/A    |none  |     5|acc   |0.5652|±  |0.0085|

vllm (pretrained=nm-testing/Meta-Llama-3-8B-Instruct-FP8,quantization=fp8,gpu_memory_utilization=0.4), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: 16
|      Groups      |Version|Filter|n-shot|Metric|Value |   |Stderr|
|------------------|-------|------|-----:|------|-----:|---|-----:|
|mmlu              |N/A    |none  |     0|acc   |0.6567|±  |0.0038|
| - humanities     |N/A    |none  |     5|acc   |0.6072|±  |0.0068|
| - other          |N/A    |none  |     5|acc   |0.7206|±  |0.0078|
| - social_sciences|N/A    |none  |     5|acc   |0.7618|±  |0.0075|
| - stem           |N/A    |none  |     5|acc   |0.5649|±  |0.0085|
```
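
The two tables can be compared directly. The sketch below copies the reported accuracies and standard errors and computes the per-group difference and the fraction of baseline accuracy retained. Checking each difference against one reported standard error is only a rough yardstick chosen for this example, not a significance test, and the `compare` helper is hypothetical.

```python
# (accuracy, stderr) per group, copied from the two tables above.
baseline = {
    "mmlu": (0.6569, 0.0038),
    "humanities": (0.6049, 0.0068),
    "other": (0.7203, 0.0078),
    "social_sciences": (0.7663, 0.0075),
    "stem": (0.5652, 0.0085),
}
fp8 = {
    "mmlu": (0.6567, 0.0038),
    "humanities": (0.6072, 0.0068),
    "other": (0.7206, 0.0078),
    "social_sciences": (0.7618, 0.0075),
    "stem": (0.5649, 0.0085),
}


def compare(base, quant):
    """Return {group: (delta, recovery, within_one_stderr)}."""
    rows = {}
    for group, (base_acc, base_se) in base.items():
        quant_acc, _ = quant[group]
        delta = quant_acc - base_acc
        rows[group] = (delta, quant_acc / base_acc, abs(delta) <= base_se)
    return rows


for group, (delta, recovery, within) in compare(baseline, fp8).items():
    print(f"{group:16s} delta={delta:+.4f} recovery={recovery:.2%} within_stderr={within}")
```

Every group differs by less than one reported standard error. The largest gap is social_sciences at -0.0045, and the overall MMLU score changes by -0.0002.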