Llama 3 8B Instruct quantized to the Q40 format supported by Distributed Llama.
License
Before downloading this repository, please accept the Llama 3 Community License.
How to run
- Clone this repository.
- Clone Distributed Llama:

```
git clone https://github.com/b4rtaz/distributed-llama.git
```

- Build Distributed Llama:

```
make main
```

- Run Distributed Llama:

```
./main inference --prompt "Hello world" --steps 128 --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --model path/to/dllama_meta-llama-3-8b_q40.bin --tokenizer path/to/dllama_meta-llama3-tokenizer.t
```
Chat Template
Please keep in mind that this model expects prompts to follow the Llama 3 chat template.
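As a minimal sketch, a single-turn prompt in the Llama 3 chat template can be assembled like this (the special tokens are part of the Llama 3 prompt format; the message text and the way it is passed to `--prompt` are illustrative):

```shell
#!/bin/sh
# Wrap a user message in the Llama 3 chat template (message text is illustrative).
USER_MSG="Hello world"
PROMPT="<|begin_of_text|><|start_header_id|>user<|end_header_id|>

${USER_MSG}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"
# The assembled prompt could then be passed via --prompt, e.g.:
#   ./main inference --prompt "$PROMPT" ... (other flags as above)
printf '%s' "$PROMPT"
```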