Update README.md
---
license: llama3.1
tags:
- java
- llama
- llama3
- gguf
- llama3.java
---

# GGUF models for llama3.java

Pure `.gguf` `Q4_0` and `Q8_0` quantizations of Llama 3 8B Instruct, ready to be consumed by [llama3.java](https://github.com/mukel/llama3.java).

In the wild, `Q8_0` quantizations are fine, but `Q4_0` quantizations are rarely pure, e.g. the `output.weights` tensor is quantized with `Q6_K` instead of `Q4_0`.
A pure `Q4_0` quantization can be generated from a high-precision (F32, F16, BFLOAT16) `.gguf` source with the `quantize` utility from llama.cpp as follows:

```
./quantize --pure ./Meta-Llama-3-8B-Instruct-F32.gguf ./Meta-Llama-3-8B-Instruct-Q4_0.gguf Q4_0
```
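
To double-check that a downloaded file is really pure, the tensor types can be read straight out of the GGUF header. Below is a minimal, self-contained Java sketch, not part of llama3.java; the class name and type table are illustrative, and it assumes the GGUF v3 layout (magic, version, tensor count, metadata KVs, then tensor infos). It prints the quantization type of every tensor:

```
// PureCheck.java -- illustrative GGUF inspector (hypothetical helper, not part of llama3.java).
// Parses just enough of the GGUF header to print each tensor's ggml type,
// so a "pure" Q4_0 file can be checked for stray Q6_K tensors. Requires Java 17+.
import java.io.*;
import java.nio.charset.StandardCharsets;

public class PureCheck {
    // ggml type ids as used by llama.cpp (ids 4 and 5 are retired formats).
    static final String[] TYPES = {
        "F32", "F16", "Q4_0", "Q4_1", "Q4_2", "Q4_3", "Q5_0", "Q5_1",
        "Q8_0", "Q8_1", "Q2_K", "Q3_K", "Q4_K", "Q5_K", "Q6_K", "Q8_K"
    };

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PureCheck.java <model.gguf>");
            return;
        }
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(args[0])))) {
            if (readU32(in) != 0x46554747) throw new IOException("not a GGUF file"); // "GGUF" little-endian
            long version = readU32(in);
            long tensorCount = readU64(in);
            long kvCount = readU64(in);
            System.out.printf("GGUF v%d, %d tensors%n", version, tensorCount);
            // Skip the metadata key/value section to reach the tensor infos.
            for (long i = 0; i < kvCount; i++) { readString(in); skipValue(in, readU32(in)); }
            for (long i = 0; i < tensorCount; i++) {
                String name = readString(in);
                int nDims = (int) readU32(in);
                for (int d = 0; d < nDims; d++) readU64(in); // dimensions
                int type = (int) readU32(in);
                readU64(in);                                 // data offset
                System.out.printf("%-40s %s%n", name, type < TYPES.length ? TYPES[type] : "type#" + type);
            }
        }
    }

    // All GGUF integers are little-endian; DataInputStream reads big-endian, so reverse.
    static long readU32(DataInputStream in) throws IOException { return Integer.toUnsignedLong(Integer.reverseBytes(in.readInt())); }
    static long readU64(DataInputStream in) throws IOException { return Long.reverseBytes(in.readLong()); }
    static String readString(DataInputStream in) throws IOException {
        byte[] b = new byte[(int) readU64(in)]; // uint64 length, then UTF-8 bytes
        in.readFully(b);
        return new String(b, StandardCharsets.UTF_8);
    }
    // Skip a metadata value of the given GGUF value type (recursing into arrays).
    static void skipValue(DataInputStream in, long t) throws IOException {
        switch ((int) t) {
            case 0, 1, 7    -> in.skipNBytes(1); // uint8, int8, bool
            case 2, 3       -> in.skipNBytes(2); // uint16, int16
            case 4, 5, 6    -> in.skipNBytes(4); // uint32, int32, float32
            case 10, 11, 12 -> in.skipNBytes(8); // uint64, int64, float64
            case 8          -> readString(in);   // string
            case 9          -> {                 // array: element type, count, elements
                long elemType = readU32(in);
                long count = readU64(in);
                for (long i = 0; i < count; i++) skipValue(in, elemType);
            }
            default -> throw new IOException("unknown value type " + t);
        }
    }
}
```

With a recent JDK it can be run directly, e.g. `java PureCheck.java ./Meta-Llama-3-8B-Instruct-Q4_0.gguf`. A pure file should report `Q4_0` for every quantized tensor; small 1-D tensors such as norms typically stay in `F32`.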

# Meta-Llama-3.1-8B-Instruct-GGUF

- This is a GGUF quantized version of [meta-llama/Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct), created using llama.cpp.