eachadea/ggml-vicuna-7b-1.1

eachadea committed on Apr 27, 2023

Commit 4c0bf74 • 1 Parent(s): 376d070

Update README.md

Files changed (1)

README.md +33 -11

@@ -3,18 +3,40 @@ license: apache-2.0
 inference: true
 ---
-**NOTE: This GGML conversion is primarily for use with llama.cpp.**
 - PR #896 was used for q4_0. Everything else is latest as of upload time.
-- A warning for q4_2 and q4_3: These are WIP. Do not expect any kind of backwards compatibility until they are finalized.
-- 13B can be found here: https://huggingface.co/eachadea/ggml-vicuna-13b-1.1
-- **Choosing the right model:**
-  - `ggml-vicuna-7b-1.1-q4_0` - Fast, lacks in accuracy.
-  - `ggml-vicuna-7b-1.1-q4_1` - More accurate, lacks in speed.
-  - `ggml-vicuna-7b-1.1-q4_2` - Pretty much a better `q4_0`. Similarly fast, but more accurate.
-  - `ggml-vicuna-7b-1.1-q4_3` - Pretty much a better `q4_1`. More accurate, still slower.
-  - `ggml-vicuna-7b-1.0-uncensored` - Available in `q4_2` and `q4_3`, is an uncensored/unfiltered variant of the model. It is based on the previous release and still uses the `### Human:` syntax. Avoid unless you need it.
 ---

 inference: true
 ---
+### Links
+- [13B version of this model](https://huggingface.co/eachadea/ggml-vicuna-13b-1.1)
+- [Set up with gpt4all-chat (one-click setup, available in in-app download menu)](https://gpt4all.io/index.html)
+- [Set up with llama.cpp](https://github.com/ggerganov/llama.cpp)
+- [Set up with oobabooga/text-generation-webui](https://github.com/oobabooga/text-generation-webui/blob/main/docs/llama.cpp-models.md)
+### Info
+- Main files are based on v1.1 release
+  - See changelog below
+  - Use prompt template: ```HUMAN: <prompt> ASSISTANT: <response>```
+- Uncensored files are based on v0 release
+  - Use prompt template: ```### User: <prompt> ### Assistant: <response>```
 - PR #896 was used for q4_0. Everything else is latest as of upload time.
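
The two prompt templates listed under Info can be sketched as a small helper. This is a minimal illustration, not part of the commit: the template strings are copied from the README, while `TEMPLATES`, `build_prompt`, and the variant keys are hypothetical names. It assumes the response slot is left empty for the model to complete; whether extra whitespace or newlines between turns are needed is not stated in the README.

```python
# Template strings copied from the README. The dictionary name, the variant
# keys, and the function name are hypothetical and chosen for illustration.
TEMPLATES = {
    "v1.1": "HUMAN: {prompt} ASSISTANT:",
    "v0-uncensored": "### User: {prompt} ### Assistant:",
}


def build_prompt(user_text: str, variant: str = "v1.1") -> str:
    """Fill the template for `variant`, leaving the response for the model to write."""
    return TEMPLATES[variant].format(prompt=user_text.strip())


if __name__ == "__main__":
    print(build_prompt("Name three primary colors."))
    print(build_prompt("Name three primary colors.", variant="v0-uncensored"))
```

Keeping the templates in one dictionary makes it harder to send the v1.1 syntax to the uncensored files, or the reverse.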
+### Quantization
+Several quantization methods are supported. They differ in the resulting model disk size and inference speed.
+Model | F16 | Q4_0 | Q4_1 | Q4_2 | Q4_3 | Q5_0 | Q5_1 | Q8_0
+-- | -- | -- | -- | -- | -- | -- | -- | --
+7B (ppl) | 5.9565 | 6.2103 | 6.1286 | 6.1698 | 6.0617 | 6.0139 | 5.9934 | 5.9571
+7B (size) | 13.0G | 4.0G | 4.8G | 4.0G | 4.8G | 4.4G | 4.8G | 7.1G
+7B (ms/tok @ 4th) | 128 | 56 | 61 | 84 | 91 | 91 | 95 | 75
+7B (ms/tok @ 8th) | 128 | 47 | 55 | 48 | 53 | 53 | 59 | 75
+7B (bpw) | 16.0 | 5.0 | 6.0 | 5.0 | 6.0 | 5.5 | 6.0 | 9.0
+-- | -- | -- | -- | -- | -- | -- | -- | --
+13B (ppl) | 5.2455 | 5.3748 | 5.3471 | 5.3433 | 5.3234 | 5.2768 | 5.2582 | 5.2458
+13B (size) | 25.0G | 7.6G | 9.1G | 7.6G | 9.1G | 8.4G | 9.1G | 14G
+13B (ms/tok @ 4th) | 239 | 104 | 113 | 160 | 175 | 176 | 185 | 141
+13B (ms/tok @ 8th) | 240 | 85 | 99 | 97 | 114 | 108 | 117 | 147
+13B (bpw) | 16.0 | 5.0 | 6.0 | 5.0 | 6.0 | 5.5 | 6.0 | 9.0
+q5_0 and q5_1 are the newest quantization formats and the most accurate of the 4- and 5-bit options in the table above. q5_1 is slightly more accurate than q5_0 but a bit slower. Most users should use one of the two.
+If you run into compatibility issues, try the older q4_x files.
 ---
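
To make the trade-off in the quantization table easier to read, the sketch below expresses each 7B format's perplexity as a percentage increase over F16 and checks the listed file size against the bits-per-weight column. The numbers are copied from the 7B rows of the table. The names `ROWS_7B`, `ppl_increase_pct`, and `size_from_bpw` are hypothetical, and scaling the F16 size by `bpw / 16` is only an approximation.

```python
# 7B rows copied from the table in the README:
# (perplexity, size in GB, ms/token at 8 threads, bits per weight)
ROWS_7B = {
    "F16":  (5.9565, 13.0, 128, 16.0),
    "Q4_0": (6.2103, 4.0, 47, 5.0),
    "Q4_1": (6.1286, 4.8, 55, 6.0),
    "Q4_2": (6.1698, 4.0, 48, 5.0),
    "Q4_3": (6.0617, 4.8, 53, 6.0),
    "Q5_0": (6.0139, 4.4, 53, 5.5),
    "Q5_1": (5.9934, 4.8, 59, 6.0),
    "Q8_0": (5.9571, 7.1, 75, 9.0),
}


def ppl_increase_pct(fmt: str) -> float:
    """Perplexity increase of `fmt` relative to F16, in percent."""
    base = ROWS_7B["F16"][0]
    return 100.0 * (ROWS_7B[fmt][0] - base) / base


def size_from_bpw(fmt: str) -> float:
    """Approximate size in GB, scaling the F16 size by bpw / 16."""
    f16_size = ROWS_7B["F16"][1]
    return f16_size * ROWS_7B[fmt][3] / 16.0


if __name__ == "__main__":
    for fmt, (ppl, size, ms, bpw) in ROWS_7B.items():
        print(
            f"{fmt:5s} ppl +{ppl_increase_pct(fmt):5.2f}%  "
            f"listed {size:4.1f}G  estimated {size_from_bpw(fmt):4.1f}G  {ms:3d} ms/tok"
        )
```

On these numbers, q5_1 stays within about 1% of the F16 perplexity and q4_0 is roughly 4% worse, which is consistent with the recommendation to prefer the q5 files.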