Initial GPTQ model commit
README.md

- Model creator: [Jon Durbin](https://huggingface.co/jondurbin)
- Original model: [Airoboros L2 70B](https://huggingface.co/jondurbin/airoboros-l2-70b-2.1)

<!-- description start -->
## Description

This repo contains GPTQ model files for [Jon Durbin's Airoboros L2 70B](https://huggingface.co/jondurbin/airoboros-l2-70b-2.1).

Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.

<!-- description end -->
<!-- repositories-available start -->
## Repositories available

* [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Airoboros-L2-70B-2.1-GPTQ)
* [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/Airoboros-L2-70B-2.1-GGUF)
* [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference (deprecated)](https://huggingface.co/TheBloke/Airoboros-L2-70B-2.1-GGML)
* [Jon Durbin's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/jondurbin/airoboros-l2-70b-2.1)
<!-- repositories-available end -->

<!-- prompt-template start -->
## Prompt template: Chat

```
A chat
USER: {prompt}
ASSISTANT:
```

<!-- prompt-template end -->

<!-- README_GPTQ.md-provided-files start -->
## Provided files and GPTQ parameters

Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements.

- Bits: The bit size of the quantised model.
- GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
- Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
- Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is default, but 0.1 results in slightly better accuracy.
- GPTQ dataset: The dataset used for quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
- Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.

| Branch | Bits | GS | Act Order | Damp % | GPTQ Dataset | Seq Len | Size | ExLlama | Desc |
| ------ | ---- | -- | --------- | ------ | ------------ | ------- | ---- | ------- | ---- |
| [gptq-3bit--1g-actorder_True](https://huggingface.co/TheBloke/Airoboros-L2-70B-2.1-GPTQ/tree/gptq-3bit--1g-actorder_True) | 3 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 26.77 GB | No | 3-bit, with Act Order and no group size. Lowest possible VRAM requirements. May be lower quality than 3-bit 128g. |
| [gptq-3bit-128g-actorder_True](https://huggingface.co/TheBloke/Airoboros-L2-70B-2.1-GPTQ/tree/gptq-3bit-128g-actorder_True) | 3 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 28.03 GB | No | 3-bit, with group size 128g and act-order. Higher quality than 128g-False but poor AutoGPTQ CUDA speed. |
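
The parameters above are recorded in each branch's `quantize_config.json`, which clients read automatically. As a rough illustration only (field names follow AutoGPTQ's convention; exact values differ per branch), you can inspect a branch's configuration like this:

```python
# Illustrative sketch: inspect the GPTQ parameters recorded for one branch.
import json
from huggingface_hub import hf_hub_download

config_path = hf_hub_download(
    repo_id="TheBloke/Airoboros-L2-70B-2.1-GPTQ",
    filename="quantize_config.json",
    revision="gptq-3bit-128g-actorder_True",  # any branch from the table above
)

with open(config_path) as f:
    quantize_config = json.load(f)

# Expect keys such as "bits", "group_size", "desc_act" and "damp_percent",
# corresponding to the Bits / GS / Act Order / Damp % columns above.
print(quantize_config)
```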

<!-- README_GPTQ.md-provided-files end -->

<!-- README_GPTQ.md-download-from-branches start -->
## How to download from branches

- In text-generation-webui, you can add `:branch` to the end of the download name, eg `TheBloke/Airoboros-L2-70B-2.1-GPTQ:gptq-4bit-32g-actorder_True`
- With Git, you can clone a branch with:
```shell
git clone --single-branch --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/Airoboros-L2-70B-2.1-GPTQ
```
- In Python Transformers code, the branch is the `revision` parameter; see below.
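
You can also fetch a specific branch programmatically. A minimal sketch, assuming the `huggingface-hub` package is installed and using an example destination directory:

```python
# Sketch: download one branch of the repo with huggingface_hub.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="TheBloke/Airoboros-L2-70B-2.1-GPTQ",
    revision="gptq-4bit-32g-actorder_True",  # branch name, as listed in Provided Files
    local_dir="Airoboros-L2-70B-2.1-GPTQ",   # example destination directory
)
```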
<!-- README_GPTQ.md-download-from-branches end -->
<!-- README_GPTQ.md-text-generation-webui start -->
## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui)

Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).

It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.

1. Click the **Model tab**.
2. Under **Download custom model or LoRA**, enter `TheBloke/Airoboros-L2-70B-2.1-GPTQ`.
    - To download from a specific branch, enter for example `TheBloke/Airoboros-L2-70B-2.1-GPTQ:gptq-4bit-32g-actorder_True`
    - see Provided Files above for the list of branches for each option.
3. Click **Download**.
4. The model will start downloading. Once it's finished it will say "Done".
5. In the top left, click the refresh icon next to **Model**.
6. In the **Model** dropdown, choose the model you just downloaded: `Airoboros-L2-70B-2.1-GPTQ`
7. The model will automatically load, and is now ready for use!
8. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
    * Note that you do not need to set GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
9. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
<!-- README_GPTQ.md-text-generation-webui end -->

<!-- README_GPTQ.md-use-from-python start -->
## How to use this GPTQ model from Python code

### Install the necessary packages

Requires: Transformers 4.32.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later.

```shell
pip3 install 'transformers>=4.32.0' 'optimum>=1.12.0'
pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/  # Use cu117 if on CUDA 11.7
```

If you have problems installing AutoGPTQ using the pre-built wheels, install it from source instead:

```shell
pip3 uninstall -y auto-gptq
git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
pip3 install .
```
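
To confirm the requirements above are met, a quick check (a sketch; these are the PyPI distribution names):

```python
# Sketch: print installed versions of the packages required above.
from importlib.metadata import version

for pkg, minimum in [("transformers", "4.32.0"), ("optimum", "1.12.0"), ("auto-gptq", "0.4.2")]:
    print(f"{pkg}: installed {version(pkg)}, require >= {minimum}")
```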

### For CodeLlama models only: you must use Transformers 4.33.0 or later

If 4.33.0 is not yet released when you read this, you will need to install Transformers from source:

```shell
pip3 uninstall -y transformers
pip3 install git+https://github.com/huggingface/transformers.git
```

### You can then use the following code

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

model_name_or_path = "TheBloke/Airoboros-L2-70B-2.1-GPTQ"
# To use a different branch, change revision
# For example: revision="gptq-4bit-32g-actorder_True"
model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
                                             torch_dtype=torch.float16,
                                             device_map="auto",
                                             revision="main")

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

prompt = "Tell me about AI"
prompt_template = f'''A chat
USER: {prompt}
ASSISTANT:
'''

print("\n\n*** Generate:")

# Generation settings below are illustrative; adjust them to your needs
input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, do_sample=True, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))

# Inference can also be done using transformers' pipeline

print("*** Pipeline:")
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    do_sample=True,
    temperature=0.7,
    max_new_tokens=512
)

print(pipe(prompt_template)[0]['generated_text'])
```
<!-- README_GPTQ.md-use-from-python end -->

<!-- README_GPTQ.md-compatibility start -->
## Compatibility

The files provided are tested to work with AutoGPTQ, both via Transformers and using AutoGPTQ directly. They should also work with [Occ4m's GPTQ-for-LLaMa fork](https://github.com/0cc4m/KoboldAI).

[ExLlama](https://github.com/turboderp/exllama) is compatible with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility.

[Huggingface Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) is compatible with all GPTQ models.
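
As an illustration only (deploying TGI itself is not covered here), querying a TGI server that you have already started with this model might look like the following sketch, assuming a recent `huggingface_hub` and a server listening on an example local endpoint:

```python
# Sketch: query a TGI server assumed to be serving this model at an example endpoint.
from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")  # example endpoint; adjust to your deployment

prompt_template = "A chat\nUSER: Tell me about AI\nASSISTANT:"
print(client.text_generation(prompt_template, max_new_tokens=256))
```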
<!-- README_GPTQ.md-compatibility end -->

<!-- footer start -->
<!-- 200823 -->

### Overview

__*NOTE: The weights have been re-uploaded as of 2023-08-28 06:57PM EST*__

__*I re-merged the adapter weights (info here: https://twitter.com/jon_durbin/status/1696243076178571474)*__

This is an instruction fine-tuned llama-2 model, using synthetic data generated by [airoboros](https://github.com/jondurbin/airoboros).

- laws vary widely based on time and location
- language model may conflate certain words with laws, e.g. it may think "stealing eggs from a chicken" is illegal
- these models just produce text, what you do with that text is your responsibility
- many people and industries deal with "sensitive" content; imagine if a court stenographer's equipment filtered illegal content - it would be useless

### Prompt format