TheBloke committed
Commit 2ba0415
1 Parent(s): 853ed59

Initial GPTQ model commit

Files changed (1)
  1. README.md +54 -40

README.md CHANGED
@@ -31,19 +31,24 @@ quantized_by: TheBloke
 - Model creator: [Jon Durbin](https://huggingface.co/jondurbin)
 - Original model: [Airoboros L2 70B](https://huggingface.co/jondurbin/airoboros-l2-70b-2.1)
 
 ## Description
 
 This repo contains GPTQ model files for [Jon Durbin's Airoboros L2 70B](https://huggingface.co/jondurbin/airoboros-l2-70b-2.1).
 
 Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.
 
 ## Repositories available
 
 * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Airoboros-L2-70B-2.1-GPTQ)
 * [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/Airoboros-L2-70B-2.1-GGUF)
 * [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference (deprecated)](https://huggingface.co/TheBloke/Airoboros-L2-70B-2.1-GGML)
 * [Jon Durbin's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/jondurbin/airoboros-l2-70b-2.1)
 
 ## Prompt template: Chat
 
 ```
@@ -53,6 +58,9 @@ ASSISTANT:
 
 ```
 
 ## Provided files and GPTQ parameters
 
 Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements.
@@ -66,7 +74,7 @@ All GPTQ files are made with AutoGPTQ.
 
 - Bits: The bit size of the quantised model.
 - GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
- - Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have issues with models that use Act Order plus Group Size.
 - Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is default, but 0.1 results in slightly better accuracy.
 - GPTQ dataset: The dataset used for quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
 - Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.
@@ -83,6 +91,9 @@ All GPTQ files are made with AutoGPTQ.
 | [gptq-3bit--1g-actorder_True](https://huggingface.co/TheBloke/Airoboros-L2-70B-2.1-GPTQ/tree/gptq-3bit--1g-actorder_True) | 3 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 26.77 GB | No | 3-bit, with Act Order and no group size. Lowest possible VRAM requirements. May be lower quality than 3-bit 128g. |
 | [gptq-3bit-128g-actorder_True](https://huggingface.co/TheBloke/Airoboros-L2-70B-2.1-GPTQ/tree/gptq-3bit-128g-actorder_True) | 3 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 28.03 GB | No | 3-bit, with group size 128g and act-order. Higher quality than 128g-False but poor AutoGPTQ CUDA speed. |
 
 ## How to download from branches
 
 - In text-generation-webui, you can add `:branch` to the end of the download name, eg `TheBloke/Airoboros-L2-70B-2.1-GPTQ:gptq-4bit-32g-actorder_True`
@@ -91,73 +102,72 @@ All GPTQ files are made with AutoGPTQ.
 git clone --single-branch --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/Airoboros-L2-70B-2.1-GPTQ
 ```
 - In Python Transformers code, the branch is the `revision` parameter; see below.
-
 ## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
 
 Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
 
- It is strongly recommended to use the text-generation-webui one-click-installers unless you know how to make a manual install.
 
 1. Click the **Model tab**.
 2. Under **Download custom model or LoRA**, enter `TheBloke/Airoboros-L2-70B-2.1-GPTQ`.
   - To download from a specific branch, enter for example `TheBloke/Airoboros-L2-70B-2.1-GPTQ:gptq-4bit-32g-actorder_True`
   - see Provided Files above for the list of branches for each option.
 3. Click **Download**.
- 4. The model will start downloading. Once it's finished it will say "Done"
 5. In the top left, click the refresh icon next to **Model**.
 6. In the **Model** dropdown, choose the model you just downloaded: `Airoboros-L2-70B-2.1-GPTQ`
 7. The model will automatically load, and is now ready for use!
 8. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
   * Note that you do not need to set GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
 9. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
 
 ## How to use this GPTQ model from Python code
 
- First make sure you have [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ) 0.3.1 or later installed:
 
- ```
- pip3 install auto-gptq
- ```
 
- If you have problems installing AutoGPTQ, please build from source instead:
 ```
 pip3 uninstall -y auto-gptq
 git clone https://github.com/PanQiWei/AutoGPTQ
 cd AutoGPTQ
 pip3 install .
 ```
 
- Then try the following example code:
 
 ```python
- from transformers import AutoTokenizer, pipeline, logging
- from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig
 
 model_name_or_path = "TheBloke/Airoboros-L2-70B-2.1-GPTQ"
-
- use_triton = False
 
 tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
 
- model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
-                                            use_safetensors=True,
-                                            trust_remote_code=False,
-                                            device="cuda:0",
-                                            use_triton=use_triton,
-                                            quantize_config=None)
-
- """
- # To download from a specific branch, use the revision parameter, as in this example:
- # Note that `revision` requires AutoGPTQ 0.3.1 or later!
-
- model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
-                                            revision="gptq-4bit-32g-actorder_True",
-                                            use_safetensors=True,
-                                            trust_remote_code=False,
-                                            device="cuda:0",
-                                            quantize_config=None)
- """
-
 prompt = "Tell me about AI"
 prompt_template=f'''A chat
 USER: {prompt}
@@ -173,9 +183,6 @@ print(tokenizer.decode(output[0]))
 
 # Inference can also be done using transformers' pipeline
 
- # Prevent printing spurious transformers error when using pipeline with AutoGPTQ
- logging.set_verbosity(logging.CRITICAL)
-
 print("*** Pipeline:")
 pipe = pipeline(
     "text-generation",
@@ -189,12 +196,17 @@ pipe = pipeline(
 
 print(pipe(prompt_template)[0]['generated_text'])
 ```
 
 ## Compatibility
 
- The files provided will work with AutoGPTQ (CUDA and Triton modes), GPTQ-for-LLaMa (only CUDA has been tested), and Occ4m's GPTQ-for-LLaMa fork.
 
- ExLlama works with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility.
 
 <!-- footer start -->
 <!-- 200823 -->
@@ -233,7 +245,9 @@ And thank you again to a16z for their generous grant.
 
 ### Overview
 
- __*I haven't tested this at all yet, quality could be great or absolute trash, I really don't know, but feel free to try.*__
 
 This is an instruction fine-tuned llama-2 model, using synthetic data generated by [airoboros](https://github.com/jondurbin/airoboros)
 
@@ -253,7 +267,7 @@ This is an instruction fine-tuned llama-2 model, using synthetic data generated
 - laws vary widely based on time and location
 - language model may conflate certain words with laws, e.g. it may think "stealing eggs from a chicken" is illegal
 - these models just produce text, what you do with that text is your responsibility
- - many people and industries deal with "sensitive" content; imagine if a court stenographer's eqipment filtered illegal content - it would be useless
 
 ### Prompt format
 
 - Model creator: [Jon Durbin](https://huggingface.co/jondurbin)
 - Original model: [Airoboros L2 70B](https://huggingface.co/jondurbin/airoboros-l2-70b-2.1)
 
+ <!-- description start -->
 ## Description
 
 This repo contains GPTQ model files for [Jon Durbin's Airoboros L2 70B](https://huggingface.co/jondurbin/airoboros-l2-70b-2.1).
 
 Multiple GPTQ parameter permutations are provided; see Provided Files below for details of the options provided, their parameters, and the software used to create them.
 
+ <!-- description end -->
+ <!-- repositories-available start -->
 ## Repositories available
 
 * [GPTQ models for GPU inference, with multiple quantisation parameter options.](https://huggingface.co/TheBloke/Airoboros-L2-70B-2.1-GPTQ)
 * [2, 3, 4, 5, 6 and 8-bit GGUF models for CPU+GPU inference](https://huggingface.co/TheBloke/Airoboros-L2-70B-2.1-GGUF)
 * [2, 3, 4, 5, 6 and 8-bit GGML models for CPU+GPU inference (deprecated)](https://huggingface.co/TheBloke/Airoboros-L2-70B-2.1-GGML)
 * [Jon Durbin's original unquantised fp16 model in pytorch format, for GPU inference and for further conversions](https://huggingface.co/jondurbin/airoboros-l2-70b-2.1)
+ <!-- repositories-available end -->
 
+ <!-- prompt-template start -->
 ## Prompt template: Chat
 
 ```
 
 ```
 
+ <!-- prompt-template end -->
+
+ <!-- README_GPTQ.md-provided-files start -->
 ## Provided files and GPTQ parameters
 
 Multiple quantisation parameters are provided, to allow you to choose the best one for your hardware and requirements.
 
 
 - Bits: The bit size of the quantised model.
 - GS: GPTQ group size. Higher numbers use less VRAM, but have lower quantisation accuracy. "None" is the lowest possible value.
+ - Act Order: True or False. Also known as `desc_act`. True results in better quantisation accuracy. Some GPTQ clients have had issues with models that use Act Order plus Group Size, but this is generally resolved now.
 - Damp %: A GPTQ parameter that affects how samples are processed for quantisation. 0.01 is default, but 0.1 results in slightly better accuracy.
 - GPTQ dataset: The dataset used for quantisation. Using a dataset more appropriate to the model's training can improve quantisation accuracy. Note that the GPTQ dataset is not the same as the dataset used to train the model - please refer to the original model repo for details of the training dataset(s).
 - Sequence Length: The length of the dataset sequences used for quantisation. Ideally this is the same as the model sequence length. For some very long sequence models (16+K), a lower sequence length may have to be used. Note that a lower sequence length does not limit the sequence length of the quantised model. It only impacts the quantisation accuracy on longer inference sequences.
 
 | [gptq-3bit--1g-actorder_True](https://huggingface.co/TheBloke/Airoboros-L2-70B-2.1-GPTQ/tree/gptq-3bit--1g-actorder_True) | 3 | None | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 26.77 GB | No | 3-bit, with Act Order and no group size. Lowest possible VRAM requirements. May be lower quality than 3-bit 128g. |
 | [gptq-3bit-128g-actorder_True](https://huggingface.co/TheBloke/Airoboros-L2-70B-2.1-GPTQ/tree/gptq-3bit-128g-actorder_True) | 3 | 128 | Yes | 0.1 | [wikitext](https://huggingface.co/datasets/wikitext/viewer/wikitext-2-v1/test) | 4096 | 28.03 GB | No | 3-bit, with group size 128g and act-order. Higher quality than 128g-False but poor AutoGPTQ CUDA speed. |
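
If you want to check which quantisation parameters a given branch was built with, one option is to read that branch's `quantize_config.json` directly. A minimal sketch using `huggingface_hub` (the exact key names can vary between AutoGPTQ versions, so treat them as typical rather than guaranteed):

```python
import json
from huggingface_hub import hf_hub_download

# Fetch only the quantisation config from a chosen branch
config_path = hf_hub_download(
    repo_id="TheBloke/Airoboros-L2-70B-2.1-GPTQ",
    filename="quantize_config.json",
    revision="gptq-3bit-128g-actorder_True",  # any branch from the table above
)

with open(config_path) as f:
    # Typically contains keys such as bits, group_size, desc_act and damp_percent,
    # matching the Bits / GS / Act Order / Damp % columns described above
    print(json.dumps(json.load(f), indent=2))
```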
 
+ <!-- README_GPTQ.md-provided-files end -->
+
+ <!-- README_GPTQ.md-download-from-branches start -->
 ## How to download from branches
 
 - In text-generation-webui, you can add `:branch` to the end of the download name, eg `TheBloke/Airoboros-L2-70B-2.1-GPTQ:gptq-4bit-32g-actorder_True`
 
 git clone --single-branch --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/Airoboros-L2-70B-2.1-GPTQ
 ```
 - In Python Transformers code, the branch is the `revision` parameter; see below.
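
As an alternative to `git clone`, a specific branch can also be fetched from Python with `huggingface_hub`'s `snapshot_download`; a brief sketch, with the branch name shown purely as an illustration:

```python
from huggingface_hub import snapshot_download

# Download the whole gptq-4bit-32g-actorder_True branch into the local HF cache
local_dir = snapshot_download(
    repo_id="TheBloke/Airoboros-L2-70B-2.1-GPTQ",
    revision="gptq-4bit-32g-actorder_True",
)
print(local_dir)
```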
+ <!-- README_GPTQ.md-download-from-branches end -->
+ <!-- README_GPTQ.md-text-generation-webui start -->
 ## How to easily download and use this model in [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
 
 Please make sure you're using the latest version of [text-generation-webui](https://github.com/oobabooga/text-generation-webui).
 
+ It is strongly recommended to use the text-generation-webui one-click-installers unless you're sure you know how to make a manual install.
 
 1. Click the **Model tab**.
 2. Under **Download custom model or LoRA**, enter `TheBloke/Airoboros-L2-70B-2.1-GPTQ`.
   - To download from a specific branch, enter for example `TheBloke/Airoboros-L2-70B-2.1-GPTQ:gptq-4bit-32g-actorder_True`
   - see Provided Files above for the list of branches for each option.
 3. Click **Download**.
+ 4. The model will start downloading. Once it's finished it will say "Done".
 5. In the top left, click the refresh icon next to **Model**.
 6. In the **Model** dropdown, choose the model you just downloaded: `Airoboros-L2-70B-2.1-GPTQ`
 7. The model will automatically load, and is now ready for use!
 8. If you want any custom settings, set them and then click **Save settings for this model** followed by **Reload the Model** in the top right.
   * Note that you do not need to set GPTQ parameters any more. These are set automatically from the file `quantize_config.json`.
 9. Once you're ready, click the **Text Generation tab** and enter a prompt to get started!
+ <!-- README_GPTQ.md-text-generation-webui end -->
 
+ <!-- README_GPTQ.md-use-from-python start -->
 ## How to use this GPTQ model from Python code
 
+ ### Install the necessary packages
 
+ Requires: Transformers 4.32.0 or later, Optimum 1.12.0 or later, and AutoGPTQ 0.4.2 or later.
 
+ ```shell
+ pip3 install transformers>=4.32.0 optimum>=1.12.0
+ pip3 install auto-gptq --extra-index-url https://huggingface.github.io/autogptq-index/whl/cu118/ # Use cu117 if on CUDA 11.7
 ```
+
+ If you have problems installing AutoGPTQ using the pre-built wheels, install it from source instead:
+
+ ```shell
 pip3 uninstall -y auto-gptq
 git clone https://github.com/PanQiWei/AutoGPTQ
 cd AutoGPTQ
 pip3 install .
 ```
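
As a quick sanity check, one way to confirm the installed versions meet the requirements above is with `importlib.metadata` (an illustrative snippet, assuming the packages were installed under the names shown):

```python
import importlib.metadata as metadata

# Print the installed version of each package required above
for package, minimum in [("transformers", "4.32.0"), ("optimum", "1.12.0"), ("auto-gptq", "0.4.2")]:
    print(f"{package}: {metadata.version(package)} (need >= {minimum})")
```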
 
+ ### For CodeLlama models only: you must use Transformers 4.33.0 or later.
+
+ If 4.33.0 is not yet released when you read this, you will need to install Transformers from source:
+ ```shell
+ pip3 uninstall -y transformers
+ pip3 install git+https://github.com/huggingface/transformers.git
+ ```
+
+ ### You can then use the following code
 
 ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
+ import torch
 
 model_name_or_path = "TheBloke/Airoboros-L2-70B-2.1-GPTQ"
+ # To use a different branch, change revision
+ # For example: revision="gptq-4bit-32g-actorder_True"
+ model = AutoModelForCausalLM.from_pretrained(model_name_or_path,
+                                              torch_dtype=torch.float16,
+                                              device_map="auto",
+                                              revision="main")
 
 tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)
 
 prompt = "Tell me about AI"
 prompt_template=f'''A chat
 USER: {prompt}
 
 # Inference can also be done using transformers' pipeline
 
 print("*** Pipeline:")
 pipe = pipeline(
     "text-generation",
 
 print(pipe(prompt_template)[0]['generated_text'])
 ```
+ <!-- README_GPTQ.md-use-from-python end -->
 
+ <!-- README_GPTQ.md-compatibility start -->
 ## Compatibility
 
+ The files provided are tested to work with AutoGPTQ, both via Transformers and using AutoGPTQ directly. They should also work with [Occ4m's GPTQ-for-LLaMa fork](https://github.com/0cc4m/KoboldAI).
+
+ [ExLlama](https://github.com/turboderp/exllama) is compatible with Llama models in 4-bit. Please see the Provided Files table above for per-file compatibility.
 
+ [Huggingface Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference) is compatible with all GPTQ models.
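
For example, once a TGI server is serving this model, it could be queried from Python with `huggingface_hub`'s `InferenceClient`; a hedged sketch, assuming a local endpoint at `http://localhost:8080`:

```python
from huggingface_hub import InferenceClient

# Assumes a TGI server is already serving this GPTQ model at this address
client = InferenceClient("http://localhost:8080")

prompt_template = "A chat\nUSER: Tell me about AI\nASSISTANT: "
print(client.text_generation(prompt_template, max_new_tokens=512, temperature=0.7))
```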
+ <!-- README_GPTQ.md-compatibility end -->
 
 <!-- footer start -->
 <!-- 200823 -->
 
 ### Overview
 
+ __*NOTE: The weights have been re-uploaded as of 2023-08-28 06:57PM EST*__
+
+ __*I re-merged the adapter weights (info here: https://twitter.com/jon_durbin/status/1696243076178571474)*__
 
 This is an instruction fine-tuned llama-2 model, using synthetic data generated by [airoboros](https://github.com/jondurbin/airoboros)
 
 - laws vary widely based on time and location
 - language model may conflate certain words with laws, e.g. it may think "stealing eggs from a chicken" is illegal
 - these models just produce text, what you do with that text is your responsibility
+ - many people and industries deal with "sensitive" content; imagine if a court stenographer's equipment filtered illegal content - it would be useless
 
 ### Prompt format