Update README.md
README.md CHANGED
@@ -29,24 +29,54 @@ It is the result of quantising to 4bit using [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ)
 * [3bit GPTQ model for GPU inference](https://huggingface.co/TheBloke/WizardLM-Uncensored-Falcon-40B-3bit-GPTQ).
 * [Eric's float16 HF format model for GPU inference and further conversions](https://huggingface.co/ehartford/WizardLM-Uncensored-Falcon-40b).
 
+### Prompt template
+
+Prompt format is WizardLM.
+
+```
+What is a falcon? Can I keep one as a pet?
+### Response:
+```
+
 ## EXPERIMENTAL
 
 Please note this is an experimental GPTQ model. Support for it is currently quite limited.
 
 It is also expected to be **VERY SLOW**. This is unavoidable at the moment, but is being looked at.
 
+## text-generation-webui
+
+There is also provisional AutoGPTQ support in text-generation-webui.
+
+This requires a text-generation-webui version of commit `204731952ae59d79ea3805a425c73dd171d943c3` or newer.
+
+So please first update text-generation-webui to the latest version.
+
+### How to download and use this model in text-generation-webui
+
+1. Launch text-generation-webui.
+2. Click the **Model tab**.
+3. Under **Download custom model or LoRA**, enter `TheBloke/WizardLM-Uncensored-Falcon-40B-GPTQ`.
+4. Click **Download**.
+5. Wait until it says it's finished downloading.
+6. Tick **Trust Remote Code**.
+7. Click the **Refresh** icon next to **Model** in the top left.
+8. In the **Model drop-down**, choose the model you just downloaded: `WizardLM-Uncensored-Falcon-40B-GPTQ`.
+9. Once it says it's loaded, click the **Text Generation tab** and enter a prompt!
+
+## Python inference
+
 To use it you will require:
 
-1.
-2.
-3. Pytorch Stable with CUDA 11.8 (`pip install torch --index-url https://download.pytorch.org/whl/cu118`)
-4. einops (`pip install einops`)
+1. AutoGPTQ v0.2.1 (see below)
+2. PyTorch 2.0.0 with CUDA 11.7 or 11.8 (e.g. `pip install torch --index-url https://download.pytorch.org/whl/cu118`)
+3. einops (`pip install einops`)
 
-You can then use it immediately from Python code - see example code below - or from text-generation-webui.
-
 ## AutoGPTQ
 
-You should install AutoGPTQ of version v0.2.1
+You should install AutoGPTQ version v0.2.1. There are currently problems with automatic installation via `pip install auto-gptq`.
+
+Therefore it is recommended to compile manually from source:
 
 ```
 git clone https://github.com/PanQiWei/AutoGPTQ
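The prompt template added in this hunk can also be applied from code. A minimal sketch of assembling a WizardLM-style prompt in Python; the variable names are illustrative and do not come from the README:

```
# WizardLM format: the bare instruction, then a "### Response:" cue.
instruction = "What is a falcon? Can I keep one as a pet?"
prompt = f"{instruction}\n### Response:"
```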
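The compile-from-source block is cut off at the hunk boundary above. The steps that typically complete it would look like the following; this is a sketch assuming the AutoGPTQ repo carries a `v0.2.1` tag, not a quote of the elided README lines:

```
git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
git checkout v0.2.1   # assumed tag name for the v0.2.1 release
pip install .
```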
@@ -85,25 +115,6 @@ output = model.generate(input_ids=tokens, max_new_tokens=100, do_sample=True, te
 print(tokenizer.decode(output[0]))
 ```
 
-## text-generation-webui
-
-There is also provisional AutoGPTQ support in text-generation-webui.
-
-This requires a text-generation-webui version of commit `204731952ae59d79ea3805a425c73dd171d943c3` or newer.
-
-So please first update text-generation-webui to the latest version.
-
-### How to download and use this model in text-generation-webui
-
-1. Launch text-generation-webui with the following command-line arguments: `--autogptq --trust-remote-code`
-2. Click the **Model tab**.
-3. Under **Download custom model or LoRA**, enter `TheBloke/WizardLM-Uncensored-Falcon-40B-GPTQ`.
-4. Click **Download**.
-5. Wait until it says it's finished downloading.
-6. Click the **Refresh** icon next to **Model** in the top left.
-7. In the **Model drop-down**: choose the model you just downloaded, `WizardLM-Uncensored-Falcon-40B-GPTQ`.
-8. Once it says it's loaded, click the **Text Generation tab** and enter a prompt!
-
 ## Provided files
 
 **gptq_model-4bit--1g.safetensors**
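The README's Python example is mostly elided between the hunks; only its closing lines survive as diff context above. A minimal sketch of what AutoGPTQ v0.2.x inference for this model generally looks like - the generation parameters and variable names are illustrative, and the README's actual example may differ:

```
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/WizardLM-Uncensored-Falcon-40B-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoGPTQForCausalLM.from_quantized(
    model_name_or_path,
    device="cuda:0",
    use_safetensors=True,    # the repo provides a .safetensors file
    use_triton=False,
    trust_remote_code=True,  # required for Falcon's custom modelling code
)

# WizardLM prompt format, as shown in the prompt-template section.
prompt = "What is a falcon? Can I keep one as a pet?\n### Response:"
tokens = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")

# Sampling settings are illustrative.
output = model.generate(input_ids=tokens, max_new_tokens=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0]))
```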
@@ -122,22 +133,13 @@ It was created without group_size to reduce VRAM usage, and with `desc_act` (act
 
 ## FAQ
 
-### Prompt template
-
-Prompt format is WizardLM.
-
-```
-What is a falcon? Can I keep one as a pet?
-### Response:
-```
-
 ### About `trust-remote-code`
 
 Please be aware that this command line argument causes Python code provided by Falcon to be executed on your machine.
 
 This code is required at the moment because Falcon is too new to be supported by Hugging Face transformers. At some point in the future transformers will support the model natively, and then `trust_remote_code` will no longer be needed.
 
-In this repo you can see two `.py` files - these are the files that get executed. They are copied from the base repo at [Falcon-
+In this repo you can see two `.py` files - these are the files that get executed. They are copied from the base repo at [Falcon-40B-Instruct](https://huggingface.co/tiiuae/falcon-40b-instruct).
 
 <!-- footer start -->
 ## Discord
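Concretely, `trust_remote_code` is the standard transformers opt-in for running the `.py` files shipped in a model repo. A small illustration, not taken from the README:

```
from transformers import AutoTokenizer

# trust_remote_code=True permits the custom modelling code in the repo to run.
tokenizer = AutoTokenizer.from_pretrained(
    "TheBloke/WizardLM-Uncensored-Falcon-40B-GPTQ",
    trust_remote_code=True,
)
```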
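The last hunk header mentions the quantisation settings behind the provided file, `gptq_model-4bit--1g.safetensors`, where `-1g` reflects a group_size of -1. A sketch of the corresponding AutoGPTQ configuration; the exact parameters used to create the file are an assumption:

```
from auto_gptq import BaseQuantizeConfig

# group_size=-1 means no grouping (lower VRAM); desc_act=True enables act-order.
quantize_config = BaseQuantizeConfig(bits=4, group_size=-1, desc_act=True)
```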