|
--- |
|
license: cc-by-nc-4.0 |
|
--- |
|
|
|
# QLoRA Instruction Tuned Models |
|
|
|
| [Paper](https://arxiv.org/abs/2305.14314) | [Code](https://github.com/artidoro/qlora) | |
|
|
|
**The `LLaMA-2 QLoRA OpenOrca` models are open-source models obtained through 4-bit QLoRA tuning of LLaMA-2 base models on 240k examples of [OpenOrca](https://huggingface.co./datasets/Open-Orca/OpenOrca).**
|
|
|
⚠️ These models are purely intended for research purposes and could produce problematic outputs. |
|
|
|
## What are QLoRA Instruction Tuned Models and why use them? |
|
- **Strong performance on MMLU** following the QLoRA instruction tuning. |
|
- **Replicable and efficient instruction tuning procedure** that can be extended to new use cases. QLoRA training scripts are available in the [QLoRA repo](https://github.com/artidoro/qlora). |
|
- **Rigorous comparison to 16-bit methods** (both 16-bit full-finetuning and LoRA) in [our paper](https://arxiv.org/abs/2305.14314) demonstrates the effectiveness of 4-bit QLoRA finetuning. |
|
- **Lightweight** checkpoints that contain only adapter weights.
|
|
|
## License and Intended Use |
|
Note that the use of these adapter weights requires access to the LLaMA-2 model weights, so they should be used in accordance with the LLaMA-2 license. The adapter weights are trained on data obtained from OpenAI's GPT-3.5 and GPT-4 models (see more details in the Finetuning Data section). As such, any use of these adapters should also follow the terms governing that data.
|
|
|
## Usage |
|
Here is an example of how you would load the model in 4 bits:
|
```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-70b-hf"
adapters_name = "uwnlp/llama-2-70b-qlora-openorca"

# Load the base model in 4-bit NF4 with double quantization,
# computing in bfloat16. Note that 4-bit loading is specified once,
# inside the quantization config (passing `load_in_4bit` both as a
# kwarg and in the config is redundant and rejected by newer transformers).
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
    ),
)

# Attach the QLoRA adapter weights on top of the quantized base model.
model = PeftModel.from_pretrained(model, adapters_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
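As a quick sanity check that the base model was quantized, you can inspect its memory footprint, which should be far smaller than a 16-bit load of the same model. A minimal sketch using the standard `transformers` method:

```python
# Approximate size in GB of the quantized base model plus adapters;
# with 4-bit NF4 weights this should be roughly a quarter of the bf16 footprint.
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.1f} GB")
```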
|
Inference can then be performed as usual with HF models as follows: |
|
```python
question = "Explain Einstein's theory of special relativity."

# Instruction/response prompt template used in this example.
formatted_prompt = f"### Instruction: {question}\n\n### Response:"

inputs = tokenizer(formatted_prompt, return_tensors="pt").to("cuda:0")
outputs = model.generate(inputs=inputs.input_ids, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
|
|
|
## Model Card |
|
**Architecture**: The models released here are LoRA adapters to be used on top of LLaMA-2 models. They are added to all linear layers. For all model sizes, we use $r=64$.
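For illustration, this adapter setup corresponds roughly to the following `peft` `LoraConfig`. The target module names are the linear layers of the Hugging Face LLaMA-2 implementation; this is a sketch, not the authors' exact training configuration (which lives in the [QLoRA repo](https://github.com/artidoro/qlora)):

```python
from peft import LoraConfig

# Sketch of the adapter configuration described above.
lora_config = LoraConfig(
    r=64,              # LoRA rank, identical across model sizes
    lora_alpha=16,     # scaling factor (see Training section)
    lora_dropout=0.1,  # 0.05 for the 33B and 65B/70B models
    bias="none",
    task_type="CAUSAL_LM",
    # All linear layers of the LLaMA-2 transformer block.
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
)
```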
|
|
|
**Base Model**: These models use LLaMA-2 as the base model. LLaMA-2 is a causal language model pretrained on a large corpus of text. See the [LLaMA-2 paper](https://arxiv.org/abs/2307.09288) for more details. Note that these models can inherit the biases and limitations of the base model.
|
|
|
**Finetuning Data**: These models are finetuned on 240k examples of the [OpenOrca](https://huggingface.co./datasets/Open-Orca/OpenOrca) dataset. The OpenOrca dataset is a replica of the [Orca](https://arxiv.org/abs/2306.02707) dataset, which uses FLAN v2 prompts and GPT-3.5/GPT-4 completions.
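The dataset itself can be loaded with the `datasets` library. The exact 240k-example subset used for tuning is not specified here, so the shuffle seed and slice below are illustrative only:

```python
from datasets import load_dataset

# Load OpenOrca and take an illustrative 240k-example subset.
dataset = load_dataset("Open-Orca/OpenOrca", split="train")
subset = dataset.shuffle(seed=42).select(range(240_000))  # seed is arbitrary
print(subset[0])  # each row pairs a FLAN v2-style prompt with a GPT-3.5/GPT-4 completion
```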
|
|
|
|
|
**Languages**: The different datasets cover different languages. We direct readers to the papers and resources describing each dataset for more details.
|
|
|
Next, we describe Training and Evaluation details. |
|
|
|
### Training |
|
QLoRA Instruction Tuned Models are the result of 4-bit QLoRA supervised finetuning on different instruction tuning datasets. |
|
|
|
All models use the NormalFloat4 (NF4) data type for the base model and LoRA adapters on all linear layers, with BFloat16 as the computation data type. We set LoRA $r=64$ and $\alpha=16$. We also use Adam $\beta_2$ of 0.999, a max grad norm of 0.3, and a LoRA dropout of 0.1 for models up to 13B (0.05 for the 33B and 65B/70B models).
|
For the finetuning process, we use a constant learning rate schedule and the paged AdamW optimizer.
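As a sketch, these settings map onto `transformers` `TrainingArguments` roughly as follows, here filled in with the 7B/13B OpenOrca hyperparameters from the table in the next section (the output directory is hypothetical, and the authors' actual training scripts live in the [QLoRA repo](https://github.com/artidoro/qlora)):

```python
from transformers import TrainingArguments

# Illustrative mapping of the reported hyperparameters; not the exact script.
training_args = TrainingArguments(
    output_dir="./qlora-openorca",    # hypothetical output path
    per_device_train_batch_size=16,   # batch size 64 for the 70B model
    learning_rate=2e-4,               # 1e-4 for the 70B model
    max_steps=15000,                  # 3750 for the 70B model
    lr_scheduler_type="constant",     # constant learning rate schedule
    optim="paged_adamw_32bit",        # paged AdamW optimizer
    max_grad_norm=0.3,
    adam_beta2=0.999,
    bf16=True,                        # BFloat16 computation dtype
)
```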
|
|
|
### Training hyperparameters |
|
| Parameters | Dataset | Batch size | LR | Steps | Source Length | Target Length | |
|
|------------|----------|------------|------|-------|---------------|---------------| |
|
| 7B | OpenOrca | 16 | 2e-4 | 15000 | 384 | 128 | |
|
| 13B | OpenOrca | 16 | 2e-4 | 15000 | 384 | 128 | |
|
| 70B | OpenOrca | 64 | 1e-4 | 3750 | 384 | 128 | |
|
|
|
### Evaluation |
|
We use the MMLU benchmark to measure performance on a range of language understanding tasks. This is a multiple-choice benchmark covering 57 tasks including elementary mathematics, US history, computer science, law, and more. We report 5-shot test accuracy. |
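For illustration, here is a minimal sketch of 5-shot MMLU scoring on a single subject, assuming the model and tokenizer from the Usage section are already loaded. The dataset mirror (`cais/mmlu`), subject name, and scoring details are assumptions; the numbers reported below come from a full evaluation run, not this snippet:

```python
import torch
from datasets import load_dataset

subject = "college_computer_science"   # one of the 57 MMLU tasks
data = load_dataset("cais/mmlu", subject)
letters = ["A", "B", "C", "D"]

def format_example(row, include_answer=True):
    text = row["question"] + "\n"
    text += "\n".join(f"{l}. {c}" for l, c in zip(letters, row["choices"]))
    text += "\nAnswer:"
    if include_answer:
        text += f" {letters[row['answer']]}\n\n"
    return text

# The dev split provides the five few-shot demonstrations.
few_shot = "".join(format_example(r) for r in data["dev"].select(range(5)))
letter_ids = [tokenizer(f" {l}", add_special_tokens=False).input_ids[-1] for l in letters]

correct, total = 0, 32                 # small slice for a quick check
for row in data["test"].select(range(total)):
    prompt = few_shot + format_example(row, include_answer=False)
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(ids).logits[0, -1]
    # Predict the answer letter with the highest next-token logit.
    correct += int(logits[letter_ids].argmax()) == row["answer"]
print(f"{subject} 5-shot accuracy on {total} examples: {correct / total:.3f}")
```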
|
|
|
| Dataset | 7B | 13B | 34B | 70B |
|---------|----|-----|-----|-----|
| LLaMA-2 no tuning | 45.3 | 54.8 | 62.6 | 68.9 |
| OpenOrca | 45.0 | | | 69.0 |
|
|
|
For reference, here are the MMLU results of QLoRA finetuning on other datasets (with LLaMA-1 base models):
|
|
|
| Dataset | 7B | 13B | 33B | 65B |
|---------|----|-----|-----|-----|
| LLaMA-1 no tuning | 35.1 | 46.9 | 57.8 | 63.4 |
| Self-Instruct | 36.4 | 33.3 | 53.0 | 56.7 |
| Longform | 32.1 | 43.2 | 56.6 | 59.7 |
| Chip2 | 34.5 | 41.6 | 53.6 | 59.8 |
| HH-RLHF | 34.9 | 44.6 | 55.8 | 60.1 |
| Unnatural Instruct | 41.9 | 48.1 | 57.3 | 61.3 |
| OASST1 (Guanaco) | 36.6 | 46.4 | 57.0 | 62.2 |
| Alpaca | 38.8 | 47.8 | 57.3 | 62.5 |
| FLAN v2 | 44.5 | 51.4 | 59.2 | 63.9 |
|
|
|
|
|
## Citation |
|
|
|
```bibtex
@article{dettmers2023qlora,
  title={QLoRA: Efficient Finetuning of Quantized LLMs},
  author={Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke},
  journal={arXiv preprint arXiv:2305.14314},
  year={2023}
}
```