	---
	license: cc-by-nc-4.0
	---

	# QLoRA Instruction Tuned Models

	| [Paper](https://arxiv.org/abs/2305.14314) | [Code](https://github.com/artidoro/qlora) |

	The `LLaMA-2 QLoRA OpenOrca` models are open-source models obtained through 4-bit QLoRA tuning of LLaMA-2 base models on 240k examples of OpenOrca.

	⚠️ These models are purely intended for research purposes and could produce problematic outputs.

	## What are QLoRA Instruction Tuned Models and why use them?
	- Strong performance on MMLU following the QLoRA instruction tuning.
	- Replicable and efficient instruction tuning procedure that can be extended to new use cases. QLoRA training scripts are available in the [QLoRA repo](https://github.com/artidoro/qlora).
	- Rigorous comparison to 16-bit methods (both 16-bit full-finetuning and LoRA) in [our paper](https://arxiv.org/abs/2305.14314) demonstrates the effectiveness of 4-bit QLoRA finetuning.
	- Lightweight checkpoints which only contain adapter weights.

	## License and Intended Use
	Note that using these adapter weights requires access to the LLaMA-2 model weights, so they should be used according to the LLaMA-2 license.

	## Usage
	Here is an example of how to load the model in 4-bit precision:
	```python
	import torch
	from peft import PeftModel
	from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

	model_name = "meta-llama/Llama-2-70b-hf"
	adapters_name = "uwnlp/llama-2-70b-qlora-openorca"

	# Load the base model with NF4 4-bit quantization. The quantization settings
	# are passed once, through quantization_config, rather than also as a
	# separate load_in_4bit keyword argument.
	model = AutoModelForCausalLM.from_pretrained(
	    model_name,
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	    quantization_config=BitsAndBytesConfig(
	        load_in_4bit=True,
	        bnb_4bit_compute_dtype=torch.bfloat16,
	        bnb_4bit_use_double_quant=True,
	        bnb_4bit_quant_type="nf4",
	    ),
	)
	# Attach the LoRA adapter weights on top of the quantized base model.
	model = PeftModel.from_pretrained(model, adapters_name)
	tokenizer = AutoTokenizer.from_pretrained(model_name)
	```
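
As a rough back-of-the-envelope check on why 4-bit loading matters at this scale, the sketch below estimates the storage needed for the base model weights alone. It assumes a nominal 70 billion parameters and ignores quantization constants, layers kept in higher precision, activations, the KV cache, and the adapter weights, so real memory use will be higher. The helper name `weight_memory_gb` is illustrative and not part of any library.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 70e9  # assumption: nominal parameter count of the 70B base model

bf16_gb = weight_memory_gb(NUM_PARAMS, 16)  # 16 bits per weight
nf4_gb = weight_memory_gb(NUM_PARAMS, 4)    # 4 bits per weight

print(f"bfloat16 weights: ~{bf16_gb:.0f} GB")
print(f"4-bit weights:    ~{nf4_gb:.0f} GB")
```

Under these assumptions the 4-bit weights take about a quarter of the space of the bfloat16 weights.
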
	Inference can then be performed as usual with HF models as follows:
	```python
	prompt = "Introduce yourself"
	# Adjacent string literals are concatenated, so the first one ends with a
	# space to keep the two sentences separated.
	formatted_prompt = (
	    "A chat between a curious human and an artificial intelligence assistant. "
	    "The assistant gives helpful, detailed, and polite answers to the user's questions.\n"
	    f"### Human: {prompt} ### Assistant:"
	)
	inputs = tokenizer(formatted_prompt, return_tensors="pt").to("cuda:0")
	outputs = model.generate(inputs=inputs.input_ids, max_new_tokens=20)
	print(tokenizer.decode(outputs[0], skip_special_tokens=True))
	```
	The output should be similar to the following:
	```
	A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
	### Human: Introduce yourself ### Assistant: I am an artificial intelligence assistant. I am here to help you with any questions you may have.
	```
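
Because the decoded output repeats the prompt, it can help to wrap the prompt format in small helpers. The following is a minimal sketch, not part of the QLoRA codebase: `build_prompt` and `extract_reply` are hypothetical names, and cutting the reply at a follow-up `### Human:` marker is an assumption about how one might trim text generated past the first answer.

```python
SYSTEM_PROMPT = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the user's questions.\n"
)
HUMAN_TAG = "### Human:"
ASSISTANT_TAG = "### Assistant:"


def build_prompt(user_message: str) -> str:
    """Format a single-turn prompt in the style shown above."""
    return f"{SYSTEM_PROMPT}{HUMAN_TAG} {user_message} {ASSISTANT_TAG}"


def extract_reply(decoded: str) -> str:
    """Return the text after the first assistant tag, cut at any later human tag."""
    _, _, after = decoded.partition(ASSISTANT_TAG)
    return after.split(HUMAN_TAG, 1)[0].strip()


# Stand-in for tokenizer.decode(...) output, so the sketch runs without a model.
decoded = build_prompt("Introduce yourself") + " I am an artificial intelligence assistant."
reply = extract_reply(decoded)
print(reply)
```

With a loaded model, `build_prompt(prompt)` would replace the hand-written `formatted_prompt`, and `extract_reply` would be applied to the decoded string.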

	## Model Card
	Architecture: The models released here are LoRA adapters to be used on top of LLaMA-2 models. The adapters are added to all linear layers. For all model sizes, we use $r=64$.
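
To give a sense of how small a rank-64 adapter is relative to the layer it modifies, here is a short sketch of the standard LoRA parameter count for one linear layer: a down-projection of shape `r x d_in` plus an up-projection of shape `d_out x r`. The 8192 x 8192 layer size is an illustrative assumption, and `lora_param_count` is a hypothetical helper name.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


d = 8192  # assumption: an illustrative square projection size
r = 64    # LoRA rank stated in this model card

full = d * d                       # parameters in the frozen linear layer
added = lora_param_count(d, d, r)  # trainable adapter parameters

print(f"frozen: {full:,}  adapter: {added:,}  ratio: {added / full:.4f}")
```

For a square layer the ratio reduces to `2 * r / d`, which is why the adapter checkpoints stay lightweight.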

	Base Model: These models use LLaMA-2 as the base model. LLaMA-2 is a causal language model pretrained on a large corpus of text. See the [LLaMA-2 paper](https://arxiv.org/abs/2307.09288) for more details. Note that these models can inherit biases and limitations of the base model.

	Finetuning Data: These models are finetuned on 240k examples of the [OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca) dataset.


	Languages: The different datasets cover different languages. We direct readers to the papers and resources describing each dataset for more details.

	Next, we describe Training and Evaluation details.

	### Training
	QLoRA Instruction Tuned Models are the result of 4-bit QLoRA supervised finetuning on different instruction tuning datasets.

	All models use the NormalFloat4 datatype for the base model and LoRA adapters on all linear layers, with BFloat16 as the computation datatype. We set LoRA $r=64$, $\alpha=16$. We also use an Adam beta2 of 0.999, a max grad norm of 0.3, and a LoRA dropout of 0.1 for models up to 13B and 0.05 for 33B and 65B/70B models.
	For the finetuning process, we use a constant learning rate schedule and the paged AdamW optimizer.
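
The settings above can be collected in one place, as in the sketch below. This is not the configuration object used by the QLoRA training scripts: the function name and dictionary keys are hypothetical, and treating every model larger than 13B as using the 0.05 dropout is an assumption for sizes the text does not list. The last line computes the usual LoRA scaling factor $\alpha / r$.

```python
def qlora_hparams(model_size_b: float) -> dict:
    """Gather the stated settings for a model of the given size (billions of parameters)."""
    return {
        "quant_type": "nf4",            # NormalFloat4 storage for the base model
        "compute_dtype": "bfloat16",    # computation datatype
        "lora_r": 64,
        "lora_alpha": 16,
        # Stated: 0.1 up to 13B, 0.05 for 33B and 65B/70B.
        # Assumption: the 13B threshold also covers sizes not listed.
        "lora_dropout": 0.1 if model_size_b <= 13 else 0.05,
        "adam_beta2": 0.999,
        "max_grad_norm": 0.3,
        "lr_schedule": "constant",
        "optimizer": "paged AdamW",
    }


cfg = qlora_hparams(70)
scaling = cfg["lora_alpha"] / cfg["lora_r"]  # LoRA update is scaled by alpha / r
print(cfg["lora_dropout"], scaling)
```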

	### Training hyperparameters
	| Parameters | Dataset | Batch size | LR | Steps | Source Length | Target Length |
	|------------|----------|------------|------|-------|---------------|---------------|
	| 7B | All | 16 | 2e-4 | 10000 | 384 | 128 |
	| 13B | All | 16 | 2e-4 | 10000 | 384 | 128 |
	| 70B | All | 64 | 1e-4 | 2500 | 384 | 128 |
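
A quick piece of arithmetic on the table: multiplying batch size by steps gives the number of training sequences processed in each run. The sketch assumes the batch size column is the effective batch size per optimizer step and that the source and target lengths are maximum lengths in tokens. Neither assumption is stated explicitly in the table.

```python
# (model size, batch size, steps) copied from the table above
runs = [("7B", 16, 10000), ("13B", 16, 10000), ("70B", 64, 2500)]
SOURCE_LEN, TARGET_LEN = 384, 128  # assumed to be maximum lengths in tokens

sequences_per_run = {}
for name, batch_size, steps in runs:
    sequences = batch_size * steps
    sequences_per_run[name] = sequences
    max_tokens = sequences * (SOURCE_LEN + TARGET_LEN)
    print(f"{name}: {sequences:,} sequences, at most {max_tokens:,} tokens")
```

Under these assumptions all three configurations process the same number of sequences: the 70B run uses a batch four times larger for a quarter of the steps.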

	### Evaluation
	We use the MMLU benchmark to measure performance on a range of language understanding tasks. This is a multiple-choice benchmark covering 57 tasks including elementary mathematics, US history, computer science, law, and more. We report 5-shot test accuracy.
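
To illustrate what a 5-shot multiple-choice evaluation involves, the sketch below builds a prompt from five solved demonstrations followed by one unanswered question, and scores predicted answer letters against gold letters. The prompt layout, the function names, and the toy questions are all assumptions made for illustration. They are not the exact evaluation harness behind the numbers reported here.

```python
from typing import Optional, Sequence, Tuple

LETTERS = "ABCD"
Shot = Tuple[str, Sequence[str], str]  # (question, four options, gold letter)


def format_question(question: str, options: Sequence[str], answer: Optional[str] = None) -> str:
    """Render one item; the gold letter is appended only for demonstrations."""
    lines = [question]
    lines += [f"{letter}. {option}" for letter, option in zip(LETTERS, options)]
    lines.append("Answer:" if answer is None else f"Answer: {answer}")
    return "\n".join(lines)


def build_few_shot_prompt(shots: Sequence[Shot], question: str, options: Sequence[str]) -> str:
    """Concatenate the solved demonstrations, then the unanswered test question."""
    demos = [format_question(q, opts, gold) for q, opts, gold in shots]
    return "\n\n".join(demos + [format_question(question, options)])


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Percentage of predicted letters that match the gold letters."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


# Made-up toy items, only to exercise the functions.
shots = [(f"What is {i} + 1?", [str(i), str(i + 1), str(i + 2), str(i + 3)], "B") for i in range(5)]
prompt = build_few_shot_prompt(shots, "What is 2 + 2?", ["3", "4", "5", "6"])
score = accuracy(["B", "A", "B", "D"], ["B", "B", "B", "D"])
print(prompt)
print(f"accuracy: {score:.1f}")
```

In a real evaluation the predicted letter would come from the model, for example by comparing the scores it assigns to each candidate letter after the final `Answer:`.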

	| Dataset | 7B | 13B | 33B | 65B |
	|---|---|---|---|---|
	| LLaMA-1 no tuning | 35.1 | 46.9 | 57.8 | 63.4 |
	| Self-Instruct | 36.4 | 33.3 | 53.0 | 56.7 |
	| Longform | 32.1 | 43.2 | 56.6 | 59.7 |
	| Chip2 | 34.5 | 41.6 | 53.6 | 59.8 |
	| HH-RLHF | 34.9 | 44.6 | 55.8 | 60.1 |
	| Unnatural Instruct | 41.9 | 48.1 | 57.3 | 61.3 |
	| OASST1 (Guanaco) | 36.6 | 46.4 | 57.0 | 62.2 |
	| Alpaca | 38.8 | 47.8 | 57.3 | 62.5 |
	| FLAN v2 | 44.5 | 51.4 | 59.2 | 63.9 |

	| Dataset | 7B | 13B | 34B | 70B |
	|---|---|---|---|---|
	| LLaMA-2 no tuning | 45.3 | 54.8 | 62.6 | 68.9 |
	| OpenOrca | 45.0 | | | 69.0 |


	## Citation

	```bibtex
	@article{dettmers2023qlora,
	  title={QLoRA: Efficient Finetuning of Quantized LLMs},
	  author={Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke},
	  journal={arXiv preprint arXiv:2305.14314},
	  year={2023}
	}
	```