QAdapter sequential bottleneck adapter for the Llama-2 7B (meta-llama/Llama-2-7b-hf) model, trained for instruction tuning on the timdettmers/openassistant-guanaco dataset.
This adapter was created for use with the Adapters library.
First, install adapters:

```bash
pip install -U adapters
```
Now, the model and adapter can be loaded and activated like this:
```python
import adapters
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"
adapter_id = "AdapterHub/llama2-7b-qadapter-seq-openassistant"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    torch_dtype=torch.bfloat16,
)
adapters.init(model)

adapter_name = model.load_adapter(adapter_id, source="hf", set_active=True)

tokenizer = AutoTokenizer.from_pretrained(model_id)
```
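Optionally, you can check that the adapter was loaded and set as active. The helpers below are part of the Adapters model mixin; the exact output format may vary between library versions:

```python
# Optional check: list loaded adapters and confirm the active setup.
print(model.adapter_summary())  # table of loaded adapters and their parameter counts
print(model.active_adapters)    # the currently active adapter composition
```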
Inference can be done via standard methods built into the Transformers library. We first add some helper code to prompt the model with the template it was trained on:
```python
from transformers import StoppingCriteria

# stop if model starts to generate "### Human:"
class EosListStoppingCriteria(StoppingCriteria):
    def __init__(self, eos_sequence=[12968, 29901]):
        self.eos_sequence = eos_sequence

    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        last_ids = input_ids[:, -len(self.eos_sequence):].tolist()
        return self.eos_sequence in last_ids


def prompt_model(model, text: str):
    batch = tokenizer(f"### Human: {text} ### Assistant:", return_tensors="pt")
    batch = batch.to(model.device)

    with torch.cuda.amp.autocast():
        output_tokens = model.generate(**batch, stopping_criteria=[EosListStoppingCriteria()])

    # skip prompt when decoding
    decoded = tokenizer.decode(output_tokens[0, batch["input_ids"].shape[1]:], skip_special_tokens=True)
    return decoded[:-10] if decoded.endswith("### Human:") else decoded
```
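As a quick sanity check (not part of the original helper code), you can confirm that the hard-coded stop sequence [12968, 29901] corresponds to the "Human:" part of the prompt template under the Llama-2 tokenizer:

```python
# Sanity check (illustrative): the stop sequence should decode back to "Human:".
print(tokenizer.convert_ids_to_tokens([12968, 29901]))  # expected: ['▁Human', ':']
print(tokenizer.decode([12968, 29901]))                  # expected: "Human:"
```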
Now, to prompt the model:
```python
prompt_model(model, "Please explain NLP in simple terms.")
```
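The helper above uses the model's default generation settings. If you want to control output length or sampling, a variant that forwards additional keyword arguments to generate could look like this; prompt_model_with_args and the parameter values are illustrative, not part of the original card:

```python
# Hypothetical variant of prompt_model that forwards extra generation arguments.
def prompt_model_with_args(model, text: str, **generate_kwargs):
    batch = tokenizer(f"### Human: {text} ### Assistant:", return_tensors="pt").to(model.device)
    with torch.cuda.amp.autocast():
        output_tokens = model.generate(
            **batch,
            stopping_criteria=[EosListStoppingCriteria()],
            **generate_kwargs,
        )
    decoded = tokenizer.decode(output_tokens[0, batch["input_ids"].shape[1]:], skip_special_tokens=True)
    return decoded[:-10] if decoded.endswith("### Human:") else decoded


prompt_model_with_args(model, "Please explain NLP in simple terms.", max_new_tokens=256, do_sample=True, top_p=0.9)
```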
Training was run with the code in this notebook.
The adapter uses the sequential bottleneck architecture described in Houlsby et al. (2019), which is available in Adapters as double_seq_bn.
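For reference, a new adapter of the same architecture can be added to a freshly initialized model and prepared for training roughly as follows. This is a minimal sketch, not the exact training code from the notebook; the adapter name and the reduction_factor value are assumptions:

```python
import adapters
from adapters import DoubleSeqBnConfig
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
adapters.init(model)

# Sequential (Houlsby-style) bottleneck adapter; reduction_factor is an assumed value.
config = DoubleSeqBnConfig(reduction_factor=16)
model.add_adapter("guanaco_adapter", config=config)

# Freeze the base model weights and activate only the new adapter for training.
model.train_adapter("guanaco_adapter")
```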
The adapter is trained similarly to the Guanaco models proposed in the paper: