bad

#4
by sdyy - opened

from transformers import AutoTokenizer, AutoModelForCausalLM
quantized_model = AutoModelForCausalLM.from_pretrained(
"ISTA-DASLab/Meta-Llama-3-8B-Instruct-AQLM-2Bit-1x16",
torch_dtype="auto", device_map="auto", low_cpu_mem_usage=True,trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

%%time
output = quantized_model.generate(tokenizer("The relationship between humans and AI ", return_tensors="pt")["input_ids"].cuda(), min_new_tokens=128, max_new_tokens=128)
print(tokenizer.decode(output[0]))

output = quantized_model.generate(tokenizer("I'm AQLM, ", return_tensors="pt")["input_ids"].cuda(), min_new_tokens=128, max_new_tokens=128)

print(tokenizer.decode(output[0]))

Training Large Language Models in 2bit with aqlm, transformers and PEFT
Welcome to this notebook, which walks through the recent aqlm integration. It brings 2-bit quantization with minimal performance degradation.

In this notebook, we will learn how to load a large model in 2bit (Mixtral-8x7b) and train it using Google Colab and the PEFT library from Hugging Face 🤗.

Install the aqlm library

It's the only extra dependency to run AQLM models.
Add [gpu] to install the required CUDA specific dependencies.
Install the latest accelerate and transformers releases to properly support it.

%%capture
!pip install "aqlm[gpu]>=1.1.0"
!pip install git+https://github.com/huggingface/peft.git@main
!pip install "accelerate>=0.27.0"
!pip install git+https://github.com/huggingface/transformers.git@main
!pip install datasets
!pip install bitsandbytes # for 8-bit optimizer only
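After the install cell, it can help to confirm which versions actually ended up in the environment, since several of these packages pin each other. The following is a minimal sketch using only the standard library; the helper name `installed_versions` is hypothetical and not part of any of the libraries above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: installed version string, or None if missing}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The package is not installed in this environment.
            found[name] = None
    return found


if __name__ == "__main__":
    names = ["aqlm", "transformers", "accelerate", "peft"]
    for name, ver in installed_versions(names).items():
        print(f"{name}: {ver or 'not installed'}")
```

Running this again after any later `pip install` makes it easier to see whether a package was silently downgraded.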
First let's load the model we are going to use - Mixtral-8x7b! Note that the model itself is around 50GB in half precision

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto", low_cpu_mem_usage=True)

from transformers import pipeline

messages = [
{"role": "user", "content": "Who are you?"},
]
pipe = pipeline("text-generation", model="ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf")
pipe(messages)

Add LoRA

To alter the model's behavior, we have to make it trainable. We can do that by adding a small set of trainable parameters on top of the frozen quantized ones.

from peft import LoraConfig, get_peft_model

config = LoraConfig(
r=8,
lora_alpha=32,
target_modules=["q_prok", "k_proj", "o_proj"],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
model.print_trainable_parameters()
model.enable_input_require_grads() # it's needed for gradient checkpointing
trainable params: 3,407,872 || all params: 6,550,261,760 || trainable%: 0.0520
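Here we add a trainable adapter on top of the targeted attention projection layers. Note that "q_prok" in target_modules looks like a typo for q_proj; as written it matches no layer, so only the k_proj and o_proj linear layers receive adapters.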
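The trainable-parameter count printed above can be checked by hand. The sketch below assumes the usual Mixtral-8x7B shapes (hidden size 4096, 8 key/value heads of dimension 128, 32 layers); these numbers are assumptions and are not stated in the post. Under them, 3,407,872 is exactly the LoRA parameter count for k_proj and o_proj alone, which suggests the "q_prok" entry in target_modules matched no module.

```python
def lora_param_count(in_features, out_features, r):
    """LoRA adds A (r x in_features) and B (out_features x r) per adapted layer."""
    return r * in_features + out_features * r


# Assumed Mixtral-8x7B shapes (not taken from the post).
HIDDEN = 4096
KV_DIM = 8 * 128   # 8 key/value heads of dimension 128
NUM_LAYERS = 32
R = 8              # r from the LoraConfig above

per_layer = {
    "q_proj": lora_param_count(HIDDEN, HIDDEN, R),
    "k_proj": lora_param_count(HIDDEN, KV_DIM, R),
    "o_proj": lora_param_count(HIDDEN, HIDDEN, R),
}

# Adapters on k_proj and o_proj only (what the printed count corresponds to).
k_and_o = NUM_LAYERS * (per_layer["k_proj"] + per_layer["o_proj"])
# Adapters on q_proj, k_proj and o_proj (what a corrected spelling would give).
all_three = NUM_LAYERS * sum(per_layer.values())

print(k_and_o)    # 3407872
print(all_three)  # 5505024
```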

Loading a dataset

Let's load a common dataset, English quotes, to fine-tune our model on famous quotes.

from datasets import load_dataset

data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)

Run the cell below to start the training! For the sake of the demo, we run it for only a few steps, just to showcase how to use this integration with existing tools in the HF ecosystem.

prompt = """how are you?"""

raw_output = pipe(get_prompt(prompt))

parse_text(raw_output)

import transformers

tokenizer.pad_token = tokenizer.eos_token

trainer = transformers.Trainer(
model=model,
train_dataset=data["train"],
args=transformers.TrainingArguments(
per_device_train_batch_size=1,
gradient_accumulation_steps=4,
gradient_checkpointing=True,
warmup_steps=2,
max_steps=10,
learning_rate=2e-4,
fp16=True,
logging_steps=1,
output_dir="outputs",
optim="adamw_bnb_8bit",
logging_first_step=True,
),
data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
trainer.train()
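To make the scale of this demo run concrete, the arithmetic below works out how many samples the trainer sees with the arguments above. It is only a sketch: the single-GPU setting (`num_devices = 1`) is an assumption about the Colab runtime, not something the post states.

```python
# Values copied from the TrainingArguments above.
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
max_steps = 10

num_devices = 1  # assumption: a single Colab GPU

# Samples contributing to each optimizer update.
effective_batch = per_device_train_batch_size * gradient_accumulation_steps * num_devices
# Total samples consumed over the whole demo run.
samples_seen = effective_batch * max_steps

print(effective_batch, samples_seen)  # 4 40
```

With only a few dozen quotes seen, the run demonstrates the plumbing rather than a meaningful change in model behavior.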

from transformers import pipeline, AutoTokenizer

model_id = "ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

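# Define a chat template.
# Only skip the first message when it is a system message, so a lone user message is still rendered.
chat_template = """{% if messages[0]['role'] == 'system' %}{% set loop_messages = messages[1:] %}{{ messages[0]['content'] }}
{% else %}{% set loop_messages = messages %}{% endif %}{% for message in loop_messages %}{{ message['role'] }}: {{ message['content'] }}
{% endfor %}"""

# Set the chat template for the tokenizer
tokenizer.chat_template = chat_template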

messages = [
{"role": "user", "content": "Who are you?"},
]
pipe = pipeline("text-generation", model=model_id, tokenizer=tokenizer) # Pass the tokenizer explicitly
pipe(messages)

output = quantized_model.generate(**tokenizer("hi", return_tensors="pt").to(quantized_model.device))


!huggingface-cli login

_|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
_|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
_|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
_|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
_|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

To log in, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .

Enter your token (input will not be visible):
Add token as git credential? (Y/n) Y
Token is valid (permission: read).
The token 1 has been saved to /root/.cache/huggingface/stored_tokens
Cannot authenticate through git-credential as no helper is defined on your machine.
You might have to re-authenticate when pushing to the Hugging Face Hub.
Run the following command in your terminal in case you want to set the 'store' credential helper as default.

git config --global credential.helper store

Read https://git-scm.com/book/en/v2/Git-Tools-Credential-Storage for more details.
Token has not been saved to git credential helper.
Your token has been saved to /root/.cache/huggingface/token
Login successful.
The current active token is: 1


from transformers import AutoTokenizer, AutoModelForCausalLM
quantized_model = AutoModelForCausalLM.from_pretrained(
"ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf",
torch_dtype="auto", device_map="auto", low_cpu_mem_usage=True, trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf")


%%time
output = quantized_model.generate(tokenizer("The relationship between humans and AI ", return_tensors="pt")["input_ids"].cuda(), min_new_tokens=128, max_new_tokens=128)
print(tokenizer.decode(output[0]))


from transformers import AutoTokenizer, AutoModelForCausalLM

# Define quantized_model and tokenizer in the same cell where they will be used

quantized_model = AutoModelForCausalLM.from_pretrained(
"ISTA-DASLab/Mixtral-8x7b-AQLM-2Bit-1x16-hf",
torch_dtype="auto", device_map="auto", low_cpu_mem_usage=True, trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

%%time

# Now quantized_model and tokenizer are accessible within this cell

output = quantized_model.generate(tokenizer("The relationship between humans and AI ", return_tensors="pt")["input_ids"].cuda(), min_new_tokens=128, max_new_tokens=128)
print(tokenizer.decode(output[0]))

!pip install torch --upgrade
Requirement already satisfied: torch in /usr/local/lib/python3.10/dist-packages (2.5.1+cu121)
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from torch) (3.16.1)
Requirement already satisfied: typing-extensions>=4.8.0 in /usr/local/lib/python3.10/dist-packages (from torch) (4.12.2)
Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch) (3.4.2)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch) (3.1.4)
Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch) (2024.9.0)
Requirement already satisfied: sympy==1.13.1 in /usr/local/lib/python3.10/dist-packages (from torch) (1.13.1)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy==1.13.1->torch) (1.3.0)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch) (3.0.2)

output = quantized_model.generate(tokenizer("The relationship between humans and AI ", return_tensors="pt")["input_ids"].cuda(), min_new_tokens=128, max_new_tokens=128)
print(tokenizer.decode(output[0]))


!pip install aqlm[gpu]==1.0.1
!pip install git+https://github.com/huggingface/accelerate.git@main
!pip install git+https://github.com/BlackSamorez/transformers.git@aqlm
Collecting aqlm==1.0.1 (from aqlm[gpu]==1.0.1)
Downloading aqlm-1.0.1-py3-none-any.whl.metadata (1.6 kB)
Requirement already satisfied: torch>=2.1.1 in /usr/local/lib/python3.10/dist-packages (from aqlm==1.0.1->aqlm[gpu]==1.0.1) (2.5.1+cu121)
Collecting transformers==4.37.0 (from aqlm==1.0.1->aqlm[gpu]==1.0.1)
Downloading transformers-4.37.0-py3-none-any.whl.metadata (129 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 129.4/129.4 kB 3.3 MB/s eta 0:00:00
Requirement already satisfied: triton>=2.1 in /usr/local/lib/python3.10/dist-packages (from aqlm[gpu]==1.0.1) (3.1.0)
Requirement already satisfied: ninja in /usr/local/lib/python3.10/dist-packages (from aqlm[gpu]==1.0.1) (1.11.1.2)
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers==4.37.0->aqlm==1.0.1->aqlm[gpu]==1.0.1) (3.16.1)
Requirement already satisfied: huggingface-hub<1.0,>=0.19.3 in /usr/local/lib/python3.10/dist-packages (from transformers==4.37.0->aqlm==1.0.1->aqlm[gpu]==1.0.1) (0.26.5)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from transformers==4.37.0->aqlm==1.0.1->aqlm[gpu]==1.0.1) (1.26.4)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers==4.37.0->aqlm==1.0.1->aqlm[gpu]==1.0.1) (24.2)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers==4.37.0->aqlm==1.0.1->aqlm[gpu]==1.0.1) (6.0.2)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers==4.37.0->aqlm==1.0.1->aqlm[gpu]==1.0.1) (2024.9.11)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers==4.37.0->aqlm==1.0.1->aqlm[gpu]==1.0.1) (2.32.3)
Collecting tokenizers<0.19,>=0.14 (from transformers==4.37.0->aqlm==1.0.1->aqlm[gpu]==1.0.1)
Downloading tokenizers-0.15.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)
Requirement already satisfied: safetensors>=0.3.1 in /usr/local/lib/python3.10/dist-packages (from transformers==4.37.0->aqlm==1.0.1->aqlm[gpu]==1.0.1) (0.4.5)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers==4.37.0->aqlm==1.0.1->aqlm[gpu]==1.0.1) (4.66.6)
Requirement already satisfied: typing-extensions>=4.8.0 in /usr/local/lib/python3.10/dist-packages (from torch>=2.1.1->aqlm==1.0.1->aqlm[gpu]==1.0.1) (4.12.2)
Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=2.1.1->aqlm==1.0.1->aqlm[gpu]==1.0.1) (3.4.2)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=2.1.1->aqlm==1.0.1->aqlm[gpu]==1.0.1) (3.1.4)
Requirement already satisfied: fsspec in /usr/local/lib/python3.10/dist-packages (from torch>=2.1.1->aqlm==1.0.1->aqlm[gpu]==1.0.1) (2024.9.0)
Requirement already satisfied: sympy==1.13.1 in /usr/local/lib/python3.10/dist-packages (from torch>=2.1.1->aqlm==1.0.1->aqlm[gpu]==1.0.1) (1.13.1)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy==1.13.1->torch>=2.1.1->aqlm==1.0.1->aqlm[gpu]==1.0.1) (1.3.0)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=2.1.1->aqlm==1.0.1->aqlm[gpu]==1.0.1) (3.0.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->transformers==4.37.0->aqlm==1.0.1->aqlm[gpu]==1.0.1) (3.4.0)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers==4.37.0->aqlm==1.0.1->aqlm[gpu]==1.0.1) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers==4.37.0->aqlm==1.0.1->aqlm[gpu]==1.0.1) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers==4.37.0->aqlm==1.0.1->aqlm[gpu]==1.0.1) (2024.8.30)
Downloading aqlm-1.0.1-py3-none-any.whl (10 kB)
Downloading transformers-4.37.0-py3-none-any.whl (8.4 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 8.4/8.4 MB 29.1 MB/s eta 0:00:00
Downloading tokenizers-0.15.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 33.4 MB/s eta 0:00:00
Installing collected packages: tokenizers, transformers, aqlm
Attempting uninstall: tokenizers
Found existing installation: tokenizers 0.21.0
Uninstalling tokenizers-0.21.0:
Successfully uninstalled tokenizers-0.21.0
Attempting uninstall: transformers
Found existing installation: transformers 4.48.0.dev0
Uninstalling transformers-4.48.0.dev0:
Successfully uninstalled transformers-4.48.0.dev0
Attempting uninstall: aqlm
Found existing installation: aqlm 1.1.6
Uninstalling aqlm-1.1.6:
Successfully uninstalled aqlm-1.1.6
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
sentence-transformers 3.2.1 requires transformers<5.0.0,>=4.41.0, but you have transformers 4.37.0 which is incompatible.
Successfully installed aqlm-1.0.1 tokenizers-0.15.2 transformers-4.37.0
Collecting git+https://github.com/huggingface/accelerate.git@main
Cloning https://github.com/huggingface/accelerate.git (to revision main) to /tmp/pip-req-build-2w0i36ub
Running command git clone --filter=blob:none --quiet https://github.com/huggingface/accelerate.git /tmp/pip-req-build-2w0i36ub
Resolved https://github.com/huggingface/accelerate.git to commit 200c9eb7833cfa505907f6f224ebf5a275aa6d92
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: numpy<3.0.0,>=1.17 in /usr/local/lib/python3.10/dist-packages (from accelerate==1.2.0.dev0) (1.26.4)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from accelerate==1.2.0.dev0) (24.2)
Requirement already satisfied: psutil in /usr/local/lib/python3.10/dist-packages (from accelerate==1.2.0.dev0) (5.9.5)
Requirement already satisfied: pyyaml in /usr/local/lib/python3.10/dist-packages (from accelerate==1.2.0.dev0) (6.0.2)
Requirement already satisfied: torch>=1.10.0 in /usr/local/lib/python3.10/dist-packages (from accelerate==1.2.0.dev0) (2.5.1+cu121)
Requirement already satisfied: huggingface_hub>=0.21.0 in /usr/local/lib/python3.10/dist-packages (from accelerate==1.2.0.dev0) (0.26.5)
Requirement already satisfied: safetensors>=0.4.3 in /usr/local/lib/python3.10/dist-packages (from accelerate==1.2.0.dev0) (0.4.5)
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from huggingface_hub>=0.21.0->accelerate==1.2.0.dev0) (3.16.1)
Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface_hub>=0.21.0->accelerate==1.2.0.dev0) (2024.9.0)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from huggingface_hub>=0.21.0->accelerate==1.2.0.dev0) (2.32.3)
Requirement already satisfied: tqdm>=4.42.1 in /usr/local/lib/python3.10/dist-packages (from huggingface_hub>=0.21.0->accelerate==1.2.0.dev0) (4.66.6)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface_hub>=0.21.0->accelerate==1.2.0.dev0) (4.12.2)
Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.10.0->accelerate==1.2.0.dev0) (3.4.2)
Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.10.0->accelerate==1.2.0.dev0) (3.1.4)
Requirement already satisfied: sympy==1.13.1 in /usr/local/lib/python3.10/dist-packages (from torch>=1.10.0->accelerate==1.2.0.dev0) (1.13.1)
Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.10/dist-packages (from sympy==1.13.1->torch>=1.10.0->accelerate==1.2.0.dev0) (1.3.0)
Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=1.10.0->accelerate==1.2.0.dev0) (3.0.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface_hub>=0.21.0->accelerate==1.2.0.dev0) (3.4.0)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface_hub>=0.21.0->accelerate==1.2.0.dev0) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface_hub>=0.21.0->accelerate==1.2.0.dev0) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->huggingface_hub>=0.21.0->accelerate==1.2.0.dev0) (2024.8.30)
Building wheels for collected packages: accelerate
Building wheel for accelerate (pyproject.toml) ... done
Created wheel for accelerate: filename=accelerate-1.2.0.dev0-py3-none-any.whl size=336425 sha256=fbae703d2b92f11dd8263245220dc476caf6b1f726001b2759029647f011245a
Stored in directory: /tmp/pip-ephem-wheel-cache-kdzsa9oi/wheels/cd/b0/9d/d347fd6b94103ddd03d77cf769a11d5ea7fe8b59dcfdbaeb93
Successfully built accelerate
Installing collected packages: accelerate
Attempting uninstall: accelerate
Found existing installation: accelerate 1.1.1
Uninstalling accelerate-1.1.1:
Successfully uninstalled accelerate-1.1.1
Successfully installed accelerate-1.2.0.dev0
Collecting git+https://github.com/BlackSamorez/transformers.git@aqlm
Cloning https://github.com/BlackSamorez/transformers.git (to revision aqlm) to /tmp/pip-req-build-ijzj5uxn
Running command git clone --filter=blob:none --quiet https://github.com/BlackSamorez/transformers.git /tmp/pip-req-build-ijzj5uxn
Running command git checkout -b aqlm --track origin/aqlm
Switched to a new branch 'aqlm'
Branch 'aqlm' set up to track remote branch 'aqlm' from 'origin'.
Resolved https://github.com/BlackSamorez/transformers.git to commit f2e0ed3abddfdaf5f833f0bcc175a8f569a2c709
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from transformers==4.38.0.dev0) (3.16.1)
Requirement already satisfied: huggingface-hub<1.0,>=0.19.3 in /usr/local/lib/python3.10/dist-packages (from transformers==4.38.0.dev0) (0.26.5)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from transformers==4.38.0.dev0) (1.26.4)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.10/dist-packages (from transformers==4.38.0.dev0) (24.2)
Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from transformers==4.38.0.dev0) (6.0.2)
Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers==4.38.0.dev0) (2024.9.11)
Requirement already satisfied: requests in /usr/local/lib/python3.10/dist-packages (from transformers==4.38.0.dev0) (2.32.3)
Requirement already satisfied: tokenizers<0.19,>=0.14 in /usr/local/lib/python3.10/dist-packages (from transformers==4.38.0.dev0) (0.15.2)
Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers==4.38.0.dev0) (0.4.5)
Requirement already satisfied: tqdm>=4.27 in /usr/local/lib/python3.10/dist-packages (from transformers==4.38.0.dev0) (4.66.6)
Requirement already satisfied: fsspec>=2023.5.0 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.19.3->transformers==4.38.0.dev0) (2024.9.0)
Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub<1.0,>=0.19.3->transformers==4.38.0.dev0) (4.12.2)
Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests->transformers==4.38.0.dev0) (3.4.0)
Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests->transformers==4.38.0.dev0) (3.10)
Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests->transformers==4.38.0.dev0) (2.2.3)
Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests->transformers==4.38.0.dev0) (2024.8.30)
Building wheels for collected packages: transformers
Building wheel for transformers (pyproject.toml) ... done
Created wheel for transformers: filename=transformers-4.38.0.dev0-py3-none-any.whl size=8462673 sha256=96fcb1fbd0d40361c577ea203a321370f366509900668adf552a94ef7b9bd4b8
Stored in directory: /tmp/pip-ephem-wheel-cache-5qcmg4tv/wheels/15/bb/23/48abdd36ec11de3a5a9696c73e917a1064aef91ccd9edde540
Successfully built transformers
Installing collected packages: transformers
Attempting uninstall: transformers
Found existing installation: transformers 4.37.0
Uninstalling transformers-4.37.0:
Successfully uninstalled transformers-4.37.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
aqlm 1.0.1 requires transformers==4.37.0, but you have transformers 4.38.0.dev0 which is incompatible.
sentence-transformers 3.2.1 requires transformers<5.0.0,>=4.41.0, but you have transformers 4.38.0.dev0 which is incompatible.
Successfully installed transformers-4.38.0.dev0


from transformers import AutoTokenizer, AutoModelForCausalLM
quantized_model = AutoModelForCausalLM.from_pretrained(
"ISTA-DASLab/Meta-Llama-3-8B-Instruct-AQLM-2Bit-1x16",
torch_dtype="auto", device_map="auto", low_cpu_mem_usage=True,trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")


%%time
output = quantized_model.generate(tokenizer("The relationship between humans and AI ", return_tensors="pt")["input_ids"].cuda(), min_new_tokens=128, max_new_tokens=128)
print(tokenizer.decode(output[0]))
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:1964: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
<|begin_of_text|>The relationship between humans and AI Thedef solve49. Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question
CPU times: user 11.8 s, sys: 434 ms, total: 12.2 s
Wall time: 38 s
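The warning above asks for an attention mask and a pad token id. A minimal sketch of a call that supplies both is shown below; the helper name `generate_text` is hypothetical, and reusing the EOS token as padding is an assumption that mirrors what the log says transformers falls back to. This addresses the warning only; whether it changes the repeated-token output is not something the sketch can establish.

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=128):
    """Generate a continuation, passing attention_mask and pad_token_id explicitly."""
    # The tokenizer returns input_ids and attention_mask; move both to the model's device.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs,  # forwards input_ids together with attention_mask
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,  # assumption: reuse EOS as the pad token
    )
    # Drop special tokens such as <|begin_of_text|> from the decoded string.
    return tokenizer.decode(output[0], skip_special_tokens=True)
```

With the objects from the cells above, usage would look like `print(generate_text(quantized_model, tokenizer, "The relationship between humans and AI "))`.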

https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct

Gated model You have been granted access to this model


from transformers import AutoTokenizer, AutoModelForCausalLM
quantized_model = AutoModelForCausalLM.from_pretrained(
"ISTA-DASLab/Meta-Llama-3-8B-Instruct-AQLM-2Bit-1x16",
torch_dtype="auto", device_map="auto", low_cpu_mem_usage=True,trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")


%%time
output = quantized_model.generate(tokenizer("who is ai?", return_tensors="pt")["input_ids"].cuda(), min_new_tokens=128, max_new_tokens=128)
print(tokenizer.decode(output[0]))
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
<|begin_of_text|>who is ai? ai Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0
CPU times: user 10.4 s, sys: 59.8 ms, total: 10.5 s
Wall time: 10.7 s
Working


from transformers import AutoTokenizer, AutoModelForCausalLM
quantized_model = AutoModelForCausalLM.from_pretrained(
"ISTA-DASLab/Phi-3-mini-4k-instruct-AQLM-PV-2Bit-1x16-hf",
torch_dtype="auto", device_map="auto", low_cpu_mem_usage=True,trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("ISTA-DASLab/Phi-3-mini-4k-instruct-AQLM-PV-2Bit-1x16-hf")


%%time
output = quantized_model.generate(tokenizer("who is ai?", return_tensors="pt")["input_ids"].cuda(), min_new_tokens=128, max_new_tokens=128)
print(tokenizer.decode(output[0]))
WARNING:transformers_modules.ISTA-DASLab.Phi-3-mini-4k-instruct-AQLM-PV-2Bit-1x16-hf.90cc18c5d0985a9105b1d551515d38e716dfc274.modeling_phi3:You are not running the flash-attention implementation, expect numerical differences.
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:1964: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
who is ai?

Answer:

AI, or artificial intelligence, refers to the development of computer systems that can perform tasks that typically require human intelligence. These tasks include problem-solving, learning, and decision-making. AI systems are designed to mimic human cognitive abilities and can be used in various applications, such as virtual assistants, recommendation systems, and autonomous vehicles.

The term "ai" is an acronym for "artificial intelligence." It is commonly used to refer to the broader field of AI, which encompasses various subfields and techn
CPU times: user 9.57 s, sys: 67 ms, total: 9.64 s
Wall time: 12.5 s

Working


from transformers import AutoTokenizer, AutoModelForCausalLM
quantized_model = AutoModelForCausalLM.from_pretrained(
"ISTA-DASLab/Qwen2-72B-AQLM-PV-1bit-1x16",
torch_dtype="auto", device_map="auto", low_cpu_mem_usage=True,trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("ISTA-DASLab/Qwen2-72B-AQLM-PV-1bit-1x16")
Working


%%time
output = quantized_model.generate(tokenizer("who is ai?", return_tensors="pt")["input_ids"].cuda(), min_new_tokens=128, max_new_tokens=128)
print(tokenizer.decode(output[0]))
Working


%%time
output = quantized_model.generate(tokenizer("what is 3+5=؟", return_tensors="pt")["input_ids"].cuda(), min_new_tokens=10, max_new_tokens=12)
print(tokenizer.decode(output[0]))
what is 3+5=؟ (::erte ;

偢焢/javchen!=( heavilyと一緒にuttgartcling杼
CPU times: user 1min 4s, sys: 54.8 s, total: 1min 59s
Wall time: 10min 36s
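For context on the timing above, a rough estimate of weight storage at different bit widths can be useful. The sketch below takes the parameter count from the "72B" in the model name, which is an assumption, and ignores codebooks, scales, unquantized embeddings, and activations, so real checkpoints are larger. One possible reading of a ten-minute wall time for a dozen tokens is that the weights did not fit entirely on the GPU, but the log does not confirm this.

```python
def weight_size_gb(num_params, bits_per_param):
    """Approximate weight storage in GB (10**9 bytes) for a given bit width."""
    return num_params * bits_per_param / 8 / 1e9


params_72b = 72e9  # assumption: parameter count read off the "72B" model name

for bits in (16, 2, 1):
    print(f"{bits:>2} bit: {weight_size_gb(params_72b, bits):.1f} GB")
```

This prints roughly 144 GB for 16-bit, 18 GB for 2-bit, and 9 GB for 1-bit weights.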

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Configure BitsAndBytesConfig for 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load the model with the preset quantization configuration
# (model_id is defined in an earlier cell)
pretrained_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
)

!rm -rf --force /root/.cache

!huggingface-cli login



from transformers import AutoTokenizer, AutoModelForCausalLM
quantized_model = AutoModelForCausalLM.from_pretrained(
"ISTA-DASLab/Meta-Llama-3-8B-Instruct-AQLM-2Bit-1x16",
torch_dtype="auto", device_map="auto", low_cpu_mem_usage=True,trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:797: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning:
The secret HF_TOKEN does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/quantizers/auto.py:151: UserWarning: You passed quantization_config or equivalent parameters to from_pretrained but the model you're loading already has a quantization_config attribute. The quantization_config from the model will be prevail.
warnings.warn(warning_msg)
/usr/local/lib/python3.10/dist-packages/huggingface_hub/file_download.py:797: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

[ ]

%%time
output = quantized_model.generate(tokenizer("The relationship between humans and AI ", return_tensors="pt")["input_ids"].cuda(), min_new_tokens=128, max_new_tokens=128)
print(tokenizer.decode(output[0]))
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:1964: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
<|begin_of_text|>The relationship between humans and AI Thedef solve49. Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question
CPU times: user 11.6 s, sys: 418 ms, total: 12 s
Wall time: 35.8 s
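
The log above warns that the attention mask and pad token id were not set. A minimal sketch of a call that passes both is shown here. The helper name `generate_text` is hypothetical, and it assumes `model` and `tokenizer` are the objects loaded earlier. This should silence the warning, but whether it changes the garbled output is not established in this thread.

```python
def generate_text(model, tokenizer, prompt, new_tokens=128):
    """Generate text while passing the attention mask and pad token id explicitly."""
    # Keep the whole tokenizer output (input_ids AND attention_mask) and move
    # it to the device the model was placed on by device_map="auto".
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(
        **inputs,  # unpacks input_ids and attention_mask as keyword arguments
        min_new_tokens=new_tokens,
        max_new_tokens=new_tokens,
        pad_token_id=tokenizer.eos_token_id,  # same fallback the log reports
    )
    return tokenizer.decode(output[0])
```

Usage would look like `print(generate_text(quantized_model, tokenizer, "The relationship between humans and AI "))`.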

[2]
11s
output = quantized_model.generate(tokenizer("I'm AQLM, ", return_tensors="pt")["input_ids"].cuda(), min_new_tokens=128, max_new_tokens=128)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's attention_mask to obtain reliable results.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:1964: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(

[4]
0s
print(tokenizer.decode(output[0]))
<|begin_of_text|>I'm AQLM, 240Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question: Γƒ0Question
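
Both Llama outputs above collapse into one repeated fragment. As a rough way to flag this kind of degenerate generation automatically, here is a small sketch. The function name `distinct_word_ratio` is my own and is not part of any library. A ratio near 1.0 means mostly unique words, and a ratio near 0 means the text is stuck in a loop.

```python
def distinct_word_ratio(text):
    """Return the fraction of whitespace-separated words that are unique."""
    words = text.split()
    if not words:
        return 1.0  # treat empty text as "not repetitive"
    return len(set(words)) / len(words)


# Illustrative string shaped like the output above (ASCII-only stand-in).
looping = "I'm AQLM, " + "0Question: " * 30
normal = "AI systems are designed to mimic human cognitive abilities."

print(f"looping: {distinct_word_ratio(looping):.2f}")
print(f"normal:  {distinct_word_ratio(normal):.2f}")
```

This is only a heuristic for comparing runs. A low ratio says the output repeats, not why it does.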

Environment: Google Colab with a T4 GPU.
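
To make the report easier to reproduce, the installed package versions could be recorded alongside the hardware. This is a minimal sketch using only the standard library. The package list is my assumption about what is relevant here, and `installed_versions` is a hypothetical helper name.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Assumed list of packages involved in loading an AQLM checkpoint.
print(installed_versions(["aqlm", "transformers", "accelerate", "torch"]))
```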

The same code produces good output with the Phi-3 AQLM model:

from transformers import AutoTokenizer, AutoModelForCausalLM

quantized_model = AutoModelForCausalLM.from_pretrained(
    "ISTA-DASLab/Phi-3-mini-4k-instruct-AQLM-PV-2Bit-1x16-hf",
    torch_dtype="auto", device_map="auto", low_cpu_mem_usage=True, trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("ISTA-DASLab/Phi-3-mini-4k-instruct-AQLM-PV-2Bit-1x16-hf")

%%time
output = quantized_model.generate(tokenizer("who is ai?", return_tensors="pt")["input_ids"].cuda(), min_new_tokens=128, max_new_tokens=128)
print(tokenizer.decode(output[0]))

WARNING:transformers_modules.ISTA-DASLab.Phi-3-mini-4k-instruct-AQLM-PV-2Bit-1x16-hf.90cc18c5d0985a9105b1d551515d38e716dfc274.modeling_phi3:You are not running the flash-attention implementation, expect numerical differences.
/usr/local/lib/python3.10/dist-packages/torch/utils/cpp_extension.py:1964: UserWarning: TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
warnings.warn(
who is ai?

Answer:

AI, or artificial intelligence, refers to the development of computer systems that can perform tasks that typically require human intelligence. These tasks include problem-solving, learning, and decision-making. AI systems are designed to mimic human cognitive abilities and can be used in various applications, such as virtual assistants, recommendation systems, and autonomous vehicles.

The term "ai" is an acronym for "artificial intelligence." It is commonly used to refer to the broader field of AI, which encompasses various subfields and techn
CPU times: user 9.57 s, sys: 67 ms, total: 9.64 s
Wall time: 12.5 s
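
For a rough speed comparison, the wall times reported by `%%time` can be turned into tokens per second. Both runs set `min_new_tokens` and `max_new_tokens` to 128, so each produced exactly 128 new tokens. Note that the first Llama call may include one-time kernel compilation (the `cpp_extension` warning hints at this), so its figure is probably pessimistic.

```python
def tokens_per_second(new_tokens, wall_seconds):
    """Average generation throughput over one generate() call."""
    return new_tokens / wall_seconds


# Wall times copied from the %%time output in this post.
llama_first_call = tokens_per_second(128, 35.8)
phi_call = tokens_per_second(128, 12.5)

print(f"Llama-3-8B AQLM (first call): {llama_first_call:.2f} tok/s")
print(f"Phi-3-mini AQLM:              {phi_call:.2f} tok/s")
```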
