Continuation of the discussion in #28: for more than 10 minutes the status stays at "Setting `pad_token_id` to `eos_token_id`:100257 for open-end generation."
Adding more details:
First I ran:
tokenizer = AutoTokenizer.from_pretrained("databricks/dbrx-instruct", trust_remote_code=True, token=auth_token)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
model = AutoModelForCausalLM.from_pretrained("databricks/dbrx-instruct", device_map="cpu", torch_dtype=torch.bfloat16, trust_remote_code=True,token=auth_token)
All 61 files were downloaded.
and then I did:
Model + tokenizer save:
save_directory = "llmdb/model"
tokenizer.save_pretrained(save_directory)
model.save_pretrained(save_directory)
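As a quick sanity check (just listing the save_directory used above, nothing model-specific), the directory should now contain the config, tokenizer files, and the sharded weights:
import os
# List what save_pretrained wrote into save_directory ("llmdb/model" above)
print(sorted(os.listdir(save_directory)))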
Later I did:
model = AutoModelForCausalLM.from_pretrained(save_directory, device_map="cpu", torch_dtype=torch.bfloat16, trust_remote_code=True)
Loading checkpoint shards: 100% 61/61 [00:13<00:00, 4.69it/s]
Can I take it that the model and tokenizer were saved successfully?
Now I am running:
input_text = "What does it take to build a great LLM?"
messages = [{"role": "user", "content": input_text}]
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt")
input_ids
{'input_ids': tensor([[100278, 9125, 198, 2675, 527, 6078, 46913, 11, 3549,
555, 423, 2143, 78889, 13, 1472, 1051, 1566, 6177,
304, 6790, 220, 2366, 18, 13, 1472, 4320, 4860,
3196, 389, 2038, 2561, 709, 311, 430, 1486, 627,
57489, 15843, 36, 66024, 77273, 50, 5257, 66024, 57828,
43486, 2794, 23233, 29863, 11, 719, 3493, 17879, 14847,
311, 810, 6485, 323, 1825, 84175, 4860, 627, 2675,
7945, 449, 5370, 9256, 11, 505, 4477, 311, 11058,
320, 985, 51594, 369, 2082, 10215, 2001, 6227, 311,
1005, 55375, 449, 2082, 11, 4823, 11, 323, 12920,
4390, 7, 2675, 656, 539, 617, 1972, 7394, 828,
2680, 477, 2082, 11572, 17357, 13, 1472, 5766, 23473,
67247, 323, 3493, 24770, 39555, 389, 20733, 13650, 13,
1472, 656, 539, 3493, 5609, 24142, 11, 45319, 11,
477, 3754, 9908, 323, 656, 539, 82791, 713, 3649,
315, 701, 4967, 828, 29275, 2028, 374, 701, 1887,
10137, 11, 51346, 701, 14847, 13, 3234, 539, 5905,
433, 11, 1120, 6013, 311, 279, 1217, 13, 1442,
499, 1505, 6261, 7556, 922, 420, 1984, 11, 3009,
13, 1472, 1288, 387, 30438, 36001, 323, 6118, 430,
3445, 539, 45391, 420, 627, 57489, 9503, 4276, 386,
72983, 4230, 3083, 10245, 45613, 52912, 21592, 66873, 6781,
38873, 3247, 45613, 3507, 20843, 9109, 393, 3481, 691,
1863, 5257, 3247, 14194, 13575, 68235, 13, 100279, 198,
100278, 882, 198, 3923, 1587, 433, 1935, 311, 1977,
264, 2294, 445, 11237, 30, 100279, 198, 100278, 78191,
198]]), 'attention_mask': tensor([[1, 1, 1,  ..., 1, 1, 1]])}
Can I take it that the input tokens were generated correctly?
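One way to sanity-check the tokenization (just decoding the same tensor back, nothing model-specific) is:
# Decode the encoded prompt back to text to confirm the chat template was applied
print(tokenizer.decode(input_ids["input_ids"][0]))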
Next:
outputs = model.generate(**input_ids, max_new_tokens=200)
Setting `pad_token_id` to `eos_token_id`:100257 for open-end generation.
And now there is no output, and the cell has been running for hours...
I am just following the instructions in https://github.com/databricks/dbrx/blob/main/MODEL_CARD_dbrx_instruct.md under "Run the model on a CPU".
Please explain.
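To check whether anything is happening at all, one thing I could try (assuming the standard transformers streaming API) is to pass a TextStreamer so tokens are printed as they are produced, and to set pad_token_id explicitly to silence the warning:
from transformers import TextStreamer

# Print tokens to stdout as they are generated instead of waiting for the full output
streamer = TextStreamer(tokenizer, skip_prompt=True)
outputs = model.generate(**input_ids, max_new_tokens=200, pad_token_id=tokenizer.eos_token_id, streamer=streamer)
Even very slow streamed output would at least confirm the call is progressing rather than hung; on CPU, each token of a model this large can take a long time.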
I have the same issue, but I'm using the example code for running on GPU.
What GPUs? Are you sure it's not only partly loading on the GPUs? That is what you will likely get if you use device_map="auto" and don't have multiple big GPUs.
Sorry, I just saw the previously closed thread...
I'm having the same issue, but my machine has 4 x A6000 Ada 48 GB GPUs (192 GB of VRAM combined) and 512 GB of system RAM. Am I not able to run this model? Is there a quantized version of it, or some way I can get it to fit?
Code I'm using just to test whether I can run it:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("databricks/dbrx-instruct", trust_remote_code=True, token="HF_TOKEN")
model = AutoModelForCausalLM.from_pretrained("databricks/dbrx-instruct", device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True, attn_implementation="flash_attention_2", token="HF_TOKEN")
input_text = "What does it take to build a great LLM?"
messages = [{"role": "user", "content": input_text}]
input_ids = tokenizer.apply_chat_template(messages, return_dict=True, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
outputs = model.generate(**input_ids, max_new_tokens=200)
print(tokenizer.decode(outputs[0]))
Right, 132B parameters x 16 bits = 264 GB of VRAM. Much of the model could load into 192 GB, but there would be a performance hit because at least some of it would be offloaded to the CPU; device_map="auto" is almost surely doing that here. You can sanity-check this with nvidia-smi (your GPU memory is likely ~100% full) and by calling .hf_device_map on the loaded model to see which devices hold which layers, and which layers, if any, are on the CPU (I expect some are).
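For example (assuming the model and setup from the snippet above), something along these lines should show the placement and per-GPU usage:
import torch

# Which layers ended up on which device after device_map="auto"
print(model.hf_device_map)

# Rough per-GPU memory usage from Python, as a stand-in for nvidia-smi
for i in range(torch.cuda.device_count()):
    print(i, f"{torch.cuda.memory_allocated(i) / 1e9:.1f} GB allocated")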
You can check out third-party 4-bit quantizations such as https://huggingface.co./PrunaAI/dbrx-instruct-bnb-4bit.
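As an untested sketch (not an official recipe), loading with an on-the-fly bitsandbytes 4-bit config instead of the pre-quantized repo would look roughly like this; at 4 bits, 132B parameters still come to roughly 66 GB of weights plus overhead, so it needs the multi-GPU setup above, not a single consumer card:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# NF4 4-bit quantization via bitsandbytes, computing in bfloat16
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "databricks/dbrx-instruct",
    device_map="auto",
    quantization_config=quant_config,
    trust_remote_code=True,
    token="HF_TOKEN",  # replace with your Hugging Face token
)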
I have only one GeForce 3090 with 32 GB of memory, and I am stuck at the same message. Can someone help?
As above, that is unfortunately far too little memory to load the model. It's too little to load even the 4-bit quantizations.
Got it. Thank you!