Converting to native Transformers

#17
by cyrilvallez HF staff - opened

This PR converts the model to be used natively within Transformers (see https://github.com/huggingface/transformers/pull/33823)

cyrilvallez changed pull request title from Upload folder using huggingface_hub to Converting to native Transformers

This PR may behave unexpectedly.

To reproduce:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    # "THUDM/glm-4-9b-chat-1m", revision="refs/pr/17",
    "THUDM/glm-4-9b-chat-1m",
    device_map="cuda",
    torch_dtype="auto",
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)
# tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat-1m", revision="refs/pr/17", )
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat-1m", trust_remote_code=True)

# input = "Hello, how are you?"
# input_encoding = tokenizer(input, return_tensors="pt").to("cuda")

import pickle
with open("test_input.pkl", "rb") as f:
    input_ids = pickle.load(f)

input_encoding = torch.tensor([input_ids]).to("cuda")
print(input_encoding.shape)
print(input_encoding.dtype)

out = model.generate(input_encoding, max_new_tokens=20)
print(tokenizer.decode(out[0, len(input_ids):], skip_special_tokens=True))

The original repo works fine:

torch.Size([1, 98796])
torch.int64
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
**The paper investigates the properties of order-divisor graphs associated with finite groups, providing a comprehensive description of**
(base) aiscuser@node-0:/scratch/MInference$ 

But this PR collapses as follows:

torch.Size([1, 98796])
torch.int64
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
 **the 2. 2, the 2. 2, the 2. 2**
(base) aiscuser@node-0:/scratch/MInference$

This error appears with long inputs; in my case the input is about 100K tokens.

@cyrilvallez @zRzRzRzRzRzRzR may need a double check here.

My transformers version: transformers==4.46.0.dev0
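One way to narrow down a length-dependent failure like this is to bisect the prompt length. The following is only a minimal sketch: `outputs_match` is a hypothetical callable (not part of Transformers) that would truncate the prompt to `n` tokens, run both model versions, and report whether their generations agree. It also assumes that once the outputs diverge, they stay divergent for longer prompts, which may not hold in practice.

```python
from typing import Callable


def smallest_failing_length(outputs_match: Callable[[int], bool], lo: int, hi: int) -> int:
    """Return the smallest n in (lo, hi] for which outputs_match(n) is False.

    Assumes outputs_match(lo) is True, outputs_match(hi) is False, and that
    divergence is monotonic in the prompt length (an assumption, not a given).
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if outputs_match(mid):
            lo = mid  # still fine at this length, look at longer prompts
        else:
            hi = mid  # already broken, look at shorter prompts
    return hi


# Stand-in predicate for illustration only: pretend outputs diverge past 32768 tokens.
fake_threshold = 32768
print(smallest_failing_length(lambda n: n <= fake_threshold, lo=1, hi=98796))
```

Each call to the real predicate would involve two long-context generations, so the logarithmic number of steps matters here.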

Could you check when generating from the text instead of importing the input_ids from file? That is instead of doing:

import pickle
with open("test_input.pkl", "rb") as f:
    input_ids = pickle.load(f)

do

with open("text.txt", "r") as f:
    text = f.read()

# Tokenize the raw text and move the ids to the same device as the model
input_ids = tokenizer.encode(text, return_tensors='pt').to(model.device)

I suspect this may come from slight changes in the tokenizer.
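To test the tokenizer hypothesis directly, the two id sequences can be compared position by position. This is a minimal sketch in plain Python; `first_mismatch` is a hypothetical helper, and the two lists would be the pickled `input_ids` and the ids obtained by re-encoding the same text with the tokenizer from the PR revision.

```python
from itertools import zip_longest
from typing import Optional, Sequence


def first_mismatch(a: Sequence[int], b: Sequence[int]) -> Optional[int]:
    """Return the first index where the two id sequences differ, or None if identical.

    A length difference counts as a mismatch at the end of the shorter sequence.
    """
    sentinel = object()  # never equal to a real token id
    for i, (x, y) in enumerate(zip_longest(a, b, fillvalue=sentinel)):
        if x != y:
            return i
    return None


# Toy ids for illustration only; these are not real token ids from the model.
print(first_mismatch([1, 2, 3, 4], [1, 2, 9, 4]))
```

A result of `None` for the real inputs would suggest the tokenizer is not the source of the divergence.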

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

A new repository will also be created for this model, to be used for the adaptation.

@cyrilvallez Hi Cyril, I re-tested the HF-native version as you suggested, and the error remains. The tokenizer seems to behave consistently, so I have no idea where the bug is: https://huggingface.co/THUDM/glm-4-9b-chat-1m-hf/discussions/1.

You can also find the test example I used in the above link.

@cyrilvallez Hi Cyril, this PR does not work in the first place. I suspect you did not run any long-context tests on it.
Maybe you can share your weight-conversion script so we can help review it.


My test script:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    # "THUDM/glm-4-9b-chat-1m", revision="refs/pr/17",
    "THUDM/glm-4-9b-chat-1m",
    device_map="cuda",
    torch_dtype="auto",
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
)
# tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat-1m", revision="refs/pr/17", )
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat-1m", trust_remote_code=True)

with open("t.txt", "r") as f:
    input_ids = tokenizer.encode(f.read())

input_encoding = torch.tensor([input_ids]).to("cuda")
print(input_encoding.shape)
print(input_encoding.dtype)

out = model.generate(input_encoding, max_new_tokens=100)
print(tokenizer.decode(out[0, len(input_ids):], skip_special_tokens=True))

The behaviour differs between your PR (first output) and the original model (second output); see below.

(base) [email protected]@GCRAZGDL1694:~/MInference$  cd /home/v-yuchengli/MInference ; /usr/bin/env /home/v-yuchengli/miniconda3/envs/llm/bin/python /home/v-yuchengli/.cursor-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher 47591 -- /home/v-yuchengli/MInference/t.py
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████| 4/4 [00:04<00:00,  1.12s/it]
torch.Size([1, 137369])
====================
6f6f6b7f6c7f6b6b6b6c7f6f6f6b6f6b6c7f6b6b6c7f6c7c7c7c7c7c7c7c7c7c7c7c7c7c7f6c7c7c7c7c7c7c4b6c7c7c7c7c

(base) [email protected]@GCRAZGDL1694:~/MInference$  cd /home/v-yuchengli/MInference ; /usr/bin/env /home/v-yuchengli/miniconda3/envs/llm/bin/python /home/v-yuchengli/.cursor-server/extensions/ms-python.debugpy-2024.6.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher 44467 -- /home/v-yuchengli/MInference/t.py
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████| 10/10 [03:40<00:00, 22.05s/it]
torch.Size([1, 137369])
====================
\"cb59052b-9128-4979-9c0e-e1de4adcf73b\"The value associated with the specified key is "cb59052b-9128-4979-9c0e-e1de4adcf73b". The key you provided is "6ab6ea3e-f288-4f33-ba46-7f42bb75b03f". The value associated with

Hey @liyucheng ! I suspect the error may come from these 2 lines: https://github.com/huggingface/transformers/blob/main/src/transformers/models/glm/modeling_glm.py#L169-L170
Could you try without them (just plainly remove them) and let me know?

@cyrilvallez Hi Cyril, I tried that, but it did not work. I also re-implemented the RoPE function (apply_rotary_pos_emb) following the original GLM implementation:


def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)

    # Interleave them instead of usual shape
    # cos = cos[..., : cos.shape[-1] // 2].repeat_interleave(2, dim=-1)
    # sin = sin[..., : sin.shape[-1] // 2].repeat_interleave(2, dim=-1)
    cos = cos[..., : cos.shape[-1] // 2]
    sin = sin[..., : sin.shape[-1] // 2]

    # Keep half for later concatenation
    q, q_pass = q[..., : q.shape[-1] // 2], q[..., q.shape[-1] // 2 :]
    k, k_pass = k[..., : k.shape[-1] // 2], k[..., k.shape[-1] // 2 :]

    # Apply rotary embeddings on the first half
    # q_embed = (q * cos) + (rotate_half(q) * sin)
    # k_embed = (k * cos) + (rotate_half(k) * sin)
    qshaped = q.reshape(q.shape[0], q.shape[1], -1, q.shape[-1] // 2, 2)
    kshaped = k.reshape(k.shape[0], k.shape[1], -1, k.shape[-1] // 2, 2)
    q_embed = torch.stack(
        [
            qshaped[..., 0] * cos - qshaped[..., 1] * sin,
            qshaped[..., 0] * sin + qshaped[..., 1] * cos,
        ],
        dim=-1,
    )
    k_embed = torch.stack(
        [
            kshaped[..., 0] * cos - kshaped[..., 1] * sin,
            kshaped[..., 0] * sin + kshaped[..., 1] * cos,
        ],
        dim=-1,
    )
    q_embed = q_embed.flatten(3)
    k_embed = k_embed.flatten(3)
    # Concatenate back to full shape
    q_embed = torch.cat([q_embed, q_pass], dim=-1)
    k_embed = torch.cat([k_embed, k_pass], dim=-1)
    return q_embed, k_embed

It does not work either. Do you think the bug comes from the model weights?
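The two RoPE formulations in the snippet above can be compared numerically on a single vector. The sketch below is plain Python rather than torch so that it runs anywhere. It assumes the native version repeats each cos/sin value twice (as the commented-out `repeat_interleave` lines suggest) and uses a `rotate_half` that maps each adjacent pair (a, b) to (-b, a); that last detail is an assumption here, not something shown in this thread. The dimension, position and base are illustrative values, not the model's config.

```python
import math
import random


def rope_pairwise(x, cos, sin):
    """Rotate each adjacent pair (x[2i], x[2i+1]) by angle i, like the stack-based code."""
    out = []
    for i in range(len(x) // 2):
        a, b = x[2 * i], x[2 * i + 1]
        out += [a * cos[i] - b * sin[i], a * sin[i] + b * cos[i]]
    return out


def rotate_half_adjacent(x):
    """Assumed rotate_half variant: maps each adjacent pair (a, b) to (-b, a)."""
    out = []
    for i in range(len(x) // 2):
        out += [-x[2 * i + 1], x[2 * i]]
    return out


def rope_interleaved(x, cos, sin):
    """Compute x * cos + rotate_half(x) * sin with cos/sin repeated twice per angle."""
    cos2 = [c for c in cos for _ in range(2)]  # plays the role of repeat_interleave(2)
    sin2 = [s for s in sin for _ in range(2)]
    rot = rotate_half_adjacent(x)
    return [xv * c + rv * s for xv, c, rv, s in zip(x, cos2, rot, sin2)]


random.seed(0)
dim, position, base = 64, 98795, 10000.0  # illustrative values only
angles = [position * base ** (-2 * i / dim) for i in range(dim // 2)]
cos = [math.cos(t) for t in angles]
sin = [math.sin(t) for t in angles]
x = [random.uniform(-1.0, 1.0) for _ in range(dim)]

pairwise = rope_pairwise(x, cos, sin)
interleaved = rope_interleaved(x, cos, sin)
max_diff = max(abs(p - q) for p, q in zip(pairwise, interleaved))
print(max_diff)
```

If the two forms agree under these assumptions, swapping one for the other would not be expected to change the output, which would be consistent with the re-implementation above making no difference.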

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

We submitted a new pull request for the GLM-Edge model. In GLM-Edge, this implementation has been modified and meets expectations in performance testing.
That pull request has been merged.

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

GLM-4, however, still uses the original implementation, as mentioned in this link.

@liyucheng thanks for checking it out! I'm fairly confident the model definition is mathematically equivalent to the one in the original code (I took quite some time looking at it at the time) -- RoPE was my best guess for where I could have made a mistake. Of course, this does not mean that nothing slipped past me; if you're willing to check, it's always better to make sure.
But given that both my tests passed at the time, and that the new version also seems to work well according to @zRzRzRzRzRzRzR , I'd say the issue is one of the following:

  • very small differences (due to shapes) that accumulate (with such long context, it's gonna accumulate a lot)
  • conversion of the weights (but unlikely as I used the same script for all the conversions), or something in the config?

Maybe you can start by re-converting the weights and checking again? You can use this script for it. It has since been modified to convert the new version as well, but it should still work for the old one.
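For the "something in the config" possibility, a plain dictionary diff can surface mismatched values. This is a minimal sketch: `diff_configs` is a hypothetical helper, the values below are made up for illustration, and real dictionaries could be obtained with `config.to_dict()` on two loaded configs. Key names may differ between the remote-code config and the native one, so the comparison is most useful between two configs in the same format, for example before and after re-conversion.

```python
from typing import Any, Dict, Tuple

MISSING = "<missing>"  # placeholder for keys present on only one side


def diff_configs(a: Dict[str, Any], b: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each key whose value differs, or exists on one side only, to both values."""
    diff = {}
    for key in sorted(set(a) | set(b)):
        left, right = a.get(key, MISSING), b.get(key, MISSING)
        if left != right:
            diff[key] = (left, right)
    return diff


# Made-up toy values for illustration; these are not the model's real settings.
config_a = {"hidden_size": 8, "rope_theta": 10000.0}
config_b = {"hidden_size": 8, "rope_theta": 500.0}
print(diff_configs(config_a, config_b))
```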

Ready to merge
This branch is ready to get merged automatically.
