Converting to native Transformers
This PR converts the model to be used natively within Transformers (see https://github.com/huggingface/transformers/pull/33823)
Will this break compatibility for implementations like llama.cpp?
Thank you for your support. I will take a look in the next few days, and if it works correctly we will merge this set of specifications into transformers.
I found that this code does not run properly. Should it be modified like this?
def _pad(
        self,
        encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding],
        max_length: Optional[int] = None,
        padding_side: str = "left",  # Add this argument
        padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
        pad_to_multiple_of: Optional[int] = None,
        return_attention_mask: Optional[bool] = None,
) -> dict:
Additionally, the apply_chat_template function has been deprecated, and you can use the one provided by transformers directly. Can this comment be deleted?
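For reference, a minimal sketch of what calling the built-in apply_chat_template directly could look like (the model id and PR revision are the ones used later in this thread; the chat template shipped in tokenizer_config.json is assumed):

from transformers import AutoTokenizer

# The base-class apply_chat_template reads the chat template from
# tokenizer_config.json, so no custom .py implementation is needed.
tokenizer = AutoTokenizer.from_pretrained('THUDM/glm-4-9b-chat', revision="refs/pr/81")
messages = [{"role": "user", "content": "Hello"}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors='pt')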
A complete version of the code could perhaps look like this:
import regex as re
import base64
import os
import tiktoken
from typing import List, Optional, Union, Dict
from transformers import PreTrainedTokenizer
from transformers.utils import PaddingStrategy
from transformers.tokenization_utils_base import EncodedInput, BatchEncoding
class ChatGLM4Tokenizer(PreTrainedTokenizer):
    vocab_files_names = {"vocab_file": "tokenizer.model"}
    model_input_names = ["input_ids", "attention_mask", "position_ids"]

    def __init__(
            self,
            vocab_file,
            clean_up_tokenization_spaces=False,
            **kwargs
    ):
        self.name = "GLM4Tokenizer"
        self.vocab_file = vocab_file
        pat_str = "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
        self.pat_str = re.compile(pat_str)

        # Build the tiktoken mergeable ranks from the base64-encoded vocab file.
        mergeable_ranks = {}
        with open(vocab_file) as f:
            for line in f:
                token, rank = line.strip().split()
                rank = int(rank)
                token = base64.b64decode(token)
                mergeable_ranks[token] = rank
        self.mergeable_ranks = mergeable_ranks
        self.tokenizer = tiktoken.Encoding(
            name="my_tokenizer",
            pat_str=pat_str,
            mergeable_ranks=mergeable_ranks,
            special_tokens={}
        )
        self.decoder = {rank: token for token, rank in mergeable_ranks.items()}
        self.n_words = len(self.decoder)
        super().__init__(
            clean_up_tokenization_spaces=clean_up_tokenization_spaces,
            **kwargs
        )
    @property
    def vocab_size(self):
        return self.n_words

    def get_vocab(self):
        """ Returns vocab as a dict """
        vocab = {self._convert_id_to_token(i): i for i in range(self.vocab_size)}
        vocab.update(self.added_tokens_encoder)
        return vocab
    def convert_tokens_to_string(self, tokens: List[Union[bytes, str, int]]) -> str:
        """
        Converts a sequence of tokens into a single string.
        """
        text = ""
        temp = b""
        for t in tokens:
            if isinstance(t, int):
                t = chr(t)
            if isinstance(t, str):
                if temp:
                    text += temp.decode("utf-8", errors="replace")
                    temp = b""
                text += t
            elif isinstance(t, bytes):
                temp += t
            else:
                raise TypeError("token should only be of type int, bytes or str")
        if temp:
            text += temp.decode("utf-8", errors="replace")
        return text
    def _tokenize(self, text, **kwargs):
        tokens = []
        ids = self.tokenizer.encode(text)
        for t in ids:
            tokens.append(self.decoder[t])
        return tokens

    def _convert_token_to_id(self, token):
        """ Converts a token (str) into an id using the vocab. """
        return self.mergeable_ranks[token]

    def _convert_id_to_token(self, index):
        """Converts an index (integer) into a token (str) using the vocab."""
        return self.decoder.get(index, "")
    def save_vocabulary(self, save_directory, filename_prefix=None):
        """
        Save the vocabulary and special tokens file to a directory.

        Args:
            save_directory (`str`):
                The directory in which to save the vocabulary.
            filename_prefix (`str`, *optional*):
                An optional prefix to add to the names of the saved files.

        Returns:
            `Tuple(str)`: Paths to the files saved.
        """
        if os.path.isdir(save_directory):
            vocab_file = os.path.join(
                save_directory, self.vocab_files_names["vocab_file"]
            )
        else:
            vocab_file = save_directory
        with open(self.vocab_file, 'rb') as fin:
            proto_str = fin.read()
        with open(vocab_file, "wb") as writer:
            writer.write(proto_str)
        return (vocab_file,)
    def get_prefix_tokens(self):
        prefix_tokens = [self.convert_tokens_to_ids("[gMASK]"), self.convert_tokens_to_ids("<sop>")]
        return prefix_tokens

    def build_single_message(self, role, metadata, message, tokenize=True):
        assert role in ["system", "user", "assistant", "observation"], role
        if tokenize:
            role_tokens = [self.convert_tokens_to_ids(f"<|{role}|>")] + self.tokenizer.encode(f"{metadata}\n",
                                                                                              disallowed_special=())
            message_tokens = self.tokenizer.encode(message, disallowed_special=())
            tokens = role_tokens + message_tokens
            return tokens
        else:
            return str(f"<|{role}|>{metadata}\n{message}")
    def build_inputs_with_special_tokens(
            self, token_ids_0: List[int], token_ids_1: Optional[List[int]] = None
    ) -> List[int]:
        """
        Build model inputs from a sequence or a pair of sequences by concatenating and adding special tokens.
        A ChatGLM4 sequence has the following format:

        - single sequence: `[gMASK] <sop> X`
        - pair of sequences: `[gMASK] <sop> A B <eos>`

        Args:
            token_ids_0 (`List[int]`):
                List of IDs to which the special tokens will be added.
            token_ids_1 (`List[int]`, *optional*):
                Optional second list of IDs for sequence pairs.

        Returns:
            `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens.
        """
        prefix_tokens = self.get_prefix_tokens()
        token_ids_0 = prefix_tokens + token_ids_0
        if token_ids_1 is not None:
            token_ids_0 = token_ids_0 + token_ids_1 + [self.convert_tokens_to_ids("<eos>")]
        return token_ids_0
    def _pad(
            self,
            encoded_inputs: Union[Dict[str, EncodedInput], BatchEncoding],
            max_length: Optional[int] = None,
            padding_side: str = "left",
            padding_strategy: PaddingStrategy = PaddingStrategy.DO_NOT_PAD,
            pad_to_multiple_of: Optional[int] = None,
            return_attention_mask: Optional[bool] = None,
    ) -> dict:
        """
        Pad encoded inputs (on left/right and up to predefined length or max length in the batch)

        Args:
            encoded_inputs:
                Dictionary of tokenized inputs (`List[int]`) or batch of tokenized inputs (`List[List[int]]`).
            max_length: maximum length of the returned list and optionally padding length (see below).
                Will truncate by taking into account the special tokens.
            padding_strategy: PaddingStrategy to use for padding.

                - PaddingStrategy.LONGEST: Pad to the longest sequence in the batch
                - PaddingStrategy.MAX_LENGTH: Pad to the max length (default)
                - PaddingStrategy.DO_NOT_PAD: Do not pad
                The tokenizer padding sides are defined in self.padding_side:

                    - 'left': pads on the left of the sequences
                    - 'right': pads on the right of the sequences
            pad_to_multiple_of: (optional) Integer if set will pad the sequence to a multiple of the provided value.
                This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
                `>= 7.5` (Volta).
            return_attention_mask:
                (optional) Set to False to avoid returning attention mask (default: set to model specifics)
        """
        # Load from model defaults
        required_input = encoded_inputs[self.model_input_names[0]]
        seq_length = len(required_input)

        if padding_strategy == PaddingStrategy.LONGEST:
            max_length = len(required_input)

        if max_length is not None and pad_to_multiple_of is not None and (max_length % pad_to_multiple_of != 0):
            max_length = ((max_length // pad_to_multiple_of) + 1) * pad_to_multiple_of

        needs_to_be_padded = padding_strategy != PaddingStrategy.DO_NOT_PAD and len(required_input) != max_length

        # Initialize attention mask if not present.
        if "attention_mask" not in encoded_inputs:
            encoded_inputs["attention_mask"] = [1] * seq_length

        if "position_ids" not in encoded_inputs:
            encoded_inputs["position_ids"] = list(range(seq_length))

        if needs_to_be_padded:
            difference = max_length - len(required_input)

            if "attention_mask" in encoded_inputs:
                encoded_inputs["attention_mask"] = [0] * difference + encoded_inputs["attention_mask"]
            if "position_ids" in encoded_inputs:
                encoded_inputs["position_ids"] = [0] * difference + encoded_inputs["position_ids"]
            encoded_inputs[self.model_input_names[0]] = [self.pad_token_id] * difference + required_input

        return encoded_inputs
Hi @zRzRzRzRzRzRzR! Not sure how you tested this, but the _pad issue is actually coming from your own version. With the new model added in transformers, nothing relies on your custom .py files anymore. Since it is not already merged (it will be soon, we are just correcting issues from our automatic file converter internally), you need to install transformers from the correct branch for now: pip install git+https://github.com/huggingface/transformers.git@glm. Then, to try it out, specify the revision of this PR on the hub when loading the model:
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
device = 3
tokenizer = AutoTokenizer.from_pretrained('THUDM/glm-4-9b-chat', revision="refs/pr/81")
model = AutoModelForCausalLM.from_pretrained('THUDM/glm-4-9b-chat', torch_dtype=torch.float16, revision="refs/pr/81").to(device)
sequence = 'Hello I am doing'
inputs = tokenizer.encode(sequence, return_tensors='pt').to(device)
out = model.generate(inputs, do_sample=False, max_new_tokens=50)
print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])
Let me know if you still experience issues when doing this. From my tests, everything runs smoothly.
PS: I would advise you to wait until https://github.com/huggingface/transformers/pull/33823 is correctly merged in transformers before merging this PR on the hub. That way, users will only need to install it from main to have the model already available.
PPS: Not exactly sure how llama.cpp works, but if a correct transformers version (containing the model definition) is installed in the environment, I don't see any reason why it should not work properly.
Yes, I have also noticed this issue; therefore, I fixed the padding problem in the main branch of this code yesterday. There is no need to modify anything in your PR. I have already uploaded the modified tokenizer_chatglm.py code to the main branch of this repository.
Regarding the installation, I have successfully installed the GLM branch and debugged the generation part, and it is working properly. Once this PR is merged into the main branch and a release is published, I will proceed with merging this PR (you can merge the changes I made to tokenizer_chatglm.py yesterday into this PR).
Could you be more clear about the changes you want to the tokenizer? As I said, with this PR no code relies on your .py files (which means you could delete them all in this repo; I forgot to do so when opening this PR). The tokenizer is one of our PreTrainedTokenizerFast (created from your tokenizer.model), in which I added a post processor to always add your two BOS tokens ([gMASK]<sop>) automatically. If you want to change this and/or the chat template let me know; otherwise the inner workings of the tokenizer now rely on our own PreTrainedTokenizerFast (and you can easily change these small settings of the tokenizer yourself after you merge this).
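As a quick sanity check, a minimal sketch (assuming the PR revision mentioned above): encode any text and inspect the first two tokens, which the post processor should prepend automatically.

from transformers import AutoTokenizer

# The fast tokenizer's post processor should prepend the two BOS tokens.
tok = AutoTokenizer.from_pretrained('THUDM/glm-4-9b-chat', revision="refs/pr/81")
ids = tok("Hello")["input_ids"]
print(tok.convert_ids_to_tokens(ids)[:2])  # expected: ['[gMASK]', '<sop>']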
Oh, I understand what you mean now. No modifications are needed, and any changes you’ve made on GitHub don’t require further adjustments.
The content here https://github.com/huggingface/transformers/pull/33823 doesn’t need any changes. It’s perfectly fine, and I sincerely apologize for the misunderstanding! The chat template doesn’t need any modifications either.
Also, the code should look like this, not using AutoTokenizer but PreTrainedTokenizerFast, right?
Here is a simple example of a dialogue using this code:
from transformers import PreTrainedTokenizerFast, GlmForCausalLM
device = 3
tokenizer = PreTrainedTokenizerFast.from_pretrained('glm-4-9b-chat')
model = GlmForCausalLM.from_pretrained('glm-4-9b-chat').to(device)
message = [
    {
        "role": "system",
        "content": "Answer the following question."
    },
    {
        "role": "user",
        "content": "How many legs does a cat have?"
    },
    {
        "role": "assistant",
        "content": "A cat has four legs."
    },
    {
        "role": "user",
        "content": "Is the animal I just asked about a mammal?"
    }
]
inputs = tokenizer.apply_chat_template(
    message,
    return_tensors='pt',
    add_generation_prompt=True,
    return_dict=True
).to(device)
input_len = inputs['input_ids'].shape[1]
generate_kwargs = {
    "input_ids": inputs['input_ids'],
    "attention_mask": inputs['attention_mask'],
    "max_new_tokens": 128,
    "do_sample": False,
}
out = model.generate(**generate_kwargs)
print(tokenizer.decode(out[0][input_len:], skip_special_tokens=True))
The model folder only contains the following files:
.
├── config.json
├── configuration.json
├── generation_config.json
├── LICENSE
├── model-00001-of-00004.safetensors
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── model.safetensors.index.json
├── README_en.md
├── README.md
├── tokenizer_config.json
└── tokenizer.json
It is running normally. I checked the template and it is correct, so no modification is needed.
I believe the misunderstanding has been resolved.
After the code is merged, nothing except the files listed here needs to be kept; the rest is no longer necessary.
Yes exactly, you are correct concerning the files! However, you can still use AutoModelForCausalLM and AutoTokenizer to load everything; they will automatically point to the correct classes (if you check config.json and tokenizer_config.json you will see that they have a field pointing to the correct class 🤗). They are used to load every model/tokenizer in the same way, independently of the architecture!
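For illustration, a minimal sketch of that Auto-class loading (model path and dtype taken from the examples earlier in this thread; the class resolution happens through the fields mentioned above):

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# The Auto classes read config.json / tokenizer_config.json and dispatch to the
# native GLM model class and the fast tokenizer, so no custom code is needed.
tokenizer = AutoTokenizer.from_pretrained('glm-4-9b-chat')
model = AutoModelForCausalLM.from_pretrained('glm-4-9b-chat', torch_dtype=torch.float16)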
Yes, I saw it; this solution is compatible with my original implementation!
@cyrilvallez @zRzRzRzRzRzRzR Hi both, since the change has been merged into transformers, how can we use the config in this PR to load the native HF version of GLM with from_pretrained?
I noticed that version 4.46 of transformers has not been released yet. If I directly overwrite the repository with the new version, will it prevent the old version of transformers (4.44) from being used? I haven't confirmed this with testing.
@zRzRzRzRzRzRzR Thanks for the prompt reply! But I wonder, can we use what @cyrilvallez has made in this repo with from_pretrained? Just for test purposes.
Ah I see, I can use revision in from_pretrained: https://huggingface.co./docs/transformers/v4.45.2/en/main_classes/model#transformers.PreTrainedModel.from_pretrained.revision
Using the branch provided by @cyrilvallez works fine with version 4.46; I will go to the office tomorrow for final confirmation.
We found that this code cannot run properly below version 4.45.2, so we are preparing to create a new repository specifically for your version. This new repository will require users to use version 4.46 or above of transformers for inference. Considering that a very large number of open-source frameworks are not yet compatible with transformers 4.46, we have decided not to make any changes to the main branch for now.
@zRzRzRzRzRzRzR That will be great. Actually, you could create a branch in this repo, like hf4.46, so we can avoid the potential confusion that another repo might lead to.
Yes, I proposed two options to my colleagues: creating a new branch and creating a new repository. We ultimately decided to use a new repository, named glm-4-9b-hf. This old repository will cease maintenance and will carry a prominent notice suggesting the switch to the new repository.
@zRzRzRzRzRzRzR Please add the prominent notice to this repo so users are correctly routed to the new -hf models using the latest hf transformers. This is getting too confusing for the average user.
I think this is necessary. I am on a business trip abroad and will update it as soon as I can get to a computer after returning home (on the 20th).
Due to some issues my schedule was delayed; I have now added the notice on the repo homepage and in the discussions.