Some instructions regarding fine-tuning & classification with the ESM++ model
Dear contributors and developers, @lhallee
Thank you for the important and helpful work you are doing by making protein LLMs more accessible to the community!
I am trying to follow the code in modeling_esm_plusplus.py in order to perform fine-tuning and downstream protein-level classification for my specific use case.
I am using the ESMplusplus_600M() function for embeddings (I also tried AutoModelForMaskedLM.from_pretrained('Synthyra/ESMplusplus_large')) and ESMplusplusForSequenceClassification.from_pretrained_esm("600") for the classification head, with LoRA applied in between.
I have already tried several ways to embed my dataset correctly (e.g. model.embed_dataset()), and also to pass just the sequences + labels (input_ids, ...) via a Dataset object directly to the Trainer together with a tokenizer object.
But nothing seems to work: training either does not start at all or crashes after one epoch due to inconsistencies in shapes/dimensions and batches between the model inputs and outputs.
I would be very grateful for any help/advice/guidance; of course I can provide the specific code I used and the errors.
Many thanks,
Dani
No problem. Probably GitHub is better.
Sure, thank you!
Here is the link to the notebook:
https://github.com/VadimDu/Protein_LLM_modeling/blob/main/clean_ver_Modeling_ESM_plusplus.ipynb
I basically copied all the code from modeling_esm_plusplus.py there and added my data and the steps towards fine-tuning the classification model on top of it.
The part I added starts at the cell named "My protein input data".
In the current trial I commented out data_collator and tokenizer from the Trainer and used the defaults, relying on the tokenizer implemented in the ESMplusplus_600M() function (class ESMplusplusForMaskedLM(), self.tokenizer = EsmSequenceTokenizer()).
Any help will be much appreciated!
Dani
Hi @lhallee ,
Thanks again for your reply. Could you please clarify regarding AutoModelForSequenceClassification? I could not find such a class/method in the code of yours that I am using.
Regarding the errors:
- If I run the preprocessing and fine-tuning steps exactly as in the notebook (model_embedding = ESMplusplus_600M(num_labels=3), .embed_dataset(), model_classification = ESMplusplusForSequenceClassification.from_pretrained_esm("600"), and no explicit custom data_collator or tokenizer given to Trainer()), this is the error:
ValueError Traceback (most recent call last)
<ipython-input-31-3435b262f1ae> in <cell line: 0>()
----> 1 trainer.train()
13 frames
<ipython-input-2-bf6a22e3339a> in forward(self, x, attention_mask, output_hidden_states, output_attentions)
441 TransformerOutput containing last hidden state and optionally all hidden states and attention weights
442 """
--> 443 batch_size, seq_len, _ = x.shape
444 hidden_states = () if output_hidden_states else None
445 attentions = () if output_attentions else None
ValueError: not enough values to unpack (expected 3, got 2)
- If I add a class CustomDataCollator to define a data_collator that converts my input_embeds from a 2-dimensional tensor to shape torch.Size([num_of_sequences, 1, 1152]), then one training epoch finishes OK and it then crashes at the start of epoch 2:
Could not estimate the number of tokens of the input, floating-point operations will not be computed
[ 4/20 00:00 < 00:03, 4.01 it/s, Epoch 1/10]
Epoch Training Loss Validation Loss
[2/2 00:00]
Downloading builder script: 100%
4.20k/4.20k [00:00<00:00, 506kB/s]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-48-3435b262f1ae> in <cell line: 0>()
----> 1 trainer.train()
9 frames
/usr/local/lib/python3.11/dist-packages/numpy/core/fromnumeric.py in _wrapit(obj, method, *args, **kwds)
43 except AttributeError:
44 wrap = None
---> 45 result = getattr(asarray(obj), method)(*args, **kwds)
46 if wrap:
47 if not isinstance(result, mu.ndarray):
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 3 dimensions. The detected shape was (2, 20, 1) + inhomogeneous part.
I used 20 sequences just as an example for training.
- If I use the commented-out cell (#@title ESM++ for protein embeddings using a pre-trained model from Synthyra) for sequence Dataset creation and the tokenizer, with AutoModelForMaskedLM.from_pretrained('Synthyra/ESMplusplus_large'), and supply Trainer() with a tokenizer, I get this error:
ValueError Traceback (most recent call last)
<ipython-input-28-5075ee0329cb> in <cell line: 0>()
11
12 # Train the model
---> 13 trainer.train()
14 frames
/usr/local/lib/python3.11/dist-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction, label_smoothing)
3477 if size_average is not None or reduce is not None:
3478 reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 3479 return torch._C._nn.cross_entropy_loss(
3480 input,
3481 target,
ValueError: Expected input batch_size (2200) to match target batch_size (4).
I hope this information is helpful; many thanks again for your efforts.
Dani
Gotcha. So, a few things. If you want to fine-tune a model for sequence classification you do not need to pre-embed the sequences; you just need to feed the input_ids and attention_mask via the data collator. You can load the model without copying the implementation anywhere by doing this:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained('Synthyra/ESMplusplus_large', trust_remote_code=True)
From here you can apply LoRA if you'd like.
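For example, with the peft library that might look like this (a minimal sketch; the target_modules entry is an assumption, so inspect model.named_modules() to pick the layers you actually want to adapt):

from peft import LoraConfig, TaskType, get_peft_model

# Sketch: r/alpha/dropout are illustrative values, and "out_proj" is an assumed
# module-name suffix; check model.named_modules() for the real ESM++ layer names.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_CLS,
    target_modules=["out_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # shows how few parameters are actually trained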
If instead you just want to train a model on the model's vector embeddings, you can embed the sequences like you had been doing and train a small neural network on top of them.
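For that route, here is a minimal sketch of a small classification head on pre-computed embeddings (the hidden size and number of classes are placeholders, and I'm assuming you already have an embeddings tensor plus integer labels, e.g. from embed_dataset):

import torch
import torch.nn as nn

hidden_dim, num_labels = 1152, 3  # placeholders: ESM++ 600M hidden size, 3-class task

head = nn.Sequential(
    nn.Linear(hidden_dim, 256),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(256, num_labels),
)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(embeddings: torch.Tensor, labels: torch.Tensor) -> float:
    # embeddings: (batch, hidden_dim) float tensor, labels: (batch,) long tensor
    optimizer.zero_grad()
    loss = loss_fn(head(embeddings), labels)
    loss.backward()
    optimizer.step()
    return loss.item()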
Does that make sense?
Here's an example of a collator we use for input_ids and labels. Trainer automatically unpacks a dictionary sent to the model, so everything in "batch" here will go to the right place:
import torch

def string_labels_collator_builder(tokenizer, **kwargs):
    def _collate_fn(batch):
        seqs = [ex[0] for ex in batch]
        labels = torch.stack([torch.tensor(ex[1]) for ex in batch])
        batch = tokenizer(seqs,
                          padding='longest',
                          pad_to_multiple_of=8,
                          truncation=False,
                          return_tensors='pt',
                          add_special_tokens=True)
        batch['labels'] = labels
        return batch
    return _collate_fn
tokenizer = model.tokenizer
data_collator = string_labels_collator_builder(tokenizer)
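As a quick sanity check, you can call the collator directly on a toy batch (the sequences here are made up):

toy_batch = [("MKTAYIAKQRQISFVK", 0), ("MVLSPADKTNVKAAWG", 2)]
out = data_collator(toy_batch)
print(out["input_ids"].shape, out["attention_mask"].shape, out["labels"])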
This expects a PyTorch dataset class that will output a tuple of sequences and the labels you are interested in. A class that might link up with your current workflow looks something like this
from torch.utils.data import Dataset as TorchDataset

class StringLabelDatasetFromHF(TorchDataset):
    def __init__(self, hf_dataset, col_name='seqs', label_col='labels', **kwargs):
        self.seqs = hf_dataset[col_name]
        self.labels = hf_dataset[label_col]
        self.lengths = [len(seq) for seq in self.seqs]

    def avg(self):
        return sum(self.lengths) / len(self.lengths)

    def __len__(self):
        return len(self.seqs)

    def __getitem__(self, idx):
        seq = self.seqs[idx]
        label = self.labels[idx]
        return seq, label
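Wiring it all into the Trainer then looks roughly like this (the dataset variable names are placeholders for whatever your train/validation splits are called):

from transformers import Trainer, TrainingArguments

train_ds = StringLabelDatasetFromHF(hf_train)   # assumption: splits with 'seqs'/'labels' columns
valid_ds = StringLabelDatasetFromHF(hf_valid)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./results", per_device_train_batch_size=4),
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    data_collator=data_collator,
)
trainer.train()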
Does this help? If you try something new and get a new error please send along.
Hi @lhallee ,
Many thanks again for your help!
I have implemented the collator and PyTorch dataset as you suggested and used AutoModelForSequenceClassification, but unfortunately, after the first training epoch finished, it crashed with an error similar to the one I had before.
Below I paste all the relevant code from the start up to training; maybe you can spot some inconsistency there:
from torch.utils.data import Dataset as TorchDataset
from transformers import AutoModelForSequenceClassification, AutoConfig

config = AutoConfig.from_pretrained('Synthyra/ESMplusplus_large', trust_remote_code=True, num_labels=3)
model_classification = AutoModelForSequenceClassification.from_pretrained('Synthyra/ESMplusplus_large', trust_remote_code=True, config=config)
tokenizer = model_classification.tokenizer

# Move models to GPU and keep them in float32
model_classification = model_classification.to(device)  # Remove .half()

def string_labels_collator_builder(tokenizer, **kwargs):
    def _collate_fn(batch):
        seqs = [ex[0] for ex in batch]
        labels = torch.stack([torch.tensor(ex[1]) for ex in batch])
        batch = tokenizer(seqs,
                          padding='longest',
                          truncation=False,
                          return_tensors='pt',
                          add_special_tokens=True)
        batch['labels'] = labels
        return batch
    return _collate_fn

class StringLabelDatasetFromHF(TorchDataset):
    '''The design pattern of the code uses the PyTorch Dataset class for accessing the sequences and labels during the training loop.'''
    def __init__(self, hf_dataset, col_name='sequence', label_col='label', **kwargs):
        self.seqs = hf_dataset[col_name].to_numpy()  # Convert to NumPy array
        self.labels = hf_dataset[label_col].to_numpy()  # Convert to NumPy array
        self.lengths = [len(seq) for seq in self.seqs]

    def avg(self):
        return sum(self.lengths) / len(self.lengths)

    def __len__(self):
        return len(self.seqs)

    def __getitem__(self, idx):
        seq = self.seqs[idx]
        label = self.labels[idx]
        return seq, label

torchdataset_my_train = StringLabelDatasetFromHF(my_train)
torchdataset_my_valid = StringLabelDatasetFromHF(my_valid)
torchdataset_my_test = StringLabelDatasetFromHF(my_test)

data_collator = string_labels_collator_builder(tokenizer)

# LoRA fine-tuning
# Define the regex pattern to match desired layers (excluding LayerNorm - ffn.0)
pattern = r"transformer\.blocks\.\d+\.(attn\.layernorm_qkv\.1|attn\.out_proj|ffn\.[13])"
target_modules = [
    name
    for name, module in model_classification.named_modules()  # iterate through all modules and their names
    if re.fullmatch(pattern, name)
]
print(f'Target modules for LoRA: {target_modules}')

lora_config = LoraConfig(
    r=4,                        # Rank of the LoRA update matrices
    lora_alpha=32,              # Scaling factor for the LoRA update matrices
    lora_dropout=0.05,          # Dropout probability for the LoRA update matrices
    bias="none",                # Whether to apply bias to the LoRA update matrices
    task_type=TaskType.SEQ_CLS, # Task type for sequence classification
    target_modules=target_modules,  # Modules which the LoRA method should target and modify
)
model = get_peft_model(model_classification, lora_config)
# Print the number of trainable parameters in the LoRA-adapted model
model.print_trainable_parameters()

# Define Huggingface Trainer arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-4,
    # effective training batch size is batch * accum
    # we recommend an effective training batch size of 8
    per_device_train_batch_size=4,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,
    num_train_epochs=10,
    weight_decay=0.01,
    load_best_model_at_end=True,
    #deepspeed= ds_config if deepspeed else None,
    fp16=False,
    gradient_checkpointing=False,
)

# Metric definition for validation data
def compute_metrics(eval_pred, num_labels=3):
    if num_labels > 1:  # for classification
        metric = load("accuracy")
        predictions, labels = eval_pred
        predictions = np.argmax(predictions, axis=1)
    else:  # for regression
        metric = load("spearmanr")
        predictions, labels = eval_pred
    return metric.compute(predictions=predictions, references=labels)

# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=torchdataset_my_train,
    eval_dataset=torchdataset_my_valid,
    data_collator=data_collator,  # the custom data collator
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()
This is the error I got:
ValueError Traceback (most recent call last)
<ipython-input-28-3435b262f1ae> in <cell line: 0>()
----> 1 trainer.train()
9 frames
/usr/local/lib/python3.11/dist-packages/numpy/core/fromnumeric.py in _wrapit(obj, method, *args, **kwds)
43 except AttributeError:
44 wrap = None
---> 45 result = getattr(asarray(obj), method)(*args, **kwds)
46 if wrap:
47 if not isinstance(result, mu.ndarray):
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (2, 501) + inhomogeneous part.
I am sorry that we still couldn't resolve the issue! Maybe I am missing something basic or critical, I'm still new to LLMs / Hugging Face api in general.
I can send you a small sample of my data so that you can try yourself if that's OK with you.
Thank you
Dani
Is it happening at exactly one epoch? This could be an error from evaluation, likely happening in compute_metrics. I would write a separate one for regression or classification based on your needs, and pass the correct one when needed. The only argument for compute_metrics should be an EvalPrediction. You can type hint it like this:
import numpy as np
from scipy.stats import spearmanr, pearsonr
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from transformers import EvalPrediction

def compute_metrics(p: EvalPrediction):
    preds, labels = p.predictions, p.label_ids
    # if preds or labels is a tuple you usually need to take the 0th index, I usually add an if statement for this
    # etc.

# For example
def compute_metrics_regression(p: EvalPrediction):
    """
    Compute various regression metrics for model evaluation.

    Args:
        p (EvalPrediction): An object containing predictions and label ids.

    Returns:
        dict: A dictionary containing the following metrics:
            - r_squared: Coefficient of determination
            - spearman_rho: Spearman's rank correlation coefficient
            - spear_pval: p-value for Spearman's correlation
            - pearson_rho: Pearson correlation coefficient
            - pear_pval: p-value for Pearson's correlation
            - mse: Mean Squared Error
            - mae: Mean Absolute Error
            - rmse: Root Mean Squared Error
    """
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    labels = p.label_ids[0] if isinstance(p.label_ids, tuple) else p.label_ids
    logits = np.array(preds).flatten()
    labels = np.array(labels).flatten()

    r2 = r2_score(labels, logits)
    spearman_rho, spear_pval = spearmanr(logits, labels)
    pearson_rho, pear_pval = pearsonr(logits, labels)
    mse = mean_squared_error(labels, logits)
    mae = mean_absolute_error(labels, logits)
    rmse = np.sqrt(mse)

    return {
        'r_squared': round(r2, 5),
        'spearman_rho': round(spearman_rho, 5),
        'spear_pval': round(spear_pval, 5),
        'pearson_rho': round(pearson_rho, 5),
        'pear_pval': round(pear_pval, 5),
        'mse': round(mse, 5),
        'mae': round(mae, 5),
        'rmse': round(rmse, 5),
    }
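A classification counterpart could look like this (a sketch for a setup like your 3-class task; swap in whichever metrics you care about):

from sklearn.metrics import accuracy_score, f1_score

def compute_metrics_classification(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    labels = p.label_ids[0] if isinstance(p.label_ids, tuple) else p.label_ids
    preds = np.argmax(preds, axis=-1)
    return {
        'accuracy': round(accuracy_score(labels, preds), 5),
        'f1_macro': round(f1_score(labels, preds, average='macro'), 5),
    }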
I also don't think you need the .to_numpy() in your dataset class; that shouldn't even be able to run on a list of strings.
I would be happy to look at a small sample of your data; one or a couple of example lines are fine if it is sensitive (you can change the column names too). I can just copy what you send several times if I need more samples. Also, if you could send the full traceback I may be able to debug a bit better. Sometimes an IDE will not show you the whole thing, and I don't think it did here. Not sure how to fix that, though.
It's great that you are new to LLMs and Huggingface! Welcome to the ecosystem. There is definitely a learning curve but once it clicks it is a fantastic resource for research. Don't get discouraged!
Best,
Logan
Dear Logan,
Many thanks for your inputs and encouragement!
I managed to solve the problem; it was indeed, as you pointed out, a problem in compute_metrics, which is called only after the first epoch. The returned object was a tuple and I needed to index it correctly to retrieve the logits.
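For reference, the fix was roughly along these lines (a simplified sketch, not my exact code):

import numpy as np
from evaluate import load
from transformers import EvalPrediction

def compute_metrics(p: EvalPrediction):
    # the predictions came back as a tuple, so the logits are at index 0
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.argmax(preds, axis=-1)
    return load("accuracy").compute(predictions=preds, references=p.label_ids)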
Regarding .to_numpy(): I added it because my input in this case was a DataFrame, which gets converted to a Series inside StringLabelDatasetFromHF() and thus couldn't be indexed with [ ].
Now the training finished successfully :-)
Previously I already fine-tuned a transformer T5 model, ProtT5-XL from ProtTrans. It actually worked great on a relatively simple protein function classification task.
There I used only the encoder part (in half precision), and it was definitely enough while minimizing the resources and time spent.
Do you know whether it is possible to run the current ESM-C 600M-parameter model in half precision as well?
In any case, I wanted to try this new model, as it is supposed to be a SOTA design and its training data is supposed to be much more extensive and varied (UniRef, MGnify and JGI, while ProtT5-XL was trained only on UniRef50).
Again many thanks for your time!
Best regards
Dani
No problem, glad it seems to be working!
You can absolutely run the current ESMC models in half precision, but training in half precision can be much less stable. We find that float16 inference costs almost nothing in performance, but full half-precision training can be tricky. You can try mixed-precision training with the Hugging Face Trainer, which should give you a good speed-up and memory reduction with only a tiny cost in performance.
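As a sketch (assuming a CUDA GPU; bf16 only works on hardware that supports it), half-precision inference and mixed-precision training could look like this:

from transformers import AutoModelForSequenceClassification, TrainingArguments

# Half precision for inference: cast the whole model to float16 and only run forward passes.
model = AutoModelForSequenceClassification.from_pretrained(
    'Synthyra/ESMplusplus_large', trust_remote_code=True
).half().eval().cuda()

# Mixed precision for training: keep the weights in float32 and let the Trainer
# autocast the forward/backward passes.
training_args = TrainingArguments(
    output_dir="./results",
    fp16=True,   # or bf16=True on GPUs that support bfloat16
    per_device_train_batch_size=4,
)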
If you are interested in ProtT5-like models, the ANKH series has the same architecture and is better in about every way. Synthyra offers versions of the encoder-only weights in this collection; just look for the ANKH models.
Yeah, it's hard to tell where the metagenomic data will help and where it will hinder. Models trained on older UniRef versions, like ANKH and ESM2, are still just as good or better in many scenarios.
If you have any other questions feel free to ask here; if not, kindly close the issue. Thanks for using our Hugging Face model versions, and keep an eye out for our own product releases soon! We will have a variety of protein annotation systems hitting the market this year.