Some instructions regarding fine-tuning & classification with the ESM++ model
Dear contributors and developers, @lhallee
Thank you for the important and helpful work you are doing by making protein LLMs more accessible to the community!
I am trying to follow the code in modeling_esm_plusplus.py in order to perform fine-tuning and downstream protein-level classification for my specific use case.
I am using the ESMplusplus_600M() function for embeddings (I also tried AutoModelForMaskedLM.from_pretrained('Synthyra/ESMplusplus_large')) and ESMplusplusForSequenceClassification.from_pretrained_esm("600") for the classification head, with LoRA applied in between.
I have already tried several ways to embed my dataset correctly (e.g. model.embed_dataset()), and also to pass just the sequences + labels (input_ids, ...) via a Dataset object directly to the Trainer together with a tokenizer object.
But nothing seems to work: training either does not start at all or crashes after one epoch due to inconsistencies in shapes/dimensions and batches between the model inputs and outputs.
I would be very grateful for any help/advice/guidance; of course I can provide the specific code I used and the errors.
Many thanks,
Dani
No problem. Probably GitHub is better.
Sure, thank you!
Here is the link to the notebook:
https://github.com/VadimDu/Protein_LLM_modeling/blob/main/clean_ver_Modeling_ESM_plusplus.ipynb
I basically copied all the code from modeling_esm_plusplus.py there and added my data and the steps towards fine-tuning the classification model on top of it.
The part I added starts at the cell named "My protein input data".
In the current trial I commented out data_collator and tokenizer from the Trainer and used the defaults, relying on the tokenizer implemented in the ESMplusplus_600M() function (class ESMplusplusForMaskedLM(), self.tokenizer = EsmSequenceTokenizer()).
Any help will be much appreciated!
Dani
Hi @lhallee ,
Thanks again for your reply. Could you please clarify regarding AutoModelForSequenceClassification? I could not find such a class/method in the code of yours that I am using.
Regarding the errors:
- If I run the preprocessing and fine-tuning steps exactly as in the notebook (model_embedding = ESMplusplus_600M(num_labels=3), .embed_dataset(), model_classification = ESMplusplusForSequenceClassification.from_pretrained_esm("600"), and no explicit custom data_collator or tokenizer given to Trainer()), this is the error:
ValueError Traceback (most recent call last)
<ipython-input-31-3435b262f1ae> in <cell line: 0>()
----> 1 trainer.train()
13 frames
<ipython-input-2-bf6a22e3339a> in forward(self, x, attention_mask, output_hidden_states, output_attentions)
441 TransformerOutput containing last hidden state and optionally all hidden states and attention weights
442 """
--> 443 batch_size, seq_len, _ = x.shape
444 hidden_states = () if output_hidden_states else None
445 attentions = () if output_attentions else None
ValueError: not enough values to unpack (expected 3, got 2)
- If I add a class CustomDataCollator to define a data_collator that converts my input_embeds from a 2-dimensional tensor to shape torch.Size([num_of_sequences, 1, 1152]), then one training epoch finishes OK and it then crashes at the start of epoch 2:
Could not estimate the number of tokens of the input, floating-point operations will not be computed
[ 4/20 00:00 < 00:03, 4.01 it/s, Epoch 1/10]
Epoch Training Loss Validation Loss
[2/2 00:00]
Downloading builder script: 100%
4.20k/4.20k [00:00<00:00, 506kB/s]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-48-3435b262f1ae> in <cell line: 0>()
----> 1 trainer.train()
9 frames
/usr/local/lib/python3.11/dist-packages/numpy/core/fromnumeric.py in _wrapit(obj, method, *args, **kwds)
43 except AttributeError:
44 wrap = None
---> 45 result = getattr(asarray(obj), method)(*args, **kwds)
46 if wrap:
47 if not isinstance(result, mu.ndarray):
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 3 dimensions. The detected shape was (2, 20, 1) + inhomogeneous part.
I used 20 sequences just as an example for training.
- If I use the commented-out cell (#@title ESM++ for protein embeddings using a pre-trained model from Synthyra) for sequence Dataset creation and the tokenizer, with AutoModelForMaskedLM.from_pretrained('Synthyra/ESMplusplus_large'), and supply Trainer() with a tokenizer, I get this error:
ValueError Traceback (most recent call last)
<ipython-input-28-5075ee0329cb> in <cell line: 0>()
11
12 # Train the model
---> 13 trainer.train()
14 frames
/usr/local/lib/python3.11/dist-packages/torch/nn/functional.py in cross_entropy(input, target, weight, size_average, ignore_index, reduce, reduction, label_smoothing)
3477 if size_average is not None or reduce is not None:
3478 reduction = _Reduction.legacy_get_string(size_average, reduce)
-> 3479 return torch._C._nn.cross_entropy_loss(
3480 input,
3481 target,
ValueError: Expected input batch_size (2200) to match target batch_size (4).
I hope this information is helpful; many thanks again for your efforts.
Dani
Gotcha. So, a few things. If you want to fine-tune a model for sequence classification you do not need to pre-embed the sequences; you just need to feed the input_ids and attention_mask via the data collator. You can load the model without copying the implementation anywhere by doing this:
from transformers import AutoModelForSequenceClassification
model = AutoModelForSequenceClassification.from_pretrained('Synthyra/ESMplusplus_large', trust_remote_code=True)
From here you can apply LoRA if you'd like.
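For example, with the peft library that might look like this (a minimal sketch; the target_modules entry is an assumption, so inspect model.named_modules() to pick the layers you actually want to adapt):

from peft import LoraConfig, TaskType, get_peft_model

# Sketch: r/alpha/dropout are illustrative values, and "out_proj" is an assumed
# module-name suffix; check model.named_modules() for the real ESM++ layer names.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_CLS,
    target_modules=["out_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # shows how few parameters are actually trained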
If instead you just want to train a model on the model's vector embeddings, you can embed the sequences like you had been doing and train a small neural network on top of them.
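For that route, here is a minimal sketch of a small classification head on pre-computed embeddings (the hidden size and number of classes are placeholders, and I'm assuming you already have an embeddings tensor plus integer labels, e.g. from embed_dataset):

import torch
import torch.nn as nn

hidden_dim, num_labels = 1152, 3  # placeholders: ESM++ 600M hidden size, 3-class task

head = nn.Sequential(
    nn.Linear(hidden_dim, 256),
    nn.ReLU(),
    nn.Dropout(0.1),
    nn.Linear(256, num_labels),
)
optimizer = torch.optim.AdamW(head.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(embeddings: torch.Tensor, labels: torch.Tensor) -> float:
    # embeddings: (batch, hidden_dim) float tensor, labels: (batch,) long tensor
    optimizer.zero_grad()
    loss = loss_fn(head(embeddings), labels)
    loss.backward()
    optimizer.step()
    return loss.item()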
Does that make sense?
Here's an example of a collator we use for input_ids and labels. Trainer automatically unpacks a dictionary sent to the model, so everything in "batch" here will go to the right place:
import torch

def string_labels_collator_builder(tokenizer, **kwargs):
    def _collate_fn(batch):
        seqs = [ex[0] for ex in batch]
        labels = torch.stack([torch.tensor(ex[1]) for ex in batch])
        batch = tokenizer(seqs,
                          padding='longest',
                          pad_to_multiple_of=8,
                          truncation=False,
                          return_tensors='pt',
                          add_special_tokens=True)
        batch['labels'] = labels
        return batch
    return _collate_fn
tokenizer = model.tokenizer
data_collator = string_labels_collator_builder(tokenizer)
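As a quick sanity check, you can call the collator directly on a toy batch (the sequences here are made up):

toy_batch = [("MKTAYIAKQRQISFVK", 0), ("MVLSPADKTNVKAAWG", 2)]
out = data_collator(toy_batch)
print(out["input_ids"].shape, out["attention_mask"].shape, out["labels"])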
This expects a PyTorch dataset class that will output a tuple of sequences and the labels you are interested in. A class that might link up with your current workflow looks something like this
from torch.utils.data import Dataset as TorchDataset

class StringLabelDatasetFromHF(TorchDataset):
    def __init__(self, hf_dataset, col_name='seqs', label_col='labels', **kwargs):
        self.seqs = hf_dataset[col_name]
        self.labels = hf_dataset[label_col]
        self.lengths = [len(seq) for seq in self.seqs]

    def avg(self):
        return sum(self.lengths) / len(self.lengths)

    def __len__(self):
        return len(self.seqs)

    def __getitem__(self, idx):
        seq = self.seqs[idx]
        label = self.labels[idx]
        return seq, label
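Wiring it all into the Trainer then looks roughly like this (the dataset variable names are placeholders for whatever your train/validation splits are called):

from transformers import Trainer, TrainingArguments

train_ds = StringLabelDatasetFromHF(hf_train)   # assumption: splits with 'seqs'/'labels' columns
valid_ds = StringLabelDatasetFromHF(hf_valid)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./results", per_device_train_batch_size=4),
    train_dataset=train_ds,
    eval_dataset=valid_ds,
    data_collator=data_collator,
)
trainer.train()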
Does this help? If you try something new and get a new error please send along.
Hi @lhallee ,
Many thanks again for your help!
I have implemented the collator and PyTorch dataset as you suggested and used AutoModelForSequenceClassification, but unfortunately, after the first training epoch finished, it crashed with an error similar to the one I had before.
Below I paste all the relevant code from the start up to training; maybe you can spot some inconsistency there:
from torch.utils.data import Dataset as TorchDataset
from transformers import AutoModelForSequenceClassification, AutoConfig

config = AutoConfig.from_pretrained('Synthyra/ESMplusplus_large', trust_remote_code=True, num_labels=3)
model_classification = AutoModelForSequenceClassification.from_pretrained('Synthyra/ESMplusplus_large', trust_remote_code=True, config=config)
tokenizer = model_classification.tokenizer

# Move models to GPU and keep them in float32
model_classification = model_classification.to(device)  # Remove .half()

def string_labels_collator_builder(tokenizer, **kwargs):
    def _collate_fn(batch):
        seqs = [ex[0] for ex in batch]
        labels = torch.stack([torch.tensor(ex[1]) for ex in batch])
        batch = tokenizer(seqs,
                          padding='longest',
                          truncation=False,
                          return_tensors='pt',
                          add_special_tokens=True)
        batch['labels'] = labels
        return batch
    return _collate_fn

class StringLabelDatasetFromHF(TorchDataset):
    '''The design pattern of the code uses the PyTorch Dataset class for accessing the sequences and labels during the training loop.'''
    def __init__(self, hf_dataset, col_name='sequence', label_col='label', **kwargs):
        self.seqs = hf_dataset[col_name].to_numpy()  # Convert to NumPy array
        self.labels = hf_dataset[label_col].to_numpy()  # Convert to NumPy array
        self.lengths = [len(seq) for seq in self.seqs]

    def avg(self):
        return sum(self.lengths) / len(self.lengths)

    def __len__(self):
        return len(self.seqs)

    def __getitem__(self, idx):
        seq = self.seqs[idx]
        label = self.labels[idx]
        return seq, label

torchdataset_my_train = StringLabelDatasetFromHF(my_train)
torchdataset_my_valid = StringLabelDatasetFromHF(my_valid)
torchdataset_my_test = StringLabelDatasetFromHF(my_test)

data_collator = string_labels_collator_builder(tokenizer)

# LoRA fine-tuning
# Define the regex pattern to match desired layers (excluding LayerNorm - ffn.0)
pattern = r"transformer\.blocks\.\d+\.(attn\.layernorm_qkv\.1|attn\.out_proj|ffn\.[13])"
target_modules = [
    name
    for name, module in model_classification.named_modules()  # iterate through all modules and their names
    if re.fullmatch(pattern, name)
]
print(f'Target modules for LoRA: {target_modules}')

lora_config = LoraConfig(
    r=4,                        # Rank of the LoRA update matrices
    lora_alpha=32,              # Scaling factor for the LoRA update matrices
    lora_dropout=0.05,          # Dropout probability for the LoRA update matrices
    bias="none",                # Whether to apply bias to the LoRA update matrices
    task_type=TaskType.SEQ_CLS, # Task type for sequence classification
    target_modules=target_modules,  # Modules which the LoRA method should target and modify
)
model = get_peft_model(model_classification, lora_config)
# Print the number of trainable parameters in the LoRA-adapted model
model.print_trainable_parameters()

# Define Huggingface Trainer arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    logging_strategy="epoch",
    save_strategy="epoch",
    learning_rate=3e-4,
    # effective training batch size is batch * accum
    # we recommend an effective training batch size of 8
    per_device_train_batch_size=4,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=2,
    num_train_epochs=10,
    weight_decay=0.01,
    load_best_model_at_end=True,
    #deepspeed= ds_config if deepspeed else None,
    fp16=False,
    gradient_checkpointing=False,
)

# Metric definition for validation data
def compute_metrics(eval_pred, num_labels=3):
    if num_labels > 1:  # for classification
        metric = load("accuracy")
        predictions, labels = eval_pred
        predictions = np.argmax(predictions, axis=1)
    else:  # for regression
        metric = load("spearmanr")
        predictions, labels = eval_pred
    return metric.compute(predictions=predictions, references=labels)

# Define the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=torchdataset_my_train,
    eval_dataset=torchdataset_my_valid,
    data_collator=data_collator,  # the custom data collator
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()
This is the error I got:
ValueError Traceback (most recent call last)
<ipython-input-28-3435b262f1ae> in <cell line: 0>()
----> 1 trainer.train()
9 frames
/usr/local/lib/python3.11/dist-packages/numpy/core/fromnumeric.py in _wrapit(obj, method, *args, **kwds)
43 except AttributeError:
44 wrap = None
---> 45 result = getattr(asarray(obj), method)(*args, **kwds)
46 if wrap:
47 if not isinstance(result, mu.ndarray):
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 2 dimensions. The detected shape was (2, 501) + inhomogeneous part.
I am sorry that we still couldn't resolve the issue! Maybe I am missing something basic or critical, I'm still new to LLMs / Hugging Face api in general.
I can send you a small sample of my data so that you can try yourself if that's OK with you.
Thank you
Dani
Is it happening at exactly one epoch? This could be an error from evaluation, likely happening in compute_metrics. I would write a separate one for regression or classification based on your needs, and pass the correct one when needed. The only argument for compute_metrics should be an EvalPrediction. You can type hint it like this:
import numpy as np
from scipy.stats import spearmanr, pearsonr
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from transformers import EvalPrediction

def compute_metrics(p: EvalPrediction):
    preds, labels = p.predictions, p.label_ids
    # if preds or labels is a tuple you usually need to take the 0th index, I usually add an if statement for this
    # etc.

# For example
def compute_metrics_regression(p: EvalPrediction):
    """
    Compute various regression metrics for model evaluation.

    Args:
        p (EvalPrediction): An object containing predictions and label ids.

    Returns:
        dict: A dictionary containing the following metrics:
            - r_squared: Coefficient of determination
            - spearman_rho: Spearman's rank correlation coefficient
            - spear_pval: p-value for Spearman's correlation
            - pearson_rho: Pearson correlation coefficient
            - pear_pval: p-value for Pearson's correlation
            - mse: Mean Squared Error
            - mae: Mean Absolute Error
            - rmse: Root Mean Squared Error
    """
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    labels = p.label_ids[0] if isinstance(p.label_ids, tuple) else p.label_ids
    logits = np.array(preds).flatten()
    labels = np.array(labels).flatten()

    r2 = r2_score(labels, logits)
    spearman_rho, spear_pval = spearmanr(logits, labels)
    pearson_rho, pear_pval = pearsonr(logits, labels)
    mse = mean_squared_error(labels, logits)
    mae = mean_absolute_error(labels, logits)
    rmse = np.sqrt(mse)

    return {
        'r_squared': round(r2, 5),
        'spearman_rho': round(spearman_rho, 5),
        'spear_pval': round(spear_pval, 5),
        'pearson_rho': round(pearson_rho, 5),
        'pear_pval': round(pear_pval, 5),
        'mse': round(mse, 5),
        'mae': round(mae, 5),
        'rmse': round(rmse, 5),
    }
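A classification counterpart could look like this (a sketch for a setup like your 3-class task; swap in whichever metrics you care about):

from sklearn.metrics import accuracy_score, f1_score

def compute_metrics_classification(p: EvalPrediction):
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    labels = p.label_ids[0] if isinstance(p.label_ids, tuple) else p.label_ids
    preds = np.argmax(preds, axis=-1)
    return {
        'accuracy': round(accuracy_score(labels, preds), 5),
        'f1_macro': round(f1_score(labels, preds, average='macro'), 5),
    }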
I also don't think you need the .to_numpy() in your dataset class; that shouldn't even be able to run on a list of strings.
I would be happy to look at a small sample of your data; one or a couple of example lines are fine if it is sensitive (you can change the column names too). I can just copy what you send several times if I need more samples. Also, if you could send the full traceback I may be able to debug a bit better. Sometimes an IDE will not show you the whole thing, and I don't think it did here. Not sure how to fix that, though.
It's great that you are new to LLMs and Huggingface! Welcome to the ecosystem. There is definitely a learning curve but once it clicks it is a fantastic resource for research. Don't get discouraged!
Best,
Logan
Dear Logan,
Many thanks for your inputs and encouragement!
I managed to solve the problem; it was indeed, as you pointed out, a problem in compute_metrics, which is called only after the first epoch. The returned object was a tuple and I needed to index it correctly to retrieve the logits.
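For reference, the fix was roughly along these lines (a simplified sketch, not my exact code):

import numpy as np
from evaluate import load
from transformers import EvalPrediction

def compute_metrics(p: EvalPrediction):
    # the predictions came back as a tuple, so the logits are at index 0
    preds = p.predictions[0] if isinstance(p.predictions, tuple) else p.predictions
    preds = np.argmax(preds, axis=-1)
    return load("accuracy").compute(predictions=preds, references=p.label_ids)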
Regarding .to_numpy(): I added it because my input in this case was a DataFrame, which gets converted to a Series inside StringLabelDatasetFromHF() and thus couldn't be indexed with [ ].
Now the training finished successfully :-)
Previously I already fine-tuned a transformer T5 model, ProtT5-XL from ProtTrans. It actually worked great on a relatively simple protein function classification task.
There I used only the encoder part (in half precision), and it was definitely enough while minimizing the resources and time spent.
Do you know whether it is possible to run the current ESM-C 600M-parameter model in half precision as well?
In any case, I wanted to try this new model, as it is supposed to be a SOTA design and its training data is supposed to be much more extensive and varied (UniRef, MGnify and JGI, while ProtT5-XL was trained only on UniRef50).
Again many thanks for your time!
Best regards
Dani
No problem, glad it seems to be working!
You can absolutely run the current ESMC models in half precision, but training in half precision can be much less stable. We find that float16 inference costs almost nothing in performance, but full half-precision training can be tricky. You can try mixed-precision training with the Hugging Face Trainer, which should give you a good speed-up and memory reduction with only a tiny cost in performance.
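As a sketch (assuming a CUDA GPU; bf16 only works on hardware that supports it), half-precision inference and mixed-precision training could look like this:

from transformers import AutoModelForSequenceClassification, TrainingArguments

# Half precision for inference: cast the whole model to float16 and only run forward passes.
model = AutoModelForSequenceClassification.from_pretrained(
    'Synthyra/ESMplusplus_large', trust_remote_code=True
).half().eval().cuda()

# Mixed precision for training: keep the weights in float32 and let the Trainer
# autocast the forward/backward passes.
training_args = TrainingArguments(
    output_dir="./results",
    fp16=True,   # or bf16=True on GPUs that support bfloat16
    per_device_train_batch_size=4,
)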
If you are interested in ProtT5-like models, the ANKH series has the same architecture and is better in about every way. Synthyra offers versions of the encoder-only weights in this collection; just look for the ANKH models.
Yeah, it's hard to tell where the metagenomic data will help and where it will hinder. Models trained on older UniRef versions, like ANKH and ESM2, are still just as good or better in many scenarios.
If you have any other questions feel free to ask here; if not, kindly close the issue. Thanks for using our Hugging Face model versions, and keep an eye out for our own product releases soon! We will have a variety of protein annotation systems hitting the market this year.