Suggestions for a new DataCollator?

#114
by jinbo1129 - opened

Thanks for this great tool !!!
I am trying a new fine-tuning task: given a cell's expression, predict multiple labels.
The predicted labels have a fixed length; for example, each cell has 23 labels:
[0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

After preparing my dataset, I tried to run it just like the gene classification notebook:

trainer = Trainer(
    model=model,
    args=training_args_init,
    data_collator=DataCollatorForGeneClassification(),
    train_dataset=labeled_train_split,
    eval_dataset=labeled_eval_split
)
trainer.train()

This raises an error: ValueError: expected sequence of length 23 at dim 1 (got 30)

I noticed that this error happens at the second element in the batch: since
the input_ids of the second element are padded with 7 additional '0's,
the 'labels' of the second element also get 7 additional '-100's.

So if I could pad only the 'input_ids' feature and not 'labels', I think that would solve the problem. Is that right? Could you please give me some suggestions?
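To make the idea above concrete, here is a minimal sketch of a collate function that pads only 'input_ids' (and builds a matching attention mask) while leaving the fixed-length 'labels' untouched. The function name `collate_pad_inputs_only` is hypothetical and not part of Geneformer; the pad value 0 is taken from the padding described above. It uses plain Python lists to stay dependency-free; for use with Trainer the lists would still need to be converted to tensors.

```python
def collate_pad_inputs_only(features, pad_token_id=0):
    """Pad input_ids to the longest example in the batch; keep labels as-is.

    Hypothetical helper for illustration only, not a Geneformer API.
    """
    label_lengths = {len(f["labels"]) for f in features}
    if len(label_lengths) != 1:
        raise ValueError("all label vectors must have the same fixed length")

    max_len = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        ids = list(f["input_ids"])
        n_pad = max_len - len(ids)
        # Pad the token ids and mark padded positions with 0 in the mask.
        batch["input_ids"].append(ids + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        # Labels are per cell, not per token, so no -100 padding is added.
        batch["labels"].append(list(f["labels"]))
    return batch


# Toy batch: two cells of different lengths, each with 3 fixed labels.
features = [
    {"input_ids": [5, 9, 12, 7], "labels": [0, 1, 1]},
    {"input_ids": [3, 8], "labels": [1, 0, 0]},
]
batch = collate_pad_inputs_only(features)
```

In this toy batch the second cell's input_ids grow from 2 to 4 entries, while both label vectors keep their length of 3.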

Thanks !!!

Thank you for your question and interest in Geneformer! Please note that if you are classifying cells, you should use the DataCollatorForCellClassification and model your approach on the example notebook for cell classification, or on the disease classification example script, which includes hyperparameter optimization that we highly recommend. The number of cells must match the number of labels so that each cell has a label, which is why they are padded together.

Geneformer supports multiclass classification (as in the cell annotation application). However, I believe you are instead working on a problem that requires multi-task classification (multiple modes per cell). We are working on developing a framework for this, but in the meantime you could check out some of the examples of multi-task classification for NLP applications. Geneformer is integrated with Huggingface, so you could use any of the Huggingface examples for sentence classification for your cell classification, and any of the examples for token classification for gene classification. For example:
https://colab.research.google.com/github/zphang/zphang.github.io/blob/master/files/notebooks/Multi_task_Training_with_Transformers_NLP.ipynb
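As a rough illustration of what predicting several labels per cell involves, the sketch below treats each of the labels as an independent binary decision and scores them with a sigmoid plus binary cross-entropy. This framing is an assumption made for illustration; it is not necessarily the multi-task framework mentioned above, and the function names are hypothetical. The logits stand in for the output of a classification head with one output per label.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def multilabel_bce(logits, targets):
    """Mean binary cross-entropy over independent labels for one cell."""
    if len(logits) != len(targets):
        raise ValueError("need one logit per label")
    total = 0.0
    for z, y in zip(logits, targets):
        # Stable form of -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(z).
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def predict_labels(logits, threshold=0.5):
    """Turn per-label logits into 0/1 predictions."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]


# Toy example with 3 labels instead of 23.
logits = [2.0, -2.0, 0.0]
targets = [1, 0, 0]
loss = multilabel_bce(logits, targets)
preds = predict_labels(logits)
```

Because each label gets its own sigmoid, any number of the labels can be 1 at once, unlike a softmax over classes where exactly one class is chosen.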

ctheodoris changed discussion status to closed

thanks very much !!! It is a great tool !!!
