Suggestions for how to add celltype percentages in the fine tuning task?

#96
by jinbo1129 - opened

Hello, thanks for this great tool!
I want to set up a fine-tuning task like the "Human PBMC scRNA-seq–based aging clocks" from this paper.
In that paper, the authors use cell type proportions as the key features to build the models.
As I understand it, Geneformer usually works from per-cell gene expression.
Could you please suggest how to incorporate cell type proportions into the fine-tuning task?

Thanks!

Thank you for your interest in Geneformer! The optimal fine-tuning objective depends on your exact question. One way to understand the differences between samples from donors of different ages is to fine-tune the model to classify the age of each cell's donor from the cell's gene expression. To do this, you could follow the disease classification example, but instead of classifying the label of healthy vs. disease, you would classify the label of age. Then, you could probe the embedding space to determine the genes important for each cell state through the in silico perturbation approach as in the example notebook, again substituting age for disease state.
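As a rough illustration of the "classify the label of age" step, the sketch below turns donor ages into integer class labels and attaches one to each cell. It is only a minimal sketch in plain Python: the bin edges, the field names (`cell_id`, `donor`, `label`), and the helper functions are hypothetical and not part of Geneformer. The resulting labels would then play the role that the healthy vs. disease labels play in the disease classification example.

```python
from bisect import bisect_right

# Hypothetical age bins (years): below 40, 40 to 59, 60 and above.
# The choice of bins is an assumption; pick edges that suit your cohort.
AGE_BIN_EDGES = [40, 60]
AGE_BIN_NAMES = ["young", "middle", "old"]


def age_to_class(age_years, edges=AGE_BIN_EDGES):
    """Return the integer class index of the bin that contains age_years."""
    return bisect_right(edges, age_years)


def label_cells(cells, donor_ages):
    """Return copies of the cell records with an added integer 'label'.

    cells: list of dicts, each with a 'donor' key (hypothetical schema).
    donor_ages: dict mapping donor ID to age in years.
    """
    labeled = []
    for cell in cells:
        age = donor_ages[cell["donor"]]
        labeled.append({**cell, "label": age_to_class(age)})
    return labeled


# Toy metadata standing in for real per-cell annotations.
cells = [
    {"cell_id": "c1", "donor": "d1"},
    {"cell_id": "c2", "donor": "d2"},
    {"cell_id": "c3", "donor": "d1"},
]
donor_ages = {"d1": 25, "d2": 67}
labeled_cells = label_cells(cells, donor_ages)
```

Every cell from the same donor receives the same label here, so it is probably worth splitting training and evaluation data by donor rather than by cell.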

To incorporate multiple cells, you could concatenate several cells into a single input, separated by a separator token that you would add to the token dictionary. You could then present the cells together to the model and again fine-tune it to classify them, this time as a group of cells. This would be analogous to fine-tuning BERT for sentiment analysis of a movie review, let's say, which would have multiple sentences (analogous to multiple cells). Because more genes are detected per cell than there are words per sentence, and the model may limit the total number of genes you can present at once, you may instead consider extracting Geneformer cell embeddings for each cell using the full rank value encoding and then presenting the group of cell embeddings to a model on top of Geneformer. Purely looking at the proportion of cell types present in a sample can be accomplished with a simpler approach, but this approach may enable the model to get a better understanding of cell interactions, since the group of cells and their embeddings are presented together to the model.
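The three ideas in this paragraph can be sketched in plain Python, with several assumptions. The separator token ID, the input length limit of 2048, the equal per-cell token budget, and all function names are hypothetical and not Geneformer APIs. Mean pooling is used only as the simplest possible stand-in for "a model on top of Geneformer"; a learned aggregator such as attention over the cell embeddings would be a natural replacement. The per-cell embeddings are assumed to have been extracted beforehand.

```python
from collections import Counter, defaultdict

SEP_TOKEN_ID = 99999  # hypothetical ID; in practice, add a new entry to the token dictionary
MAX_INPUT_LEN = 2048  # assumed input size limit; check your model's configuration


def concat_cells(tokenized_cells, sep_id=SEP_TOKEN_ID, max_len=MAX_INPUT_LEN):
    """Join rank-value-encoded cells into one sequence separated by sep_id.

    Each cell gets an equal share of the token budget and keeps only its
    top-ranked genes (the start of its token list).
    """
    n = len(tokenized_cells)
    per_cell = (max_len - (n - 1)) // n  # reserve n - 1 slots for separators
    if per_cell < 1:
        raise ValueError("too many cells for the input size")
    seq = []
    for i, cell in enumerate(tokenized_cells):
        if i > 0:
            seq.append(sep_id)
        seq.extend(cell[:per_cell])
    return seq


def mean_pool_by_sample(embeddings, sample_ids):
    """Average precomputed per-cell embeddings within each sample."""
    counts = Counter(sample_ids)
    sums = {}
    for vec, sid in zip(embeddings, sample_ids):
        total = sums.setdefault(sid, [0.0] * len(vec))
        for j, value in enumerate(vec):
            total[j] += value
    return {sid: [s / counts[sid] for s in total] for sid, total in sums.items()}


def celltype_proportions(cell_types, sample_ids):
    """Simpler baseline: fraction of each annotated cell type per sample."""
    per_sample = defaultdict(Counter)
    for ctype, sid in zip(cell_types, sample_ids):
        per_sample[sid][ctype] += 1
    return {
        sid: {ctype: n / sum(counter.values()) for ctype, n in counter.items()}
        for sid, counter in per_sample.items()
    }


# Toy inputs standing in for real tokenized cells, embeddings, and annotations.
sample_ids = ["s1", "s1", "s2"]
joined = concat_cells([[1, 2, 3, 4], [5, 6]], max_len=7)
pooled = mean_pool_by_sample([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], sample_ids)
proportions = celltype_proportions(["T", "B", "T"], sample_ids)
```

The pooled vectors (or the proportions, or both side by side) could then be fed to any sample-level classifier or regressor for age. Note that the equal-budget truncation in `concat_cells` discards lower-ranked genes, which is the limitation that motivates the embedding-based alternative.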

ctheodoris changed discussion status to closed

@ctheodoris, thanks for your suggestion!
I am new to transformers and Hugging Face. You mentioned:

This would be analogous to fine-tuning BERT for sentiment analysis of a movie review, let's say, which would have multiple sentences 
(analogous to multiple cells). 
Because more genes are detected per cell than there are words per sentence and the model may limit the number of total genes you are presenting to the model, 
you may instead consider extracting Geneformer cell embeddings for each cell using the full rank value encoding and then presenting the group of cell embeddings to a model on top of Geneformer. 

It is very interesting, but I currently have no idea how to implement this method. Could you please provide some examples or materials?

Thank you very much.
