ctheodoris/Geneformer · Suggestions for the results of in silico perturbation?

Aug 14, 2023

Hello, thanks for this great tool!
I am trying to use this software to reanalyze data from a published article.
The article compared single-cell data from normal and disease samples to identify JAK3 targets, and then treated patients with relevant drugs, which improved their condition. I would therefore like to analyze this data using Geneformer.

First, the article indicates that T cells are the main cause of the disease. I therefore input the T cells into the Geneformer model, labeling normal cells as 1 and disease cells as 0, to train a cell classification model. The model successfully distinguished between the two groups of cells.

Next, using the trained model, I knocked out several randomly chosen genes along with the JAK3 gene. The value obtained for JAK3 was not significantly different from that of the other genes, and the best-scoring genes were the most significantly differentially expressed NN genes between the two groups.

Could you please give some suggestions for this experiment to help me analyze the data using the Geneformer model?

Best

ctheodoris

Owner Aug 23, 2023

Thank you for your interest in Geneformer! Your fine-tuned model appears to distinguish the normal and disease state well. You can also quantify its predictive performance on held-out test data to ensure the model has not overfit to your training data. If it has overfit, you could consider training for only 1 epoch (if you trained for more than 1) or including data from multiple datasets (if available) to promote generalizability.
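
As a minimal sketch of the held-out check described above, the snippet below computes accuracy and macro-averaged F1 from true and predicted labels. It assumes you have already collected predictions for test cells that were never seen during fine-tuning; the function name `evaluate_predictions` and the demo labels are made up for illustration and are not part of Geneformer.

```python
def evaluate_predictions(true_labels, predicted_labels):
    """Return accuracy and macro-averaged F1 for two equal-length label lists."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label lists must be non-empty and of equal length")

    pairs = list(zip(true_labels, predicted_labels))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    # Per-class F1, then an unweighted mean so a rare class counts equally.
    f1_scores = []
    for cls in sorted(set(true_labels) | set(predicted_labels)):
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)

    return {"accuracy": accuracy, "macro_f1": sum(f1_scores) / len(f1_scores)}


# Hypothetical held-out labels: 1 = normal, 0 = disease (as in the question).
demo_true = [1, 1, 1, 0, 0, 0, 0, 1]
demo_pred = [1, 1, 0, 0, 0, 0, 1, 1]
demo_metrics = evaluate_predictions(demo_true, demo_pred)
print(demo_metrics)
```

A large gap between these numbers on training cells and on test cells would be one sign of overfitting. Splitting by sample or donor rather than by individual cell is usually a stricter test, if your data allows it.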

Regarding the in silico perturbation, it is expected that genes that are significantly differentially expressed between the classes would contribute to the model's embedding of the two states. However, in our experience, the model does not separate the classes based strictly on these significantly differentially expressed genes, likely because it has baseline knowledge from the large-scale pretraining that affects its interpretation of each cell state. You could consider taking an unbiased approach of perturbing all genes (genes_to_perturb="all") to determine the genes whose perturbation most significantly shifts the embedding between the start and goal state. If you performed the in silico perturbation on the last embedding layer (emb_layer=0), you could consider testing the second-to-last layer (emb_layer=-1), as this layer may capture a more general representation of the cell state, while the final layer is more specifically tied to the learning objective, in this case predicting the normal vs. disease classes. You could also consider freezing more layers (e.g. freezing 4 and training 2) while fine-tuning the cell classifier to retain more general attention weights from the pretraining phase.
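
To illustrate the "freezing 4 and training 2" suggestion, here is a rough sketch of freezing the lower encoder layers by parameter name. It assumes a Hugging Face BERT-style model whose parameters are named like `bert.encoder.layer.<i>.…`; the helper `freeze_lower_encoder_layers` is hypothetical and is not Geneformer's own implementation. The demo uses stand-in objects so it runs without PyTorch; with a real model you would pass `model.named_parameters()`.

```python
import re
from types import SimpleNamespace


def freeze_lower_encoder_layers(named_parameters, n_frozen):
    """Set requires_grad=False on parameters of encoder layers 0..n_frozen-1.

    `named_parameters` is any iterable of (name, parameter) pairs, for example
    the result of `model.named_parameters()`. Returns the frozen names.
    """
    layer_pattern = re.compile(r"encoder\.layer\.(\d+)\.")
    frozen = []
    for name, param in named_parameters:
        match = layer_pattern.search(name)
        if match and int(match.group(1)) < n_frozen:
            param.requires_grad = False
            frozen.append(name)
    return frozen


# Stand-in for a 6-layer encoder plus a classification head (assumed names).
demo_params = [
    (f"bert.encoder.layer.{i}.output.dense.weight", SimpleNamespace(requires_grad=True))
    for i in range(6)
]
demo_params.append(("classifier.weight", SimpleNamespace(requires_grad=True)))

frozen_names = freeze_lower_encoder_layers(demo_params, n_frozen=4)
trainable_names = [name for name, p in demo_params if p.requires_grad]
print("frozen:", frozen_names)
print("trainable:", trainable_names)
```

In this sketch only the encoder layers are matched, so the embedding parameters and the classification head stay trainable; whether to freeze the embeddings as well is a separate choice.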

ctheodoris changed discussion status to closed Aug 23, 2023