Please help with "genecorpus_30M_2048_lengths.pkl"?
#232 by jinbo1129 - opened
Hello, thanks for this great tool, Geneformer!
I am trying to pretrain a new model with my own datasets,
but I do not know how to generate the file "genecorpus_30M_2048_lengths.pkl".
Is it a dictionary like this: {
"cell_name" : num_of_genes_expressed
}
# define the trainer
trainer = GeneformerPretrainer(
    model=model,
    args=training_args,
    # pretraining corpus (e.g. https://huggingface.co./datasets/ctheodoris/Genecorpus-30M/tree/main/genecorpus_30M_2048.dataset)
    train_dataset=load_from_disk("genecorpus_30M_2048.dataset"),
    # file of lengths of each example cell (e.g. https://huggingface.co./datasets/ctheodoris/Genecorpus-30M/blob/main/genecorpus_30M_2048_lengths.pkl)
    example_lengths_file="genecorpus_30M_2048_lengths.pkl",
    token_dictionary=token_dictionary,
)
Thanks !!!
Thank you for your question and interest in Geneformer! Please check out this closed discussion: https://huggingface.co./ctheodoris/Geneformer/discussions/61
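For readers who land here first, below is a minimal sketch of one way to build such a lengths file. It rests on assumptions that are not confirmed in this thread: that the file is a pickled list of integers (one per cell, in the same order as the dataset) rather than a dictionary keyed by cell name, and that each tokenized example stores its tokens in an "input_ids" column. The helper names and the output file name "my_corpus_lengths.pkl" are hypothetical, and the toy list only stands in for a real tokenized dataset.

```python
import os
import pickle
import tempfile
from typing import Iterable, List, Mapping, Sequence


def compute_example_lengths(
    examples: Iterable[Mapping[str, Sequence[int]]],
) -> List[int]:
    """Return the number of tokens in each example, preserving dataset order.

    Assumption: each example keeps its token ids under "input_ids".
    """
    return [len(example["input_ids"]) for example in examples]


def save_example_lengths(lengths: List[int], path: str) -> None:
    """Pickle the list of lengths (assumed format: a plain list of ints)."""
    with open(path, "wb") as f:
        pickle.dump(lengths, f)


# Toy stand-in for a tokenized dataset; a real one would come from
# something like datasets.load_from_disk("<your corpus>.dataset").
toy_dataset = [
    {"input_ids": [5, 17, 42]},
    {"input_ids": [8, 3]},
    {"input_ids": [11, 2, 9, 30]},
]

lengths = compute_example_lengths(toy_dataset)

# Hypothetical output file name, written to a temporary directory here.
lengths_path = os.path.join(tempfile.gettempdir(), "my_corpus_lengths.pkl")
save_example_lengths(lengths, lengths_path)
```

If this assumption holds, the resulting path would be what gets passed as example_lengths_file. The order of the list matters because the lengths are matched to examples by position, so the file needs to be regenerated whenever the dataset is shuffled or filtered.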
ctheodoris changed discussion status to closed