Please help with "genecorpus_30M_2048_lengths.pkl"?

#232
by jinbo1129 - opened

Hello, and thanks for this great tool, Geneformer!

I am trying to pretrain a new model with my own datasets.

But I do not know how to generate the file "genecorpus_30M_2048_lengths.pkl".

Is it a dictionary like this: {
"cell_name" : num_of_genes_expressed
}

# define the trainer
trainer = GeneformerPretrainer(
    model=model,
    args=training_args,
    # pretraining corpus (e.g. https://huggingface.co./datasets/ctheodoris/Genecorpus-30M/tree/main/genecorpus_30M_2048.dataset)
    train_dataset=load_from_disk("genecorpus_30M_2048.dataset"),
    # file of lengths of each example cell (e.g. https://huggingface.co./datasets/ctheodoris/Genecorpus-30M/blob/main/genecorpus_30M_2048_lengths.pkl)
    example_lengths_file="genecorpus_30M_2048_lengths.pkl",
    token_dictionary=token_dictionary,
)

Thanks!

Thank you for your question and interest in Geneformer! Please check out this closed discussion: https://huggingface.co./ctheodoris/Geneformer/discussions/61
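
For readers who land here first, below is a minimal sketch of one way such a lengths file could be produced. It rests on assumptions that this thread does not confirm: that the file is a pickled plain list of integers (one token count per cell, in the same order as the tokenized dataset) rather than a dictionary keyed by cell name, and that each example stores its tokens under an "input_ids" key. The helper names compute_example_lengths and save_example_lengths and the file names in the comments are hypothetical, not part of Geneformer. Please check the linked discussion before relying on it.

```python
import pickle
from typing import Iterable, List, Mapping, Sequence


def compute_example_lengths(
    examples: Iterable[Mapping[str, Sequence[int]]],
) -> List[int]:
    """Return the token count of each example, preserving dataset order.

    Assumption: every example exposes its tokens under "input_ids".
    """
    return [len(example["input_ids"]) for example in examples]


def save_example_lengths(lengths: List[int], path: str) -> None:
    """Pickle the list of lengths to `path` (assumed format: plain list of ints)."""
    with open(path, "wb") as handle:
        pickle.dump(lengths, handle)


# Hypothetical usage with a tokenized dataset saved on disk:
#   from datasets import load_from_disk
#   dataset = load_from_disk("my_corpus.dataset")
#   save_example_lengths(compute_example_lengths(dataset), "my_corpus_lengths.pkl")
```

Iterating over the whole dataset can be slow for a large corpus. If the tokenized dataset already carries a per-cell length column, reading that column directly should give the same list more quickly.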

ctheodoris changed discussion status to closed
