---
license: apache-2.0
datasets:
- dnagpt/human_genome_GCF_009914755.1
language:
- en
metrics:
- perplexity
library_name: transformers
tags:
- biology
---

A DNA language model based on the GPT-2 architecture, trained on human genome data.

Key features of our dnagpt models:

1. BPE tokenization instead of k-mers (DNABERT and DNABERT-2 also use BPE); see the comparison sketch after this list
2. GPT architecture rather than BERT (as used by DNABERT and GENA_LM)
3. Pre-training on the latest T2T human genome assembly
4. Details: https://github.com/maris205/dnagpt (includes training and BPE tokenizer code)
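
For illustration, here is a minimal sketch contrasting overlapping k-mer splitting (the scheme used by earlier BERT-style DNA models) with this repository's learned BPE vocabulary. The `kmer_tokenize` helper and the choice of k = 6 are hypothetical, included only for the comparison.

```python
from transformers import AutoTokenizer

def kmer_tokenize(seq, k=6):
    """Hypothetical helper: split a sequence into overlapping k-mers (DNABERT-style)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

dna = "GAGCACATTCGCCTGCGTGCGCACTCACACACACGTTCAAAAAGAGTCCATTCGATTCTGGCAGTAG"

# Fixed-length k-mers: every token has length k and neighbouring tokens overlap
print(kmer_tokenize(dna)[:5])  # ['GAGCAC', 'AGCACA', 'GCACAT', 'CACATT', 'ACATTC']

# Learned BPE: variable-length tokens mined from the human genome corpus
bpe_tokenizer = AutoTokenizer.from_pretrained("dnagpt/human_gpt2-v1")
print(bpe_tokenizer.tokenize(dna)[:5])  # e.g. ['G', 'AGCAC', 'ATTCGCC', ...]
```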
Basic usage: load the tokenizer and model, then extract sequence embeddings.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# BPE tokenizer trained on the human genome
tokenizer = AutoTokenizer.from_pretrained('dnagpt/human_gpt2-v1')

tokenizer.tokenize("GAGCACATTCGCCTGCGTGCGCACTCACACACACGTTCAAAAAGAGTCCATTCGATTCTGGCAGTAG")
# result: ['G', 'AGCAC', 'ATTCGCC', ...]

# GPT-2 model; its hidden states serve as sequence embeddings
model = AutoModel.from_pretrained('dnagpt/human_gpt2-v1')

dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors='pt')["input_ids"]
hidden_states = model(inputs)[0]  # [1, sequence_length, 768]

# embedding with mean pooling
embedding_mean = torch.mean(hidden_states[0], dim=0)
print(embedding_mean.shape)  # expected: torch.Size([768])

# embedding with max pooling
embedding_max = torch.max(hidden_states[0], dim=0)[0]
print(embedding_max.shape)  # expected: torch.Size([768])
```
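
Since perplexity is the reported metric, the following is a minimal sketch of how it could be computed on a held-out DNA sequence. It assumes the checkpoint includes the GPT-2 language-modelling head and loads via `AutoModelForCausalLM`; if only the base model weights are published, the head would need to be obtained separately.

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('dnagpt/human_gpt2-v1')
# Assumption: the repository also contains the causal-LM head weights
model = AutoModelForCausalLM.from_pretrained('dnagpt/human_gpt2-v1')
model.eval()

dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors='pt')

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"perplexity: {math.exp(loss.item()):.2f}")
```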