---
license: apache-2.0
datasets:
- dnagpt/human_genome_GCF_009914755.1
language:
- en
metrics:
- perplexity
library_name: transformers
tags:
- biology
---

A DNA language model based on the GPT-2 architecture, trained on human genome data.

Key features of our dnagpt models:

1. BPE tokenization instead of k-mers (DNABERT and DNABERT-2 also use BPE); see the comparison sketch after this list
2. GPT architecture rather than BERT (as used by DNABERT and GENA_LM)
3. Pre-training on the latest T2T human genome assembly
4. Details: https://github.com/maris205/dnagpt (includes training and BPE tokenizer code)
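
For illustration, here is a minimal sketch contrasting overlapping k-mer splitting (the scheme used by earlier BERT-style DNA models) with this repository's learned BPE vocabulary. The `kmer_tokenize` helper and the choice of k = 6 are hypothetical, included only for the comparison.

```python
from transformers import AutoTokenizer

def kmer_tokenize(seq, k=6):
    """Hypothetical helper: split a sequence into overlapping k-mers (DNABERT-style)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

dna = "GAGCACATTCGCCTGCGTGCGCACTCACACACACGTTCAAAAAGAGTCCATTCGATTCTGGCAGTAG"

# Fixed-length k-mers: every token has length k and neighbouring tokens overlap
print(kmer_tokenize(dna)[:5])  # ['GAGCAC', 'AGCACA', 'GCACAT', 'CACATT', 'ACATTC']

# Learned BPE: variable-length tokens mined from the human genome corpus
bpe_tokenizer = AutoTokenizer.from_pretrained("dnagpt/human_gpt2-v1")
print(bpe_tokenizer.tokenize(dna)[:5])  # e.g. ['G', 'AGCAC', 'ATTCGCC', ...]
```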
Basic usage: load the tokenizer and model, then extract sequence embeddings.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# BPE tokenizer trained on the human genome
tokenizer = AutoTokenizer.from_pretrained('dnagpt/human_gpt2-v1')

tokenizer.tokenize("GAGCACATTCGCCTGCGTGCGCACTCACACACACGTTCAAAAAGAGTCCATTCGATTCTGGCAGTAG")
# result: ['G', 'AGCAC', 'ATTCGCC', ...]

# GPT-2 model; its hidden states serve as sequence embeddings
model = AutoModel.from_pretrained('dnagpt/human_gpt2-v1')

dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors='pt')["input_ids"]
hidden_states = model(inputs)[0]  # [1, sequence_length, 768]

# embedding with mean pooling
embedding_mean = torch.mean(hidden_states[0], dim=0)
print(embedding_mean.shape)  # expected: torch.Size([768])

# embedding with max pooling
embedding_max = torch.max(hidden_states[0], dim=0)[0]
print(embedding_max.shape)  # expected: torch.Size([768])
```
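
Since perplexity is the reported metric, the following is a minimal sketch of how it could be computed on a held-out DNA sequence. It assumes the checkpoint includes the GPT-2 language-modelling head and loads via `AutoModelForCausalLM`; if only the base model weights are published, the head would need to be obtained separately.

```python
import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained('dnagpt/human_gpt2-v1')
# Assumption: the repository also contains the causal-LM head weights
model = AutoModelForCausalLM.from_pretrained('dnagpt/human_gpt2-v1')
model.eval()

dna = "ACGTAGCATCGGATCTATCTATCGACACTTGGTTATCGATCTACGAGCATCTCGTTAGC"
inputs = tokenizer(dna, return_tensors='pt')

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss
    loss = model(**inputs, labels=inputs["input_ids"]).loss

print(f"perplexity: {math.exp(loss.item()):.2f}")
```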