---
license: apache-2.0
---

## Model Overview

PlantCaduceus is a DNA language model pre-trained on 16 angiosperm genomes. Built on the [Caduceus](https://caduceus-dna.github.io/) and [Mamba](https://arxiv.org/abs/2312.00752) architectures and trained with a masked language modeling objective, PlantCaduceus learns evolutionary conservation and DNA sequence grammar from 16 species spanning 160 million years of evolutionary history. We have trained a series of PlantCaduceus models with varying parameter sizes:

- **[PlantCaduceus_l20](https://huggingface.co/kuleshov-group/PlantCaduceus_l20)**: 20 layers, 384 hidden size, 20M parameters
- **[PlantCaduceus_l24](https://huggingface.co/kuleshov-group/PlantCaduceus_l24)**: 24 layers, 512 hidden size, 40M parameters
- **[PlantCaduceus_l28](https://huggingface.co/kuleshov-group/PlantCaduceus_l28)**: 28 layers, 768 hidden size, 112M parameters
- **[PlantCaduceus_l32](https://huggingface.co/kuleshov-group/PlantCaduceus_l32)**: 32 layers, 1024 hidden size, 225M parameters
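To confirm which checkpoint you have loaded, you can count its parameters directly. This is a minimal sketch using standard Hugging Face loading (the same call as in the usage example below); the printed count may differ slightly from the rounded figures above.

```python
from transformers import AutoModelForMaskedLM

# Load a checkpoint and report its size (here the smallest, 20M-parameter model)
model = AutoModelForMaskedLM.from_pretrained("kuleshov-group/PlantCaduceus_l20", trust_remote_code=True)
n_params = sum(p.numel() for p in model.parameters())
print(f"~{n_params / 1e6:.0f}M parameters")
```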

**We highly recommend using the largest model ([PlantCaduceus_l32](https://huggingface.co/kuleshov-group/PlantCaduceus_l32)) for zero-shot score estimation.**
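As an illustration of what zero-shot scoring looks like, the sketch below masks a variant position and compares the model's log-probabilities for the reference and alternate alleles. This is an assumption-laden sketch, not the exact pipeline from the paper: it presumes the tokenizer maps one nucleotide to one token, adds no special tokens, and exposes a `[MASK]` token via `mask_token_id`; the sequence, variant position, and alleles are hypothetical.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

model_path = "kuleshov-group/PlantCaduceus_l32"  # largest model, recommended for zero-shot scoring
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = AutoModelForMaskedLM.from_pretrained(model_path, trust_remote_code=True, device_map=device)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

sequence = "ATGCGTACGATCGTAG"  # in practice, a window of flanking sequence centered on the variant
pos, ref, alt = 8, "G", "A"   # hypothetical variant: 0-based position, reference and alternate alleles

input_ids = tokenizer.encode_plus(sequence, return_tensors="pt")["input_ids"].to(device)
input_ids[0, pos] = tokenizer.mask_token_id  # mask the variant position

with torch.inference_mode():
    logits = model(input_ids=input_ids).logits

# Log-likelihood ratio between alternate and reference alleles at the masked position
log_probs = torch.log_softmax(logits[0, pos], dim=-1)
score = (log_probs[tokenizer.convert_tokens_to_ids(alt)]
         - log_probs[tokenizer.convert_tokens_to_ids(ref)]).item()
print(f"zero-shot LLR: {score:.4f}")  # more negative suggests the alternate allele is disfavored
```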

## How to use
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

# Load the pre-trained model and its tokenizer
model_path = 'kuleshov-group/PlantCaduceus_l20'
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = AutoModelForMaskedLM.from_pretrained(model_path, trust_remote_code=True, device_map=device)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Tokenize a DNA sequence
sequence = "ATGCGTACGATCGTAG"
encoding = tokenizer.encode_plus(
    sequence,
    return_tensors="pt",
    return_attention_mask=False,
    return_token_type_ids=False
)
input_ids = encoding["input_ids"].to(device)

# Forward pass, keeping the per-layer hidden states
with torch.inference_mode():
    outputs = model(input_ids=input_ids, output_hidden_states=True)
```
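The forward pass above requests hidden states but does not show how to use them. One common choice, shown below as a sketch rather than a prescribed recipe, is to mean-pool the last layer over sequence positions to obtain a fixed-length embedding (`outputs.hidden_states` is the standard Hugging Face tuple of per-layer activations):

```python
# Continues from the snippet above
last_hidden = outputs.hidden_states[-1]  # (batch, seq_len, hidden_size)
embedding = last_hidden.mean(dim=1)      # mean-pool over positions -> (batch, hidden_size)
print(embedding.shape)
```

Embeddings like this can serve as input features for downstream classifiers.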

## Citation
```bibtex
@article{Zhai2024.06.04.596709,
    author = {Zhai, Jingjing and Gokaslan, Aaron and Schiff, Yair and Berthel, Ana and Liu, Zong-Yan and Miller, Zachary R. and Scheben, Armin and Stitzer, Michelle C. and Romay, Cinta and Buckler, Edward S. and Kuleshov, Volodymyr},
    title = {Cross-species plant genomes modeling at single nucleotide resolution using a pre-trained DNA language model},
    elocation-id = {2024.06.04.596709},
    year = {2024},
    doi = {10.1101/2024.06.04.596709},
    URL = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596709},
    eprint = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596709.full.pdf},
    journal = {bioRxiv}
}
```

## Contact
Jingjing Zhai ([email protected])