---
license: apache-2.0
---

## Model Overview

PlantCaduceus is a DNA language model pre-trained on 16 angiosperm genomes. Built on the [Caduceus](https://caduceus-dna.github.io/) and [Mamba](https://arxiv.org/abs/2312.00752) architectures and trained with a masked language modeling objective, PlantCaduceus learns evolutionary conservation and DNA sequence grammar from 16 species spanning 160 million years of evolutionary history. We have trained a series of PlantCaduceus models with varying parameter sizes:

- **[PlantCaduceus_l20](https://huggingface.co/kuleshov-group/PlantCaduceus_l20)**: 20 layers, 384 hidden size, 20M parameters
- **[PlantCaduceus_l24](https://huggingface.co/kuleshov-group/PlantCaduceus_l24)**: 24 layers, 512 hidden size, 40M parameters
- **[PlantCaduceus_l28](https://huggingface.co/kuleshov-group/PlantCaduceus_l28)**: 28 layers, 768 hidden size, 112M parameters
- **[PlantCaduceus_l32](https://huggingface.co/kuleshov-group/PlantCaduceus_l32)**: 32 layers, 1024 hidden size, 225M parameters
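To confirm which checkpoint you have loaded, you can count its parameters directly. This is a minimal sketch using standard Hugging Face loading (the same call as in the usage example below); the printed count may differ slightly from the rounded figures above.

```python
from transformers import AutoModelForMaskedLM

# Load a checkpoint and report its size (here the smallest, 20M-parameter model)
model = AutoModelForMaskedLM.from_pretrained("kuleshov-group/PlantCaduceus_l20", trust_remote_code=True)
n_params = sum(p.numel() for p in model.parameters())
print(f"~{n_params / 1e6:.0f}M parameters")
```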

**We highly recommend using the largest model ([PlantCaduceus_l32](https://huggingface.co/kuleshov-group/PlantCaduceus_l32)) for zero-shot score estimation.**
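As an illustration of what zero-shot scoring looks like, the sketch below masks a variant position and compares the model's log-probabilities for the reference and alternate alleles. This is an assumption-laden sketch, not the exact pipeline from the paper: it presumes the tokenizer maps one nucleotide to one token, adds no special tokens, and exposes a `[MASK]` token via `mask_token_id`; the sequence, variant position, and alleles are hypothetical.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

model_path = "kuleshov-group/PlantCaduceus_l32"  # largest model, recommended for zero-shot scoring
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = AutoModelForMaskedLM.from_pretrained(model_path, trust_remote_code=True, device_map=device)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

sequence = "ATGCGTACGATCGTAG"  # in practice, a window of flanking sequence centered on the variant
pos, ref, alt = 8, "G", "A"   # hypothetical variant: 0-based position, reference and alternate alleles

input_ids = tokenizer.encode_plus(sequence, return_tensors="pt")["input_ids"].to(device)
input_ids[0, pos] = tokenizer.mask_token_id  # mask the variant position

with torch.inference_mode():
    logits = model(input_ids=input_ids).logits

# Log-likelihood ratio between alternate and reference alleles at the masked position
log_probs = torch.log_softmax(logits[0, pos], dim=-1)
score = (log_probs[tokenizer.convert_tokens_to_ids(alt)]
         - log_probs[tokenizer.convert_tokens_to_ids(ref)]).item()
print(f"zero-shot LLR: {score:.4f}")  # more negative suggests the alternate allele is disfavored
```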

## How to use
```python
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

# Load the pre-trained model and its tokenizer
model_path = 'kuleshov-group/PlantCaduceus_l20'
device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = AutoModelForMaskedLM.from_pretrained(model_path, trust_remote_code=True, device_map=device)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Tokenize a DNA sequence
sequence = "ATGCGTACGATCGTAG"
encoding = tokenizer.encode_plus(
    sequence,
    return_tensors="pt",
    return_attention_mask=False,
    return_token_type_ids=False
)
input_ids = encoding["input_ids"].to(device)

# Forward pass, keeping the per-layer hidden states
with torch.inference_mode():
    outputs = model(input_ids=input_ids, output_hidden_states=True)
```
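The forward pass above requests hidden states but does not show how to use them. One common choice, shown below as a sketch rather than a prescribed recipe, is to mean-pool the last layer over sequence positions to obtain a fixed-length embedding (`outputs.hidden_states` is the standard Hugging Face tuple of per-layer activations):

```python
# Continues from the snippet above
last_hidden = outputs.hidden_states[-1]  # (batch, seq_len, hidden_size)
embedding = last_hidden.mean(dim=1)      # mean-pool over positions -> (batch, hidden_size)
print(embedding.shape)
```

Embeddings like this can serve as input features for downstream classifiers.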

## Citation
```bibtex
@article{Zhai2024.06.04.596709,
    author = {Zhai, Jingjing and Gokaslan, Aaron and Schiff, Yair and Berthel, Ana and Liu, Zong-Yan and Miller, Zachary R. and Scheben, Armin and Stitzer, Michelle C. and Romay, Cinta and Buckler, Edward S. and Kuleshov, Volodymyr},
    title = {Cross-species plant genomes modeling at single nucleotide resolution using a pre-trained DNA language model},
    elocation-id = {2024.06.04.596709},
    year = {2024},
    doi = {10.1101/2024.06.04.596709},
    URL = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596709},
    eprint = {https://www.biorxiv.org/content/early/2024/06/05/2024.06.04.596709.full.pdf},
    journal = {bioRxiv}
}
```

## Contact
Jingjing Zhai ([email protected])