---
license: cc-by-nc-sa-4.0
widget:
- text: >-
    AGTCGCCGCAACCCACACACGGACGGCTCGACGTGGCGATCTTAGCGGCTCATCCGCCCGGCCTCCCTCGCGCTCGATCGCTACGCAGCCTACGCTCGTTTCGCTCGGTTCGGTGGGTCGCCGATCTGGCGCCACGGCGGCTACCAACGACACCGCGATTGAGAAGGGTGCGTGGCCGTGGAGTCGTGGAGAAACGCCCGCGCGCGCGGGTGCGGCGAGGGACGACGACCGCGTCGTGCGGATCGATTGGCGGGGCAGCTCGGCGCCCCG
tags:
- DNA
- biology
- genomics
datasets:
- zhangtaolab/plant-multi-species-histone-modifications
metrics:
- accuracy
base_model:
- zhangtaolab/plant-dnamamba-BPE
---
# Plant foundation DNA large language models
The plant DNA large language models (LLMs) are a series of foundation models built on different model architectures and pre-trained on various plant reference genomes.
All the models have a comparable size, between 90 MB and 150 MB. A BPE tokenizer is used for tokenization, with a vocabulary of 8,000 tokens.
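
As a rough illustration of how byte-pair encoding (BPE) builds multi-nucleotide tokens, the sketch below learns merge rules from a toy DNA corpus in plain Python. It is not the tokenizer shipped with these models: the function names (`learn_bpe_merges`, `apply_merge`, `tokenize`) and the toy corpus are hypothetical, and a real 8,000-token vocabulary would be learned from whole genomes with a dedicated tokenizer library.

```python
from collections import Counter


def apply_merge(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe_merges(sequences, num_merges):
    """Learn up to `num_merges` merge rules from a list of DNA strings."""
    # Start from single-nucleotide tokens.
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent token pair across the corpus.
        pair_counts = Counter()
        for tokens in corpus:
            pair_counts.update(zip(tokens, tokens[1:]))
        if not pair_counts:
            break  # nothing left to merge
        # Merge the most frequent pair into a single new token.
        best_pair, _ = pair_counts.most_common(1)[0]
        merges.append(best_pair)
        corpus = [apply_merge(tokens, best_pair) for tokens in corpus]
    return merges


def tokenize(sequence, merges):
    """Tokenize a DNA string by replaying the learned merges in order."""
    tokens = list(sequence)
    for pair in merges:
        tokens = apply_merge(tokens, pair)
    return tokens


# Toy corpus (hypothetical): two merges are enough to form the token "ACG".
merges = learn_bpe_merges(["ACGACGACG"], num_merges=2)
print(tokenize("ACGACG", merges))  # ['ACG', 'ACG']
```

Frequent motifs end up as single tokens, so a sequence is represented with far fewer tokens than one per nucleotide, while joining the tokens always gives back the original sequence.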
**Developed by:** zhangtaolab
### Model Sources
- **Repository:** [Plant DNA LLMs](https://github.com/zhangtaolab/plant_DNA_LLMs)
- **Manuscript:** [Versatile applications of foundation DNA language models in plant genomes]()
### Architecture
The model is based on the state-space Mamba-130m model, with a modified tokenizer specific to DNA sequences.
This model is fine-tuned for predicting H3K27ac histone modification.
### How to use
Install the runtime libraries first:
```bash
pip install transformers
# Quote the version specifiers so the shell does not treat "<" as a redirect
pip install "causal-conv1d<=1.2.0"
pip install "mamba-ssm<2.0.0"
```
Because the `transformers` library (versions below 4.43.0) does not provide a `MambaForSequenceClassification` class, we wrote our own script to train the Mamba model for sequence classification.
Inference code can be found in our [GitHub](https://github.com/zhangtaolab/plant_DNA_LLMs) repository.
Note that the Plant DNAMamba model requires an NVIDIA GPU to run.
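
Before loading the model, it can help to confirm that the version pins from the install commands are satisfied and that an NVIDIA driver appears to be present. The following is a minimal sketch using only the Python standard library. The helper names (`parse_version`, `satisfies`, `check_environment`) are hypothetical, the version parsing is deliberately simplified (it ignores pre-release ordering), and finding `nvidia-smi` on the `PATH` is only a heuristic for GPU availability.

```python
import shutil
from importlib import metadata

# Version pins copied from the install commands in this section.
CONSTRAINTS = {
    "causal-conv1d": ("<=", (1, 2, 0)),
    "mamba-ssm": ("<", (2, 0, 0)),
}


def parse_version(text):
    """Turn '1.2.0.post2' into (1, 2, 0); parsing stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def satisfies(version, op, bound):
    """Compare two version tuples with the given operator."""
    if op == "<=":
        return version <= bound
    if op == "<":
        return version < bound
    raise ValueError(f"unsupported operator: {op}")


def check_environment():
    """Report installed versions against the pins, plus a GPU driver heuristic."""
    report = {}
    for name, (op, bound) in CONSTRAINTS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "not installed"
            continue
        ok = satisfies(parse_version(installed), op, bound)
        report[name] = f"{installed} ({'ok' if ok else 'outside the pinned range'})"
    # Heuristic: the NVIDIA driver normally installs the nvidia-smi tool.
    report["nvidia-smi on PATH"] = shutil.which("nvidia-smi") is not None
    return report


for item, status in check_environment().items():
    print(f"{item}: {status}")
```

If PyTorch is installed, `torch.cuda.is_available()` is a more direct test of whether a usable GPU is visible.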
### Training data
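The model is fine-tuned on the `zhangtaolab/plant-multi-species-histone-modifications` dataset (listed in the metadata of this card) using a custom `MambaForSequenceClassification` script.
The detailed training procedure can be found in our manuscript.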
#### Hardware
The model was trained on an NVIDIA RTX 4090 GPU (24 GB).