---
license: mit
pipeline_tag: mask-generation
tags:
- biology
- metagenomics
- Roberta
---
### Leveraging Large Language Models for Metagenomic Analysis

**Model Overview:**
This model builds on the RoBERTa architecture, following the approach described in our paper "Leveraging Large Language Models for Metagenomic Analysis." It was trained for one epoch on V100 GPUs.

**Model Architecture:**
- **Base Model:** RoBERTa transformer architecture
- **Tokenizer:** Custom K-mer tokenizer with a k-mer length of 6 and overlapping tokens (see the sketch after this list)
- **Training:** Trained on a diverse dataset of 220 million 400 bp fragments from 18k genomes (Bacteria and Archaea)
- **Embeddings:** Generates sequence embeddings by mean pooling of the hidden states
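
For intuition about the tokenizer, the snippet below shows what overlapping 6-mer splitting looks like before token IDs are assigned. The `overlapping_kmers` helper is only an illustration of the concept, not part of the KmerTokenizer API; the real tokenizer also maps each k-mer to an integer ID and pads or truncates to `maxlen`.

```python
# Conceptual sketch only: overlapping k-mer splitting (stride 1).
# The actual KmerTokenizer additionally maps each k-mer to an integer ID
# and pads/truncates the token sequence to maxlen.
def overlapping_kmers(seq: str, k: int = 6) -> list[str]:
    """Return all overlapping k-mers of a DNA sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(overlapping_kmers("ATCGATCGA"))
# ['ATCGAT', 'TCGATC', 'CGATCG', 'GATCGA']
```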

**Steps to Use the Model:**

1. **Install KmerTokenizer:**

```sh
pip install git+https://github.com/MsAlEhR/KmerTokenizer.git
```

2. **Example Code:**
```python
from KmerTokenizer import KmerTokenizer
from transformers import AutoModel
import torch

# Example gene sequence
seq = "ATTTTTTTTTTTCCCCCCCCCCCGGGGGGGGATCGATGC"

# Initialize the tokenizer (overlapping 6-mers, fixed length of 400)
tokenizer = KmerTokenizer(kmerlen=6, overlapping=True, maxlen=400)
tokenized_output = tokenizer.kmer_tokenize(seq)
pad_token_id = 2  # Set pad token ID

# Create attention mask (1 for tokens, 0 for padding) and add batch dimension
attention_mask = torch.tensor([1 if token != pad_token_id else 0 for token in tokenized_output], dtype=torch.long).unsqueeze(0)

# Convert tokenized output to LongTensor and add batch dimension
inputs = torch.tensor([tokenized_output], dtype=torch.long)

# Load the pre-trained BigBird model with hidden states exposed
model = AutoModel.from_pretrained("MsAlEhR/MetaBERTa-bigbird-gene", output_hidden_states=True)

# Generate hidden states
outputs = model(input_ids=inputs, attention_mask=attention_mask)

# Get embeddings from the last hidden state
embeddings = outputs.hidden_states[-1]

# Expand attention mask to match the embedding dimensions
expanded_attention_mask = attention_mask.unsqueeze(-1)

# Compute mean sequence embeddings over non-padding positions
mean_sequence_embeddings = torch.sum(expanded_attention_mask * embeddings, dim=1) / torch.sum(expanded_attention_mask, dim=1)
```
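
As a usage sketch, the embeddings produced above can be compared directly between sequences. The snippet below wraps the same steps into a helper and scores two sequences with cosine similarity; it reuses `tokenizer`, `model`, `pad_token_id`, and `seq` from the example, and the `embed` function and second sequence are illustrative additions, not part of the model card.

```python
import torch.nn.functional as F

def embed(sequence: str) -> torch.Tensor:
    """Mean-pooled embedding for one sequence, reusing the tokenizer and model above."""
    ids = tokenizer.kmer_tokenize(sequence)
    mask = torch.tensor([1 if t != pad_token_id else 0 for t in ids], dtype=torch.long).unsqueeze(0)
    input_ids = torch.tensor([ids], dtype=torch.long)
    with torch.no_grad():
        hidden = model(input_ids=input_ids, attention_mask=mask).hidden_states[-1]
    mask = mask.unsqueeze(-1)
    return torch.sum(mask * hidden, dim=1) / torch.sum(mask, dim=1)

# Compare two sequences by cosine similarity of their embeddings
seq2 = "ATTTTTTTTTTTCCCCCCCCCCCGGGGGGGGATCGATGG"  # hypothetical second sequence
similarity = F.cosine_similarity(embed(seq), embed(seq2))
print(similarity.item())
```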

**Citation:**
For a detailed overview of leveraging large language models for metagenomic analysis, refer to our papers:
> Refahi, M.S., Sokhansanj, B.A., & Rosen, G.L. (2023). Leveraging Large Language Models for Metagenomic Analysis. *IEEE SPMB*.
>
> Refahi, M., Sokhansanj, B.A., Mell, J.C., Brown, J., Yoo, H., Hearne, G., & Rosen, G. (2024). Scorpio: Enhancing Embeddings to Improve Downstream Analysis of DNA Sequences. *bioRxiv*, 2024-07.