---
license: mit
pipeline_tag: mask-generation
tags:
- biology
- metagenomics
- Roberta
---
### Leveraging Large Language Models for Metagenomic Analysis

**Model Overview:**
This model builds on the RoBERTa architecture, following the approach described in our paper "Leveraging Large Language Models for Metagenomic Analysis." It was trained for one epoch on V100 GPUs.

**Model Architecture:**
- **Base Model:** RoBERTa transformer architecture
- **Tokenizer:** Custom k-mer tokenizer with a k-mer length of 6 and overlapping tokens (see the sketch after this list)
- **Training:** Trained on a diverse dataset of 220 million 400 bp fragments from 18k genomes (Bacteria and Archaea)
- **Embeddings:** Generates sequence embeddings using mean pooling of hidden states

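For intuition about what overlapping 6-mer tokenization produces, here is a minimal sketch (illustrative only, not the KmerTokenizer implementation; the real tokenizer also maps each k-mer to a vocabulary ID and pads up to `maxlen`):

```python
def overlapping_kmers(seq: str, k: int = 6) -> list:
    # Slide a window of width k one base at a time (stride 1),
    # so consecutive k-mers overlap by k - 1 bases.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(overlapping_kmers("ATCGATGC"))
# ['ATCGAT', 'TCGATG', 'CGATGC']
```
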
**Steps to Use the Model:**

1. **Install KmerTokenizer:**

   ```sh
   pip install git+https://github.com/MsAlEhR/KmerTokenizer.git
   ```

2. **Example Code:**
```python
from KmerTokenizer import KmerTokenizer
from transformers import AutoModel
import torch

# Example gene sequence
seq = "ATTTTTTTTTTTCCCCCCCCCCCGGGGGGGGATCGATGC"

# Initialize the tokenizer (6-mers, overlapping, padded to 400 tokens)
tokenizer = KmerTokenizer(kmerlen=6, overlapping=True, maxlen=400)
tokenized_output = tokenizer.kmer_tokenize(seq)
pad_token_id = 2  # Set pad token ID

# Create attention mask (1 for real tokens, 0 for padding)
attention_mask = torch.tensor(
    [1 if token != pad_token_id else 0 for token in tokenized_output],
    dtype=torch.long,
).unsqueeze(0)

# Convert tokenized output to a LongTensor and add a batch dimension
inputs = torch.tensor([tokenized_output], dtype=torch.long)

# Load the pre-trained model
model = AutoModel.from_pretrained("MsAlEhR/MetaBERTa-bigbird-gene", output_hidden_states=True)

# Run a forward pass to generate hidden states
outputs = model(input_ids=inputs, attention_mask=attention_mask)

# Get token embeddings from the last hidden state
embeddings = outputs.hidden_states[-1]

# Expand the attention mask to match the embedding dimensions
expanded_attention_mask = attention_mask.unsqueeze(-1)

# Masked mean pooling: average token embeddings, ignoring padding positions
mean_sequence_embeddings = torch.sum(expanded_attention_mask * embeddings, dim=1) / torch.sum(expanded_attention_mask, dim=1)
```

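Each row of `mean_sequence_embeddings` is a fixed-length vector for one input sequence. As a quick usage sketch (hypothetical: `emb_a` and `emb_b` stand for pooled embeddings computed as above for two different sequences), cosine similarity gives a simple sequence-level similarity score:

```python
import torch.nn.functional as F

# Hypothetical: emb_a and emb_b are mean_sequence_embeddings computed
# as above for two sequences, each of shape [1, hidden_size].
similarity = F.cosine_similarity(emb_a, emb_b, dim=-1)
print(f"Cosine similarity: {similarity.item():.4f}")
```
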
**Citation:**
For a detailed overview of leveraging large language models for metagenomic analysis, refer to our papers:
> Refahi, M. S., Sokhansanj, B. A., & Rosen, G. L. (2023). Leveraging Large Language Models for Metagenomic Analysis. *IEEE SPMB*.
>
> Refahi, M., Sokhansanj, B. A., Mell, J. C., Brown, J., Yoo, H., Hearne, G., & Rosen, G. (2024). Scorpio: Enhancing Embeddings to Improve Downstream Analysis of DNA Sequences. *bioRxiv*.