---
license: mit
pipeline_tag: mask-generation
tags:
- biology
- metagenomics
- Roberta
---
### Leveraging Large Language Models for Metagenomic Analysis

**Model Overview:**
This model builds on the RoBERTa architecture, following the approach described in our paper "Leveraging Large Language Models for Metagenomic Analysis." It was trained for one epoch on V100 GPUs.

**Model Architecture:**
- **Base Model:** RoBERTa transformer architecture
- **Tokenizer:** Custom K-mer tokenizer with a k-mer length of 6 and overlapping tokens (see the sketch after this list)
- **Training:** Trained on a diverse dataset of 220 million 400 bp fragments from 18k genomes (Bacteria and Archaea)
- **Embeddings:** Generates sequence embeddings by mean pooling of the hidden states
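
For intuition about the tokenizer, the snippet below shows what overlapping 6-mer splitting looks like before token IDs are assigned. The `overlapping_kmers` helper is only an illustration of the concept, not part of the KmerTokenizer API; the real tokenizer also maps each k-mer to an integer ID and pads or truncates to `maxlen`.

```python
# Conceptual sketch only: overlapping k-mer splitting (stride 1).
# The actual KmerTokenizer additionally maps each k-mer to an integer ID
# and pads/truncates the token sequence to maxlen.
def overlapping_kmers(seq: str, k: int = 6) -> list[str]:
    """Return all overlapping k-mers of a DNA sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

print(overlapping_kmers("ATCGATCGA"))
# ['ATCGAT', 'TCGATC', 'CGATCG', 'GATCGA']
```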

**Steps to Use the Model:**

1. **Install KmerTokenizer:**

```sh
pip install git+https://github.com/MsAlEhR/KmerTokenizer.git
```

2. **Example Code:**
```python
from KmerTokenizer import KmerTokenizer
from transformers import AutoModel
import torch

# Example gene sequence
seq = "ATTTTTTTTTTTCCCCCCCCCCCGGGGGGGGATCGATGC"

# Initialize the tokenizer (overlapping 6-mers, fixed length of 400)
tokenizer = KmerTokenizer(kmerlen=6, overlapping=True, maxlen=400)
tokenized_output = tokenizer.kmer_tokenize(seq)
pad_token_id = 2  # Set pad token ID

# Create attention mask (1 for tokens, 0 for padding) and add batch dimension
attention_mask = torch.tensor([1 if token != pad_token_id else 0 for token in tokenized_output], dtype=torch.long).unsqueeze(0)

# Convert tokenized output to LongTensor and add batch dimension
inputs = torch.tensor([tokenized_output], dtype=torch.long)

# Load the pre-trained BigBird model with hidden states exposed
model = AutoModel.from_pretrained("MsAlEhR/MetaBERTa-bigbird-gene", output_hidden_states=True)

# Generate hidden states
outputs = model(input_ids=inputs, attention_mask=attention_mask)

# Get embeddings from the last hidden state
embeddings = outputs.hidden_states[-1]

# Expand attention mask to match the embedding dimensions
expanded_attention_mask = attention_mask.unsqueeze(-1)

# Compute mean sequence embeddings over non-padding positions
mean_sequence_embeddings = torch.sum(expanded_attention_mask * embeddings, dim=1) / torch.sum(expanded_attention_mask, dim=1)
```
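
As a usage sketch, the embeddings produced above can be compared directly between sequences. The snippet below wraps the same steps into a helper and scores two sequences with cosine similarity; it reuses `tokenizer`, `model`, `pad_token_id`, and `seq` from the example, and the `embed` function and second sequence are illustrative additions, not part of the model card.

```python
import torch.nn.functional as F

def embed(sequence: str) -> torch.Tensor:
    """Mean-pooled embedding for one sequence, reusing the tokenizer and model above."""
    ids = tokenizer.kmer_tokenize(sequence)
    mask = torch.tensor([1 if t != pad_token_id else 0 for t in ids], dtype=torch.long).unsqueeze(0)
    input_ids = torch.tensor([ids], dtype=torch.long)
    with torch.no_grad():
        hidden = model(input_ids=input_ids, attention_mask=mask).hidden_states[-1]
    mask = mask.unsqueeze(-1)
    return torch.sum(mask * hidden, dim=1) / torch.sum(mask, dim=1)

# Compare two sequences by cosine similarity of their embeddings
seq2 = "ATTTTTTTTTTTCCCCCCCCCCCGGGGGGGGATCGATGG"  # hypothetical second sequence
similarity = F.cosine_similarity(embed(seq), embed(seq2))
print(similarity.item())
```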

**Citation:**
For a detailed overview of leveraging large language models for metagenomic analysis, refer to our papers:
> Refahi, M.S., Sokhansanj, B.A., & Rosen, G.L. (2023). Leveraging Large Language Models for Metagenomic Analysis. *IEEE SPMB*.
>
> Refahi, M., Sokhansanj, B.A., Mell, J.C., Brown, J., Yoo, H., Hearne, G., & Rosen, G. (2024). Scorpio: Enhancing Embeddings to Improve Downstream Analysis of DNA Sequences. *bioRxiv*, 2024-07.