---
license: mit
language:
- en
tags:
- biology
- transformers
- bio-transformers
- bert
- bio-bert
- enigma
- bio-enigma
- transfomer-model
- mixture-of-experts
---
# enigma-1.5b
## Model Details
This is a transformer-based model trained on DNA sequence data and capable of generating new DNA sequences. It is a 2.5-billion-parameter model trained until convergence.
A second, BERT-based model with 47 million parameters is also included and can likewise generate new sequences.
### Model Description
- **Developed by:** [Shivendra Singh](https://twitter.com/shivendrra_)
- **License:** MIT
### Model Sources
- **Repository:** [github/enigma-1.5b](https://github.com/shivendrra/enigma-1.5b)
- **Papers**: [Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision](https://arxiv.org/html/2311.02333v2#bib.bib35)
## Uses
The model can generate new DNA sequences from a given input of tokens, or serve as a base for further research. It is still very basic. Planned additions include DNA classification, masked-token generation, and possibly a mixture-of-experts (MoE) technique.
### Direct Use
Load the model and use it to generate new sequences, with `max_length=512` for the 2.5b model and `256` for the enbert-47m model.
## Bias, Risks, and Limitations
This model was trained on only ~500 MB of DNA data, tokenized at the per-character level rather than the sub-word or sequence level used in language models. This gives it finer precision, but its capability is limited by the small amount of training.
I wasn't able to train it on other datasets for better generalization because of technical limits: a lack of GPUs and good hardware.
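To make the per-character tokenization concrete, here is a minimal sketch. The vocabulary (`A`, `C`, `G`, `T`, `N`) and the `encode`/`decode` helpers are assumptions for illustration, not the tokenizer shipped with the model.

```python
# Hypothetical character-level tokenizer; the vocabulary below is an
# assumption, not the one used to train the model.
VOCAB = ["A", "C", "G", "T", "N"]  # the four bases plus "N" for unknown bases
stoi = {ch: i for i, ch in enumerate(VOCAB)}
itos = {i: ch for ch, i in stoi.items()}

def encode(seq: str) -> list[int]:
    """Map each base to an integer id, one token per character."""
    return [stoi[ch] for ch in seq.upper()]

def decode(ids: list[int]) -> str:
    """Map integer ids back to a DNA string."""
    return "".join(itos[i] for i in ids)

tokens = encode("gattaca")
print(tokens)          # [2, 0, 3, 3, 0, 1, 0]
print(decode(tokens))  # GATTACA
```

Because every base is its own token, a 512-token context covers exactly 512 bases.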
## How to Get Started with the Model
Use the code below to get started with the model.
```python
# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("Shivendrra/enigma-1.5b")

# Or generate from the model class defined in the repository
import torch
import torch.nn.functional as F
from model import Transformer

# vocab_size must be defined beforehand (size of the tokenizer's vocabulary)
model = Transformer(vocab_size=vocab_size)

class Generator(Transformer):
    def __init__(self, vocab_size):
        super().__init__(vocab_size=vocab_size)
        self.vocab_size = vocab_size
        self.block_size = Transformer.block_size

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=0):
        # idx: (1, T) tensor of token ids; .item() below assumes batch size 1
        generated_tokens = []
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -self.block_size:]  # crop context to the last block_size tokens
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :]             # keep only the last position
            scaled_logits = logits / temperature
            if top_k > 0:
                scaled_logits = self._top_k_filtering(scaled_logits, top_k)
            probs = F.softmax(scaled_logits, dim=-1)
            sampled_idx = torch.multinomial(probs, num_samples=1)
            generated_tokens.append(sampled_idx.item())
            idx = torch.cat((idx, sampled_idx), dim=1)
        return generated_tokens

    def _top_k_filtering(self, logits, top_k):
        # mask every logit below the k-th largest value with -inf
        values, indices = torch.topk(logits, top_k, dim=-1)
        min_value = values[:, -1].unsqueeze(-1).expand_as(logits)
        filtered_logits = torch.where(logits < min_value, torch.ones_like(logits) * -float('inf'), logits)
        return filtered_logits
```
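The sampling step inside `generate()` can be checked on a toy vocabulary without loading the model. The sketch below re-implements temperature scaling and top-k filtering in plain Python; the function names and the logit values are made up for illustration and are not part of the repository.

```python
import math
import random

def top_k_filter(logits, top_k):
    """Keep the top_k largest logits and set the rest to -inf."""
    if top_k <= 0 or top_k >= len(logits):
        return list(logits)
    threshold = sorted(logits, reverse=True)[top_k - 1]
    return [x if x >= threshold else -math.inf for x in logits]

def softmax(logits):
    """Numerically stable softmax; -inf entries get probability 0."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]   # toy scores for a 4-token vocabulary
temperature = 0.8
scaled = [x / temperature for x in logits]          # same order as generate()
probs = softmax(top_k_filter(scaled, top_k=2))
next_token = random.choices(range(len(probs)), weights=probs, k=1)[0]
print(probs, next_token)
```

With `top_k=2`, only the two highest-scoring tokens keep a non-zero probability, so `next_token` is always 0 or 1. Lowering `temperature` shifts more of the remaining probability toward the top token.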
## Training Details
### Training Data
Data was taken from this dataset: [human_ref_dna](https://huggingface.co/datasets/samchain/human_ref_dna)
Consolidated 8 files (~500 MB) into one large dataset. I've also uploaded the data used for training.
### Training Procedure
These models were each trained for 3k-4k iterations on ~500 million letters of DNA, roughly 500 MB of data. Final losses were ~0.02 for the 47-million-parameter model and ~0.003 for the 2.5-billion-parameter model. I had saved much more data than this, but couldn't train on it because of technical limitations.
Try to train it yourself if possible. The `enigma/TrainEnigma` file contains all the functions needed to train it, whether from scratch or for further pre-training.
#### Functions:
Training uses a basic procedure. `get_batch()` generates batches of data, `estimate_loss()` estimates the losses, and the `train()` function acts as the master function, calling the others at every iteration or at set intervals.
```python
import torch
from model import Transformer

# train_data, val_data, vocab_size, device and the hyperparameters
# (batch_size, block_size, max_iters, ...) are assumed to be defined beforehand

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)
    return x, y

@torch.no_grad()
def estimate_loss():
    # average the loss over eval_iters batches for each split
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

model = Transformer(vocab_size=vocab_size)
m = model.to(device)
n_param = sum(p.numel() for p in m.parameters())/1e6
print(f"{n_param:.2f} million")
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

steps = []
train_losses = []
val_losses = []

for iter in range(max_iters):
    # evaluate every eval_interval iterations and on the final iteration
    if iter % eval_interval == 0 or iter == max_iters - 1:
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
        steps.append(iter)
        train_losses.append(losses['train'])
        val_losses.append(losses['val'])

    xb, yb = get_batch('train')
    logits, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()

torch.save(model.state_dict(), f'enigma_{n_param:.0f}m.pth')
```
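The one-position shift between `x` and `y` in `get_batch()` is what makes this a next-token objective. A toy illustration in plain Python, with made-up token ids and a small `block_size`:

```python
block_size = 4                     # toy value; the config uses 512
data = [2, 0, 3, 3, 0, 1, 0, 2]    # toy token ids
i = 1                              # a start offset, as torch.randint would draw

x = data[i : i + block_size]           # inputs
y = data[i + 1 : i + block_size + 1]   # targets: the same window shifted by one

# At position t the model sees x[:t+1] and is trained to predict y[t].
for t in range(block_size):
    print(f"context {x[:t + 1]} -> target {y[t]}")
```

Each window of `block_size` tokens therefore yields `block_size` training examples, one per context length.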
#### Training Hyperparameters
Configurations are saved in the `enigma/config-enigma.json` file. The values below are for the 2.5b model.
```json
{
"batch_size": 10,
"block_size": 512,
"max_iters": 5000,
"eval_interval": 50,
"learning_rate": 3e-5,
"eval_iters": 100,
"d_model": 384,
"n_head": 12,
"n_layer": 12,
"dropout": 0.2,
"norm_eps": 1e-5
}
```
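A few quantities follow directly from these values. The sketch below parses the same JSON and derives them. It assumes `d_model` is split evenly across heads, as in standard multi-head attention; the repository code may differ.

```python
import json

config = json.loads("""
{
  "batch_size": 10, "block_size": 512, "max_iters": 5000,
  "eval_interval": 50, "learning_rate": 3e-5, "eval_iters": 100,
  "d_model": 384, "n_head": 12, "n_layer": 12,
  "dropout": 0.2, "norm_eps": 1e-5
}
""")

# Assumption: each head gets an equal slice of d_model.
head_size = config["d_model"] // config["n_head"]

# Tokens processed per optimizer step and over a full run.
tokens_per_step = config["batch_size"] * config["block_size"]
total_tokens = tokens_per_step * config["max_iters"]

print(head_size)        # 32
print(tokens_per_step)  # 5120
print(total_tokens)     # 25600000
```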
### Model Architecture and Objective
EnBERT is a 47-million-parameter model. It follows the BERT architecture and adds one masked self-attention layer to predict next tokens.
Enigma-2.5b is a transformer model with a fairly complex architecture.
![architecture](https://github.com/shivendrra/enigma-1.5b/blob/main/architecture.png)
#### Encoder Part:
---
It consists of two different layers, each followed by its own normalization and dropout layers. Input embeddings, along with positional embeddings, are fed to the encoder block:
##### Self Attention:
- Each head of the self-attention layer is similar to the one used in `grokAI`: the Key and Query matrices have biases, whereas the Value matrix does not.
- After `torch.matmul()` is applied to Key and Query, relative positional embeddings are applied to the attention matrix.
- The attention and Value matrices are then multiplied using `torch.matmul()`.
- The multi-head attention layer then concatenates all the head outputs and passes them through a linear layer.
##### FeedForward:
- Normalized outputs are then passed to a position-wise `feedforward` layer with an `expansion_factor` of 5.
- It uses two linear layers, one for input and one for output, with GELU as the activation function.
- Finally, dropout is applied. The resulting outputs carry global contextual information about the input tokens.
#### Decoder Part:
---
Consists of three different layers:
##### Masked Attention:
- This layer is similar to the self-attention implemented in the encoder, except that a triangular mask prevents each token from attending to the tokens that come after it.
- Everything else is the same: relative positional embeddings are applied in the same way, but to the masked attention matrix this time.
- The attention and Value matrices are then multiplied using `torch.matmul()`.
- The multi-head attention layer then concatenates all the head outputs and passes them through a linear layer.
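The effect of the triangular mask can be shown with a small plain-Python sketch. This is an illustration of causal masking in general, not the repository's implementation. The `causal_softmax` name and the all-zero score matrix are made up for the example.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax over a square score matrix with a lower-triangular mask."""
    size = len(scores)
    weights = []
    for i in range(size):
        # Position i may attend to positions 0..i only; later ones get -inf.
        row = [scores[i][j] if j <= i else -math.inf for j in range(size)]
        peak = max(row)
        exps = [math.exp(s - peak) for s in row]   # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0] * 4 for _ in range(4)]   # toy 4x4 attention scores, all equal
attn = causal_softmax(scores)
for row in attn:
    print([round(w, 2) for w in row])
```

With equal scores, row `i` spreads its weight evenly over positions `0..i` and assigns exactly zero weight to every later position, so no information flows backward from future tokens.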
##### Self-Attention:
- The outputs of the encoder and of the masked-attention layer are added together before being passed to this layer.
- It works the same way as the encoder's unmasked attention layer: the Key, Query, and Value matrices are created using the same technique.
- Finally, all the outputs are normalized and passed to a dropout layer.
##### FeedForward:
- Normalized outputs are then passed to a position-wise `feedforward` layer with an `expansion_factor` of 5.
- It uses two linear layers, one for input and one for output, with GELU as the activation function.
- Finally, dropout is applied. The resulting outputs carry global contextual information about the input tokens.