---
license: mit
language:
- en
tags:
- biology
- transformers
- bio-transformers
- bert
- bio-bert
- enigma
- bio-enigma
- transformer-model
- mixture-of-experts
---

# enigma-1.5b


## Model Details
This is a transformer-based model trained on DNA sequence data, capable of generating new DNA sequences. It is a 2.5-billion-parameter model trained until convergence.
It is accompanied by a second, BERT-based model with 47 million parameters that can also generate new sequences.
### Model Description

- **Developed by:** [Shivendra Singh](https://twitter.com/shivendrra_)
- **License:** MIT

### Model Sources

- **Repository:** [github/enigma-1.5b](https://github.com/shivendrra/enigma-1.5b)
- **Papers**: [Understanding the Natural Language of DNA using Encoder-Decoder Foundation Models with Byte-level Precision](https://arxiv.org/html/2311.02333v2#bib.bib35)

## Uses

It can be used to generate new DNA sequences from a given input of tokens, or as a base for further research. It is very basic for now. I plan to add more functionality, including DNA classification and masked token generation, and may implement the MoE technique in the future.
### Direct Use

Load the model and use it to generate new sequences, with `max_length=512` for the 2.5b model and `256` for the enbert-47m model.

## Bias, Risks, and Limitations

This model was trained on only ~500 MB of DNA data, tokenized per character rather than at the sub-word or sequence level as in language models. This gives it finer precision, but the small amount of training limits it.
I wasn't able to train it on other datasets for better generalization because of hardware limits, mainly the lack of a good GPU.
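
To make "per-character level" concrete, here is a minimal sketch of a character-level DNA tokenizer. The vocabulary (`chars`) and the helper names `encode` and `decode` are assumptions for illustration, not necessarily what the repository uses.

```python
# Hypothetical vocabulary: newline, the four bases, and 'N' for unknown bases.
chars = ['\n', 'A', 'C', 'G', 'N', 'T']
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> token id
itos = {i: ch for ch, i in stoi.items()}       # token id -> character

def encode(seq):
    """Map a DNA string to a list of token ids, one id per character."""
    return [stoi[ch] for ch in seq.upper()]

def decode(ids):
    """Map a list of token ids back to a DNA string."""
    return ''.join(itos[i] for i in ids)

ids = encode("ACGTTGCA")
print(ids)
print(decode(ids))
```

With this scheme the vocabulary stays tiny, but every base costs one token, so a 512-token context covers only 512 bases.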

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
import torch.nn.functional as F

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("Shivendrra/enigma-1.5b")

# generate from the model
from model import Transformer
model = Transformer(vocab_size=vocab_size)  # vocab_size: number of tokens in your vocabulary

class Generator(Transformer):
    def __init__(self, vocab_size):
        super().__init__(vocab_size=vocab_size)
        self.vocab_size = vocab_size
        self.block_size = Transformer.block_size

    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=0):
        generated_tokens = []

        for _ in range(max_new_tokens):
            # crop the context to the last block_size tokens
            idx_cond = idx[:, -self.block_size:]
            logits, _ = self(idx_cond)
            # keep only the logits of the last position
            logits = logits[:, -1, :]
            scaled_logits = logits / temperature

            if top_k > 0:
                scaled_logits = self._top_k_filtering(scaled_logits, top_k)
            probs = F.softmax(scaled_logits, dim=-1)
            sampled_idx = torch.multinomial(probs, num_samples=1)
            # .item() assumes a batch size of 1
            generated_tokens.append(sampled_idx.item())
            idx = torch.cat((idx, sampled_idx), dim=1)
        return generated_tokens

    def _top_k_filtering(self, logits, top_k):
        # mask every logit smaller than the k-th largest one
        values, indices = torch.topk(logits, top_k, dim=-1)
        min_value = values[:, -1].unsqueeze(-1).expand_as(logits)
        filtered_logits = torch.where(logits < min_value, torch.ones_like(logits) * -float('inf'), logits)
        return filtered_logits
```
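
The sampling step in `generate()` can be tried on its own. The sketch below applies temperature scaling and top-k filtering to a toy logits tensor; the standalone function `top_k_filtering` and the example values are made up for illustration.

```python
import torch
import torch.nn.functional as F

def top_k_filtering(logits, top_k):
    # keep the top_k largest logits per row, set the rest to -inf
    values, _ = torch.topk(logits, top_k, dim=-1)
    min_value = values[:, -1].unsqueeze(-1)
    return torch.where(logits < min_value, torch.full_like(logits, float('-inf')), logits)

# toy logits for a 4-token vocabulary, batch size 1
logits = torch.tensor([[2.0, 0.5, 1.0, -1.0]])
temperature = 0.8  # values below 1 sharpen the distribution

filtered = top_k_filtering(logits / temperature, top_k=2)
probs = F.softmax(filtered, dim=-1)   # masked entries get probability 0
sample = torch.multinomial(probs, num_samples=1).item()

print(probs)
print(sample)
```

Only tokens 0 and 2 keep a nonzero probability here, so the sampled id is always one of those two.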

## Training Details

### Training Data

Trained on this dataset: [human_ref_data](https://huggingface.co/datasets/samchain/human_ref_dna).
I consolidated 8 ~500 MB files into one big dataset. I've also uploaded the data used for training.

### Training Procedure

Each model was trained for 3k-4k iterations on ~500 million letters of DNA, roughly 500 MB of data. Final losses were ~0.02 for the 47-million-parameter model and ~0.003 for the 2.5-billion-parameter model. I had saved much more data than this, but couldn't train on it due to technical limitations.
Try training it yourself if possible. The `enigma/TrainEnigma` file contains all the functions needed to train it from scratch or to pre-train it.
#### Functions:
This uses a basic training procedure: `get_batch()` generates batches of data, `estimate_loss()` estimates the losses, and `train()` is the master function, calling the other functions after each iteration or after a set number of iterations.

```python
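import torch

# train_data, val_data, vocab_size, device and the hyperparameters
# (block_size, batch_size, eval_iters, ...) must be defined beforehand
def get_batch(split):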
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    x, y = x.to(device), y.to(device)

    return x, y

@torch.no_grad()
def estimate_loss():
    out = {}
    model.eval()
    for split in ['train', 'val']:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            X, Y = get_batch(split)
            logits, loss = model(X, Y)
            losses[k] = loss.item()
        out[split] = losses.mean()
    model.train()
    return out

from model import Transformer
model = Transformer(vocab_size=vocab_size)
m = model.to(device)

n_param = sum(p.numel() for p in m.parameters())/1e6
print(f"{n_param:.2f} million")
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
steps = []
train_losses = []
val_losses = []

for iter in range(max_iters):
  if iter % eval_interval == 0 or iter == max_iters - 1:
    losses = estimate_loss()
    print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")
    steps.append(iter)
    train_losses.append(losses['train'])
    val_losses.append(losses['val'])

  xb, yb = get_batch('train')
  logits, loss = model(xb, yb)
  optimizer.zero_grad(set_to_none=True)
  loss.backward()
  optimizer.step()

torch.save(model.state_dict(), f'enigma_{n_param:.0f}m.pth')
```

#### Training Hyperparameters

The configuration is saved in the `enigma/config-enigma.json` file. These values are suitable for the 2.5b model.

```json
{
  "batch_size": 10,
  "block_size": 512,
  "max_iters": 5000,
  "eval_interval": 50,
  "learning_rate": 3e-5,
  "eval_iters": 100,
  "d_model": 384,
  "n_head": 12,
  "n_layer": 12,
  "dropout": 0.2,
  "norm_eps": 1e-5
}
```
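
As a quick sanity check, the values above can be parsed and combined into a couple of derived quantities. This is only a sketch: the JSON is inlined to keep it self-contained, and `head_size = d_model // n_head` assumes the usual multi-head convention, which may differ from the repository's code.

```python
import json

# Same values as the config above, inlined so the snippet runs on its own.
config = json.loads("""
{
  "batch_size": 10,
  "block_size": 512,
  "max_iters": 5000,
  "eval_interval": 50,
  "learning_rate": 3e-5,
  "eval_iters": 100,
  "d_model": 384,
  "n_head": 12,
  "n_layer": 12,
  "dropout": 0.2,
  "norm_eps": 1e-5
}
""")

# Assumption: each attention head gets an equal slice of d_model.
head_size = config["d_model"] // config["n_head"]
# Number of tokens (DNA letters) processed in one training iteration.
tokens_per_step = config["batch_size"] * config["block_size"]

print(f"head size: {head_size}")
print(f"tokens per step: {tokens_per_step}")
```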

### Model Architecture and Objective

EnBERT is a 47-million-parameter model that follows the BERT architecture, with one additional masked self-attention layer to predict next tokens.
Enigma-2.5b is an encoder-decoder transformer model with a fairly complex architecture.

![architecture](https://github.com/shivendrra/enigma-1.5b/blob/main/architecture.png)
#### Encoder Part:
---
It consists of two different layers, each followed by its own normalization and dropout layers. Input embeddings along with positional embeddings are fed to the encoder block:
##### Self Attention:
- Each self-attention head is similar to the one used in `grokAI`: the Key and Query matrices have biases, whereas the Value matrix doesn't.
- After Key and Query are multiplied using `torch.matmul()`, relational positional embeddings are applied to the attention matrix.
- Attention and value matrix are then multiplied using `torch.matmul()`.
- The multi-head attention layer then concatenates all the outputs together and passes them through a linear layer.
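
The steps above can be sketched roughly as follows. This is not the repository's code: the class names, the scalar-per-offset form of the positional term (`rel_bias`), and the `1/sqrt(head_size)` scaling are assumptions made to get a runnable example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionHead(nn.Module):
    def __init__(self, d_model, head_size, block_size, dropout=0.2):
        super().__init__()
        self.key = nn.Linear(d_model, head_size, bias=True)     # Key has a bias
        self.query = nn.Linear(d_model, head_size, bias=True)   # Query has a bias
        self.value = nn.Linear(d_model, head_size, bias=False)  # Value has none
        # Assumption: one learned scalar per relative offset in [-(T-1), T-1]
        self.rel_bias = nn.Parameter(torch.zeros(2 * block_size - 1))
        self.block_size = block_size
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        _, T, _ = x.shape  # T must not exceed block_size
        k, q, v = self.key(x), self.query(x), self.value(x)
        # attention matrix from Query and Key, shape (B, T, T)
        scores = torch.matmul(q, k.transpose(-2, -1)) * k.shape[-1] ** -0.5
        # shift offsets i - j so they index into rel_bias
        pos = torch.arange(T, device=x.device)
        offsets = pos[:, None] - pos[None, :] + self.block_size - 1
        scores = scores + self.rel_bias[offsets]
        weights = self.dropout(F.softmax(scores, dim=-1))
        return torch.matmul(weights, v)  # attention times Value

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_head, block_size):
        super().__init__()
        head_size = d_model // n_head
        self.heads = nn.ModuleList(
            [AttentionHead(d_model, head_size, block_size) for _ in range(n_head)]
        )
        self.proj = nn.Linear(n_head * head_size, d_model)

    def forward(self, x):
        # concatenate all head outputs, then apply the final linear layer
        return self.proj(torch.cat([h(x) for h in self.heads], dim=-1))

mha = MultiHeadAttention(d_model=48, n_head=4, block_size=16)
x = torch.randn(2, 8, 48)  # (batch, time, d_model)
out = mha(x)
print(out.shape)
```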

##### FeedForward:
- Normalized outputs are then passed to a position-wise `feedforward` layer with an `expansion_factor` of 5.
- It uses two linear layers, one for input and one for output, with GELU as the activation function.
- Finally, dropout is applied. The resulting outputs carry deep global contextual information about the input tokens.
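
A minimal sketch of such a feed-forward layer, assuming the GELU sits between the two linear layers and dropout comes last; the class and attribute names are illustrative, not taken from the repository.

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    def __init__(self, d_model, expansion_factor=5, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, expansion_factor * d_model),  # input layer: expand
            nn.GELU(),
            nn.Linear(expansion_factor * d_model, d_model),  # output layer: project back
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # applied independently at every position of the sequence
        return self.net(x)

ff = FeedForward(d_model=384)
x = torch.randn(2, 16, 384)  # (batch, time, d_model)
y = ff(x)
print(y.shape)
```

With `d_model=384` and an expansion factor of 5, the hidden width of the layer is 1920.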
#### Decoder Part:
---
It consists of three different layers:
##### Masked Attention:
- This layer is similar to the self-attention implemented in the encoder part, except that it has a triangular mask that prevents each token from attending to the tokens that come after it.
- The rest is the same: relational positional embeddings are applied in the same way, but to the masked attention matrix this time.
- Attention and value matrix are then multiplied using `torch.matmul()`.
- The multi-head attention layer then concatenates all the outputs together and passes them through a linear layer.
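
The effect of the triangular mask can be seen on a toy attention matrix. This sketch uses all-zero scores so the resulting weights are easy to read; it illustrates the masking idea only and is not the repository's implementation.

```python
import torch
import torch.nn.functional as F

T = 4
scores = torch.zeros(1, T, T)  # toy attention scores, shape (batch, time, time)

# lower-triangular mask: position i may attend to positions 0..i only
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
masked_scores = scores.masked_fill(~mask, float('-inf'))

# softmax turns the -inf entries into zero attention weight
weights = F.softmax(masked_scores, dim=-1)
print(weights[0])
```

Each row spreads its attention uniformly over the current and earlier positions, and everything above the diagonal is exactly zero.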
##### Self-Attention:
- Before this layer, the outputs from the encoder layer and the masked-attention layer are added together, and the sum is passed to this layer.
- It works the same way as the encoder's unmasked attention layer: the Key, Query and Value matrices are created using the same technique.
- Finally, all the outputs are normalized and passed to a dropout layer.

##### FeedForward:
- Normalized outputs are then passed to a position-wise `feedforward` layer with an `expansion_factor` of 5.
- It uses two linear layers, one for input and one for output, with GELU as the activation function.
- Finally, dropout is applied. The resulting outputs carry deep global contextual information about the input tokens.