# Hymba: A Hybrid-head Architecture for Small Language Models

[[Slide](https://docs.google.com/presentation/d/1uidqBfDy8a149yE1-AKtNnPm1qwa01hp8sOj3_KAoMI/edit#slide=id.g2f73b22dcb8_0_1017)] [Technical Report] **Note: this Hugging Face repo is still under development.**

Developed by the Deep Learning Efficiency Research (DLER) team at NVIDIA Research.


## Hymba: A Novel LM Architecture
- Fuse attention heads and SSM heads within the same layer, offering parallel and complementary processing of the same inputs (a conceptual sketch follows the figure below)

<div align="center">
<img src="https://huggingface.co./nvidia/Hymba-1.5B/resolve/main/images/module.png" alt="Hymba Module" width="600">
</div>
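
As a rough illustration of the hybrid-head idea, the toy block below runs a standard causal attention branch and a simplified diagonal-SSM branch over the same normalized input and averages their outputs. All names, dimensions, and the fusion rule are illustrative assumptions, not the released Hymba module.

```
# Conceptual sketch only -- NOT the actual Hymba implementation.
import torch
import torch.nn as nn

class ToyHybridHeadBlock(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Attention branch: high-resolution token-to-token recall
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # SSM branch: a toy per-channel linear recurrence (stand-in for a Mamba-style head)
        self.in_proj = nn.Linear(dim, dim)
        self.decay = nn.Parameter(torch.rand(dim) * 0.9)  # per-channel decay in (0, 1)
        self.out_proj = nn.Linear(dim, dim)

    def ssm_branch(self, x):
        # h_t = decay * h_{t-1} + x_t, scanned over the sequence dimension
        u = self.in_proj(x)
        h = torch.zeros_like(u[:, 0])
        states = []
        for t in range(u.size(1)):
            h = self.decay * h + u[:, t]
            states.append(h)
        return self.out_proj(torch.stack(states, dim=1))

    def forward(self, x):
        z = self.norm(x)
        causal = torch.triu(torch.ones(z.size(1), z.size(1), dtype=torch.bool), diagonal=1)
        attn_out, _ = self.attn(z, z, z, attn_mask=causal)
        ssm_out = self.ssm_branch(z)
        # Fuse the parallel heads' outputs (simple averaging here)
        return x + 0.5 * (attn_out + ssm_out)

x = torch.randn(2, 16, 64)              # (batch, sequence, dim)
print(ToyHybridHeadBlock(64)(x).shape)  # torch.Size([2, 16, 64])
```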

- Introduce meta tokens that are prepended to the input sequences and interact with all subsequent tokens, thus storing important information and alleviating the "forced-to-attend" burden on attention

- Integrate cross-layer KV sharing and global-local attention to further boost memory and compute efficiency (a schematic sketch follows the figure below)

<div align="center">
<img src="https://huggingface.co./nvidia/Hymba-1.5B/resolve/main/images/macro_arch.png" alt="Hymba Model" width="600">
</div>
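
The snippet below sketches the macro-level recipe: learnable meta tokens are prepended to every input sequence, and layers follow a hypothetical global/local pattern in which grouped layers share one KV cache. The token count, layer plan, and grouping are made-up values for illustration, not the released Hymba-1.5B configuration.

```
# Illustrative only -- token count, layer plan, and KV grouping are assumptions.
import torch
import torch.nn as nn

num_meta_tokens, dim = 128, 64
meta_tokens = nn.Parameter(torch.randn(num_meta_tokens, dim))  # learned, input-independent

def prepend_meta(token_embeddings):
    # (batch, seq_len, dim) -> (batch, num_meta_tokens + seq_len, dim)
    batch = token_embeddings.size(0)
    return torch.cat([meta_tokens.expand(batch, -1, -1), token_embeddings], dim=1)

# Hypothetical layer plan: "global" = full attention, "local" = sliding-window attention;
# layers listed in the same group read and write a single shared KV cache.
layer_plan = [
    ("global", [0]),        # first layer: global attention, its own KV cache
    ("local",  [1, 2]),     # layers 1-2: local attention, one shared KV cache
    ("local",  [3, 4]),
    ("global", [5]),        # a middle global-attention layer
    ("local",  [6, 7]),
    ("local",  [8, 9]),
    ("global", [10]),       # final layer: global attention
]

x = prepend_meta(torch.randn(2, 32, dim))
print(x.shape)              # torch.Size([2, 160, 64])
```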


## Hymba: Performance Highlights
- Our Hymba-1.5B-Base outperforms all public models under 2B parameters; for example, it matches the commonsense reasoning accuracy of Llama 3.2 3B while being 3.49× faster and shrinking the cache size by 11.7×
- More comparisons can be found in our [Technical Report].

<div align="center">
<img src="https://huggingface.co./nvidia/Hymba-1.5B/resolve/main/images/performance1.png" alt="Compare with SoTA Small LMs" width="600">
</div>

<div align="center">
<img src="https://huggingface.co./nvidia/Hymba-1.5B/resolve/main/images/performance2.png" alt="Compare with SoTA Small LMs" width="600">
</div>


## Hymba-1.5B: Model Usage

We release our Hymba-1.5B-Base model and provide instructions for using it below.

### Step 1: Environment Setup

Since our model employs [FlexAttention](https://pytorch.org/blog/flexattention/), which relies on PyTorch 2.5 and related dependencies, we provide three ways to set up the environment:

- **[Pip]** Install the required packages using our provided `requirements.txt`:
```
pip install -r https://huggingface.co./nvidia/Hymba-1.5B/resolve/main/requirements.txt
```

- **[Docker]** We have prepared a Docker image with all of Hymba's dependencies installed. You can download the image and start a container using the following commands:

```
wget http://10.137.9.244:8000/hymba_docker.tar
docker load -i hymba_docker.tar
docker run --security-opt seccomp=unconfined --gpus all -v /home/$USER:/home/$USER -it hymba:v1 bash
```

- **[Internal Only]** If you are an internal user from NVIDIA and are using the ORD cluster, you can use our prepared `sqsh` file to apply for an interactive node:

   ```
   srun -A nvr_lpr_llm --partition interactive --time 4:00:00 --gpus 8 --container-image /lustre/fsw/portfolios/nvr/users/yongganf/docker/megatron_py25.sqsh --container-mounts=$HOME:/home,/lustre:/lustre  --pty bash
   ```
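
Whichever option you use, a quick sanity check that FlexAttention is importable (assuming PyTorch ≥ 2.5, where it lives under `torch.nn.attention.flex_attention`):

```
import torch
from torch.nn.attention.flex_attention import flex_attention  # available in PyTorch >= 2.5

print(torch.__version__)          # expect 2.5 or later
print(torch.cuda.is_available())  # the model below is loaded onto a GPU
```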

### Step 2: Chat with Hymba
After setting up the environment, you can use the following script to chat with our model:

```
from transformers import LlamaTokenizer, AutoModelForCausalLM
from huggingface_hub import login
import torch

# Log in to Hugging Face (needed to download the gated Llama 2 tokenizer)
login()

# Load Llama 2's tokenizer
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b")

# Load Hymba-1.5B in bfloat16 on the GPU
model = AutoModelForCausalLM.from_pretrained("nvidia/Hymba-1.5B", trust_remote_code=True).cuda().to(torch.bfloat16)

# Generate a greedy (deterministic) response to a single prompt
def chat_with_model(prompt, model, tokenizer, max_length=64):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(inputs.input_ids, max_length=max_length, do_sample=False, use_cache=True)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

print("Chat with the model (type 'exit' to quit):")
while True:
    print("User:")
    prompt = input()
    if prompt.lower() == "exit":
        break

    # Get the model's response
    response = chat_with_model(prompt, model, tokenizer)

    print(f"Model: {response}")

```
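
For a single completion without the interactive loop, the same `model` and `tokenizer` objects can be reused. Note that `max_new_tokens` bounds only the generated continuation, whereas `max_length` above also counts the prompt tokens; the prompt here is just an example.

```
# One-shot completion, reusing the `model` and `tokenizer` loaded above
prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
outputs = model.generate(inputs.input_ids, max_new_tokens=32, do_sample=False, use_cache=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```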