# Hymba: A Hybrid-head Architecture for Small Language Models
[Slide] [Technical Report]

**Note:** This Hugging Face repo is still under development.
Developed by the Deep Learning Efficiency Research (DLER) team at NVIDIA Research.
## Hymba: A Novel LM Architecture
- Fuses attention heads and SSM heads within the same layer, offering parallel and complementary processing of the same inputs.
![Hymba Module](https://huggingface.co./nvidia/Hymba-1.5B/resolve/main/images/module.png)
- Introduces meta tokens that are prepended to the input sequence and interact with all subsequent tokens, storing important information and alleviating the "forced-to-attend" burden in attention.
- Integrates cross-layer KV sharing and global-local attention to further boost memory and computation efficiency (a toy sketch of these ideas follows the figure below).
![Hymba Model](https://huggingface.co./nvidia/Hymba-1.5B/resolve/main/images/macro_arch.png)
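To make the hybrid-head and meta-token ideas concrete, here is a minimal, self-contained PyTorch sketch of a block that prepends learnable meta tokens and runs attention heads and an SSM-style path in parallel over the same input before fusing their outputs. It is illustrative only: the dimensions, the GRU stand-in for the SSM mixer, and the simple additive fusion are assumptions for the sketch, not the released Hymba implementation (see the [Technical Report] for the exact design).

```python
import torch
import torch.nn as nn

class HybridHeadBlockSketch(nn.Module):
    """Illustrative only: parallel attention + SSM-style paths over the same input."""

    def __init__(self, dim=512, n_heads=8, n_meta_tokens=16):
        super().__init__()
        # Learnable meta tokens prepended to every input sequence
        self.meta_tokens = nn.Parameter(torch.randn(1, n_meta_tokens, dim) * 0.02)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)  # attention heads
        self.ssm = nn.GRU(dim, dim, batch_first=True)  # stand-in for the SSM (Mamba-style) heads
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (batch, seq_len, dim)
        b = x.size(0)
        # Prepend meta tokens so all subsequent tokens can interact with them
        x = torch.cat([self.meta_tokens.expand(b, -1, -1), x], dim=1)
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)  # attention path
        ssm_out, _ = self.ssm(h)                               # recurrent/SSM path over the same tokens
        # Fuse the two parallel, complementary views of the sequence
        return x + self.out_proj(attn_out + ssm_out)

# Example: a batch of 2 sequences of length 10
y = HybridHeadBlockSketch()(torch.randn(2, 10, 512))
print(y.shape)  # torch.Size([2, 26, 512]) -- 16 meta tokens + 10 input tokens
```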
## Hymba: Performance Highlights
- Our Hymba-1.5B-Base outperforms all sub-2B public models; for example, it matches Llama 3.2 3B's commonsense reasoning accuracy while being 3.49× faster and reducing cache size by 11.7×.
- More comparisons can be found in our [Technical Report].
![Compare with SoTA Small LMs](https://huggingface.co./nvidia/Hymba-1.5B/resolve/main/images/performance1.png)
![Compare with SoTA Small LMs](https://huggingface.co./nvidia/Hymba-1.5B/resolve/main/images/performance2.png)
## Hymba-1.5B: Model Usage
We release our Hymba-1.5B-Base model and provide the following instructions for using it.
### Step 1: Environment Setup
Since our model employs FlexAttention, which relies on PyTorch 2.5 and related dependencies, we provide three ways to set up the environment:
- [Pip] Install the required packages using our provided `requirements.txt`:

  ```bash
  pip install -r https://huggingface.co./nvidia/Hymba-1.5B/resolve/main/requirements.txt
  ```
- [Docker] We have prepared a Docker image with all of Hymba's dependencies installed. You can download the image and start a container using the following commands:

  ```bash
  wget http://10.137.9.244:8000/hymba_docker.tar
  docker load -i hymba_docker.tar
  docker run --security-opt seccomp=unconfined --gpus all -v /home/$USER:/home/$USER -it hymba:v1 bash
  ```
- [Internal Only] If you are an internal user from NVIDIA and are using the ORD cluster, you can use our prepared `sqsh` file to apply for an interactive node:

  ```bash
  srun -A nvr_lpr_llm --partition interactive --time 4:00:00 --gpus 8 --container-image /lustre/fsw/portfolios/nvr/users/yongganf/docker/megatron_py25.sqsh --container-mounts=$HOME:/home,/lustre:/lustre --pty bash
  ```
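Whichever option you use, you can sanity-check that the environment supports FlexAttention with a short snippet like the one below. The version check and the `torch.nn.attention.flex_attention` import path correspond to PyTorch 2.5; this is a quick, assumption-based check rather than an official verification script.

```python
import torch

# FlexAttention requires PyTorch >= 2.5
print("PyTorch version:", torch.__version__)
assert tuple(int(v) for v in torch.__version__.split("+")[0].split(".")[:2]) >= (2, 5), \
    "Hymba's FlexAttention path needs PyTorch 2.5 or newer"

# FlexAttention entry point shipped with PyTorch 2.5
from torch.nn.attention.flex_attention import flex_attention
print("FlexAttention is available:", callable(flex_attention))

print("CUDA available:", torch.cuda.is_available())
```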
### Step 2: Chat with Hymba
After setting up the environment, you can use the following script to chat with our model:
```python
from transformers import LlamaTokenizer, AutoModelForCausalLM
from huggingface_hub import login
import torch

login()

# Load LLaMA-2's tokenizer
tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b")

# Load Hymba-1.5B
model = AutoModelForCausalLM.from_pretrained("nvidia/Hymba-1.5B", trust_remote_code=True).cuda().to(torch.bfloat16)

# Chat with our model
def chat_with_model(prompt, model, tokenizer, max_length=64):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    # Greedy decoding; temperature is ignored when do_sample=False, so it is omitted
    outputs = model.generate(inputs.input_ids, max_length=max_length, do_sample=False, use_cache=True)
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response

print("Chat with the model (type 'exit' to quit):")
while True:
    print("User:")
    prompt = input()
    if prompt.lower() == "exit":
        break

    # Get the model's response
    response = chat_with_model(prompt, model, tokenizer)
    print(f"Model: {response}")
```