--- license: apache-2.0 datasets: - HuggingFaceH4/ultrachat_200k - BAAI/Infinity-Instruct - HuggingFaceH4/ultrafeedback_binarized - Intel/orca_dpo_pairs - argilla/OpenHermesPreferences base_model: - Zyphra/Zamba2-1.2B library_name: transformers --- # Model Card for Zamba2-1.2B Zamba2-1.2B-instruct is obtained from Zamba2-1.2B by fine-tuning on instruction-following and chat datasets. Specifically: 1. SFT of the base [Zamba2-1.2B](https://huggingface.co./Zyphra/Zamba2-1.2B) model on [ultrachat_200k](HuggingFaceH4/ultrachat_200k) and [Infinity-Instruct](https://huggingface.co./datasets/BAAI/Infinity-Instruct) 2. DPO of the SFT checkpoint on [ultrafeedback_binarized](https://huggingface.co./datasets/HuggingFaceH4/ultrafeedback_binarized), [orca_dpo_pairs](https://huggingface.co./datasets/Intel/orca_dpo_pairs), and [OpenHermesPreferences](https://huggingface.co./datasets/argilla/OpenHermesPreferences) Zamba2-1.2B-Instruct is a hybrid model composed of state-space ([Mamba2](https://github.com/state-spaces/mamba)) and transformer blocks. ## Quick start ### Prerequisites To download Zamba2-1.2B, clone Zyphra's fork of transformers: 1. `git clone https://github.com/Zyphra/transformers_zamba2.git` 2. `cd transformers_zamba2` 3. Install the repository: `pip install -e .` 4. `pip install accelerate` You can run the model without using the optimized Mamba2 kernels, but it is **not** recommended as it will result in significantly higher latency and memory usage. To run on CPU, please specify `use_mamba_kernels=False` when loading the model using ``AutoModelForCausalLM.from_pretrained``. ### Inference ```python from transformers import AutoTokenizer, AutoModelForCausalLM import torch # Instantiate model and tokenizer tokenizer = AutoTokenizer.from_pretrained("Zyphra/Zamba2-1.2B-instruct") model = AutoModelForCausalLM.from_pretrained("Zyphra/Zamba2-1.2B-instruct", device_map="cuda", torch_dtype=torch.bfloat16) # Format the input as a chat template prompt = "What factors contributed to the fall of the Roman Empire?" sample = [{'role': 'user', 'content': prompt}] chat_sample = tokenizer.apply_chat_template(sample, tokenize=False) # Tokenize input and generate output input_ids = tokenizer(chat_sample, return_tensors='pt', add_special_tokens=False).to("cuda") outputs = model.generate(**input_ids, max_new_tokens=150, return_dict_in_generate=False, output_scores=False, use_cache=True, num_beams=1, do_sample=False) print((tokenizer.decode(outputs[0]))) ``` ## Performance Zamba2-1.2B-Instruct achieves leading instruction-following and multi-turn chat performance for a model of its size and matches strong models significantly larger. For instance, Zamba2-1.2B-Instruct outperforms Gemma2-2B-Instruct, a very strong model over 2x its size.