---
license: apache-2.0
library_name: transformers
tags:
- storm
- mistral
- openchat
- RLAIF
- reward model
language:
- en
base_model: openchat/openchat-3.5-0106
datasets:
- berkeley-nest/Nectar
---

# Storm-7B

- **Developed by**: [Jie Liu](https://jieliu.site/) \\(^{*1,2}\\), [Zhanhui Zhou](https://scholar.google.com/citations?user=SbACfYQAAAAJ&hl=zh-CN) \\(^{*2}\\), [Jiaheng Liu](https://liujiaheng.github.io/) \\(^{2}\\), [Xingyuan Bu](https://scholar.google.com.hk/citations?user=cqYaRhUAAAAJ&hl=zh-CN) \\(^{2}\\), [Chao Yang](https://scholar.google.com/citations?user=5KRbHPMAAAAJ&hl=zh-CN) \\(^{2}\\), [Han-Sen Zhong](https://scholar.google.com.hk/citations?user=X_ZfX8sAAAAJ&hl=zh-CN) \\(^{\dag 2}\\), [Wanli Ouyang](https://wlouyang.github.io/) \\(^{1,2}\\).
- \\(^{1}\\)MMLab, The Chinese University of Hong Kong \\(^{2}\\)Shanghai AI Laboratory
- Paper: [Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level](https://arxiv.org/pdf/2406.11817)
- Finetuned from the model: [openchat-3.5-0106](https://huggingface.co./openchat/openchat-3.5-0106)
- Dataset: [berkeley-nest/Nectar](https://huggingface.co./datasets/berkeley-nest/Nectar)
- Reward Model: [Starling-RM-34B](https://huggingface.co./Nexusflow/Starling-RM-34B)

Please see our paper for more details.

## Introduction

We released Storm-7B, the first open-source language model comparable to the GPT-4 series on the [AlpacaEval 2.0](https://tatsu-lab.github.io/alpaca_eval/) leaderboard.

Recent studies show that DPO benefits from iterative training with online preferences labeled by a trained reward model. In this work, we identify a pitfall of vanilla iterative DPO - improved response quality can lead to increased verbosity. To address this, we introduce iterative length-regularized DPO (iLR-DPO) to penalize response length. Our empirical results show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 **without increasing verbosity**.
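To make the idea of a length penalty concrete, here is a minimal sketch of a per-pair length-regularized DPO loss. It assumes the penalty enters as a margin term `alpha * (len_w - len_l)` inside the sigmoid, which is one common way to write length-regularized DPO; the exact objective used for Storm-7B is given in the paper. The function name, argument names, and the default values of `beta` and `alpha` are illustrative assumptions, not taken from this model card.

```python
import math


def lr_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                len_w, len_l, beta=0.1, alpha=0.01):
    """Per-pair length-regularized DPO loss (illustrative sketch).

    logp_* / ref_logp_*: sequence log-probabilities of the chosen (w) and
    rejected (l) responses under the policy and the reference model.
    len_*: response lengths in tokens.
    beta, alpha: hypothetical default values, chosen only for illustration.
    """
    # Standard DPO margin: difference of the implicit rewards.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Assumed length term: a longer chosen response already receives margin,
    # so the loss pushes the policy less strongly toward it.
    margin += alpha * (len_w - len_l)
    # Numerically stable -log(sigmoid(margin)).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

With `alpha = 0` this reduces to the vanilla DPO loss; with `alpha > 0`, a pair whose chosen response is much longer than the rejected one contributes a smaller gradient, which is the sense in which length is penalized in this sketch.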
## Performance

Our 7B model achieves a **50.5%** length-controlled win rate against GPT-4 Preview on AlpacaEval 2.0.
Our model's LC win rate improves over iterations without significantly changing the response length, indicating better alignment with human values without length bias. The final trained model (iteration 3) achieves a 50.5% LC win rate, making it the first open-source model to surpass the baseline model GPT-4 Preview. In addition to regular decoding, we also test beam search and best-of-n sampling on top of our trained model. Beam search over our trained model shows a 5% improvement over regular decoding. Best-of-n sampling with Starling-RM-34B achieves a 61.6% LC win rate and outperforms GPT-4 Omni.
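Best-of-n sampling can be sketched as follows: draw `n` candidate responses, score each with a reward model, and keep the highest-scoring one. The `generate` and `score` callables are hypothetical stand-ins for sampling from the policy and for scoring with a reward model such as Starling-RM-34B; this is an illustration of the general technique under those assumptions, not the evaluation code used for the reported numbers.

```python
from typing import Callable


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 4) -> str:
    """Return the highest-scoring response among n sampled candidates.

    generate(prompt) -> response: hypothetical sampler (e.g. the policy model).
    score(prompt, response) -> float: hypothetical reward-model scorer.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Sample n independent candidates for the same prompt.
    candidates = [generate(prompt) for _ in range(n)]
    # Keep the candidate the scorer rates highest.
    return max(candidates, key=lambda response: score(prompt, response))
```

In practice `generate` would wrap a sampling call to the model with `do_sample=True`, and `score` would wrap a forward pass of the reward model; larger `n` trades extra compute for a higher expected reward.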
We observe no significant degradation in traditional NLP tasks from the Hugging Face Open LLM Leaderboard.
## Uses

Our model uses the same chat template as [Openchat-3.5-0106](https://huggingface.co./openchat/openchat-3.5-0106). A sample code snippet for inference using our model is provided below.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"
model = AutoModelForCausalLM.from_pretrained("jieliu/Storm-7B").to(device)
tokenizer = AutoTokenizer.from_pretrained("jieliu/Storm-7B")
# Inference only: switch to eval mode and disable gradient tracking.
model.eval().requires_grad_(False)

def generate_response(prompt):
    # Tokenize the already-templated prompt and move it to the model's device.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    outputs = model.generate(
        input_ids,
        max_length=2048,
        do_sample=True,
        temperature=1.0,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    # outputs[0] holds the full sequence (prompt followed by generated tokens).
    response_ids = outputs[0]
    response_text = tokenizer.decode(response_ids, skip_special_tokens=True)
    return response_text

prompt = "How does a telescope work?"
# Wrap the user message in the OpenChat-3.5 chat template.
input_prompt = f"GPT4 Correct User: {prompt}<|end_of_turn|>GPT4 Correct Assistant:"
response_text = generate_response(input_prompt)
print("Response:", response_text)
```

## Scripts

You can reproduce our results on AlpacaEval 2.0 using the script provided below.

```bash
git clone https://github.com/tatsu-lab/alpaca_eval.git
cd alpaca_eval
pip install -e .
export OPENAI_API_KEY=