---
license: apache-2.0
library_name: transformers
tags:
- storm
- mistral
- openchat
- RLAIF
- reward model
language:
- en
base_model: openchat/openchat-3.5-0106
datasets:
- berkeley-nest/Nectar
---

# Storm-7B

- **Developed by**: [Jie Liu](https://jieliu.site/) \\(^{*1,2}\\), [Zhanhui Zhou](https://scholar.google.com/citations?user=SbACfYQAAAAJ&hl=zh-CN) \\(^{*2}\\), [Jiaheng Liu](https://liujiaheng.github.io/) \\(^{2}\\), [Xingyuan Bu](https://scholar.google.com.hk/citations?user=cqYaRhUAAAAJ&hl=zh-CN) \\(^{2}\\), [Chao Yang](https://scholar.google.com/citations?user=5KRbHPMAAAAJ&hl=zh-CN) \\(^{2}\\), [Han-Sen Zhong](https://scholar.google.com.hk/citations?user=X_ZfX8sAAAAJ&hl=zh-CN) \\(^{\dag 2}\\), [Wanli Ouyang](https://wlouyang.github.io/) \\(^{1,2}\\).
- \\(^{1}\\)MMLab, The Chinese University of Hong Kong   \\(^{2}\\)Shanghai AI Laboratory
- Paper: [Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level](https://arxiv.org/pdf/2406.11817)
- Finetuned from the model: [openchat-3.5-0106](https://huggingface.co./openchat/openchat-3.5-0106)
- Dataset: [berkeley-nest/Nectar](https://huggingface.co./datasets/berkeley-nest/Nectar)
- Reward Model: [Starling-RM-34B](https://huggingface.co./Nexusflow/Starling-RM-34B)

Please see our paper for more details.

## Introduction

We released Storm-7B, the first open-source language model comparable to the GPT-4 series on the [AlpacaEval 2.0](https://tatsu-lab.github.io/alpaca_eval/) leaderboard.

Recent studies show that DPO benefits from iterative training with online preferences labeled by a trained reward model. In this work, we identify a pitfall of vanilla iterative DPO: improved response quality can lead to increased verbosity. To address this, we introduce iterative length-regularized DPO (iLR-DPO) to penalize response length. Our empirical results show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 **without increasing verbosity**.
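To make the idea of penalizing response length concrete, here is a minimal sketch of a DPO loss with a length margin for a single preference pair. It is an illustration under stated assumptions, not the training code used for Storm-7B: the function name `lr_dpo_loss`, the default values of `beta` and `alpha`, and the sign convention of the margin (adding `alpha * (len_w - len_l)` inside the sigmoid) are assumptions made here. Please refer to the paper for the exact objective.

```python
import math


def lr_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                len_w, len_l, beta=0.1, alpha=0.01):
    """Illustrative length-regularized DPO loss for one preference pair.

    logp_* / ref_logp_*: sequence log-probabilities of the preferred (w)
    and dispreferred (l) responses under the policy and the reference model.
    len_*: response lengths in tokens.
    beta, alpha: hypothetical hyperparameter values, not taken from the paper.
    """
    # Implicit reward gap, as in standard DPO.
    reward_gap = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Length margin (assumed sign convention): a longer preferred response
    # already satisfies part of the margin, so the pair contributes less.
    length_margin = alpha * (len_w - len_l)
    z = reward_gap + length_margin
    # -log(sigmoid(z)) written as a numerically stable softplus(-z).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


if __name__ == "__main__":
    # Same pair, with and without the length margin.
    print(lr_dpo_loss(-10.0, -12.0, -11.0, -11.0, 200, 100, alpha=0.0))
    print(lr_dpo_loss(-10.0, -12.0, -11.0, -11.0, 200, 100, alpha=0.01))
```

Setting `alpha=0` recovers the standard DPO loss. With `alpha > 0` in this sketch, pairs whose preferred response is longer produce a smaller loss (and hence a weaker update), while pairs whose preferred response is shorter produce a larger one, which is one way a length penalty can counteract verbosity.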
## Performance

Our 7B model achieves a **50.5%** length-controlled win rate against GPT-4 Preview on AlpacaEval 2.0.

Our model's LC win rate improves over iterations without significantly changing the response length, indicating better alignment with human values without length bias. The final trained model (iteration 3) achieves a 50.5% LC win rate, making it the first open-source model to surpass the baseline model GPT-4 Preview.

In addition to regular decoding, we also test beam search and best-of-n sampling on top of our trained model. Beam search over our trained model shows a 5% improvement over regular decoding. Best-of-n sampling with Starling-RM-34B achieves a 61.6% LC win rate and outperforms GPT-4 Omni.
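Best-of-n sampling, mentioned above, can be sketched as follows: draw `n` candidate responses, score each with a reward model, and keep the highest-scoring one. This is a simplified illustration, not the evaluation code behind the reported numbers. The `generate` and `score` callables are hypothetical stand-ins for sampling from Storm-7B and scoring with Starling-RM-34B, and the toy helpers exist only so the example runs without a GPU.

```python
from typing import Callable, Dict


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 4) -> str:
    """Sample n responses and return the one with the highest reward."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda response: score(prompt, response))


# --- Toy stand-ins (hypothetical, for illustration only) ---
TOY_SCORES: Dict[str, float] = {"A": 0.2, "B": 0.9, "C": 0.5}


def make_toy_generator() -> Callable[[str], str]:
    """Return a generator that yields canned responses A, B, C in turn."""
    canned = ["A", "B", "C"]
    state = {"i": 0}

    def generate(prompt: str) -> str:
        response = canned[state["i"] % len(canned)]
        state["i"] += 1
        return response

    return generate


def toy_score(prompt: str, response: str) -> float:
    """Look up a fixed score; a real setup would call a reward model."""
    return TOY_SCORES[response]


if __name__ == "__main__":
    print(best_of_n("How does a telescope work?", make_toy_generator(), toy_score, n=3))
```

In a real setup, `generate` would wrap a sampling call to the policy model and `score` would wrap a forward pass of the reward model. Larger `n` trades additional inference compute for a higher expected reward.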

We observe no significant degradation on traditional NLP tasks from the Hugging Face Open LLM Leaderboard.

## Uses

Our model uses the same chat template as [Openchat-3.5-0106](https://huggingface.co./openchat/openchat-3.5-0106). A sample code snippet for inference using our model is provided below.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"
model = AutoModelForCausalLM.from_pretrained("jieliu/Storm-7B").to(device)
tokenizer = AutoTokenizer.from_pretrained("jieliu/Storm-7B")
# Inference only: switch to eval mode and disable gradient tracking.
model.eval().requires_grad_(False)


def generate_response(prompt):
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
    outputs = model.generate(
        input_ids,
        max_length=2048,  # total length, prompt tokens included
        do_sample=True,
        temperature=1.0,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )
    response_ids = outputs[0]
    # The decoded text contains the prompt followed by the generated answer.
    response_text = tokenizer.decode(response_ids, skip_special_tokens=True)
    return response_text


prompt = "How does a telescope work?"
# Wrap the user message in the OpenChat-3.5 chat template.
input_prompt = f"GPT4 Correct User: {prompt}<|end_of_turn|>GPT4 Correct Assistant:"
response_text = generate_response(input_prompt)
print("Response:", response_text)
```

## Scripts

You can reproduce our results on AlpacaEval 2.0 using the script provided below.

```bash
git clone https://github.com/tatsu-lab/alpaca_eval.git
cd alpaca_eval
pip install -e .
export OPENAI_API_KEY=
alpaca_eval evaluate_from_model --model_configs 'Storm-7B'
```

## Limitations

Our work has several limitations: (1) We focus on aligning with human preferences but only use GPT-4 as a proxy for human judgment to evaluate language models. (2) We reduce verbosity with a length penalty, though verbosity and length are not necessarily correlated. Future work could train a specific reward model to directly penalize verbosity, replacing the length margin with a verbosity margin, following the standard [MODPO pipeline](https://github.com/ZHZisZZ/modpo).
## Citation

```
@article{liu2024iterative,
  title   = {Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level},
  author  = {Liu, Jie and Zhou, Zhanhui and Liu, Jiaheng and Bu, Xingyuan and Yang, Chao and Zhong, Han-Sen and Ouyang, Wanli},
  journal = {arXiv preprint arXiv:2406.11817},
  year    = {2024}
}

@article{zhou2023beyond,
  title   = {Beyond one-preference-for-all: Multi-objective direct preference optimization},
  author  = {Zhou, Zhanhui and Liu, Jie and Yang, Chao and Shao, Jing and Liu, Yu and Yue, Xiangyu and Ouyang, Wanli and Qiao, Yu},
  journal = {arXiv preprint arXiv:2310.03708},
  year    = {2023}
}
```