Congratulations on DeepSeek-R1, and an Invitation to Review Our Team’s Research from Last Year on Similar Ideas
Greetings! We would like to extend our sincere gratitude for your enduring contributions to open-source LLMs. Your dedication has allowed everyone to benefit further from what LLMs offer for personal improvement and efficiency. We were delighted to learn about your latest achievement, DeepSeek-R1-Zero. This model, trained via reinforcement learning (RL) without the preliminary step of supervised fine-tuning (SFT), has shown remarkable performance on reasoning tasks.
Interestingly, as early as March 2024, my team at ByteDance Seed noticed a similar phenomenon during our early research into RLHF on open-source models. Using the open-source Mistral model, we developed Mistral-Plus, which validated our approach of applying RL directly to the base model and bypassing SFT entirely. This method not only preserves the base model’s general capabilities but also significantly enhances its conversational abilities. Our paper, published in March 2024, is "Balancing Enhancement, Harmlessness, and General Capabilities: Enhancing Conversational LLMs with Direct RLHF" (https://arxiv.org/abs/2403.02513). The model, Mistral-Plus-7B, was open-sourced on Hugging Face last year and garnered attention upon release (https://huggingface.co./zhengchenphd/Mistral-Plus-7B).
During our research last year, we also discovered further algorithmic optimizations and solutions for applying RL directly to the base model without relying on SFT. One such innovation is dynamically and adaptively extending the output length limit during the RL phase, which enables the model to generate more detailed and analytical content. However, this introduced the problem of excessive redundant output, a challenge that aligns with your findings in the later stages of the DeepSeek-R1 project. To address this, we published another paper in June last year, "Mistral-C2F: Coarse to Fine Actor for Analytical and Reasoning Enhancement in RLHF and Effective-Merged LLMs" (https://arxiv.org/abs/2406.08657).
Our Mistral-C2F coarse-to-fine LLM introduces the "Coarse Actor," an analytical and reasoning LLM that uses the "Continuous Maximization" training strategy to dynamically extend output length limits. However, since the Coarse Actor often generates excessive redundant information and fails to terminate properly, we introduced the "Fine Actor," a knowledge-refining LLM, as a second step. After the Coarse Actor's output is generated, it is merged with the existing instruction model through a new strategy called "Knowledge Residue Merger." This integrates the detailed analytical reasoning into the existing SFT model. Our findings were published and open-sourced on Hugging Face in June 2024 (https://huggingface.co./zhengchenphd/Mistral-C2F-7B). They received notable attention and were even featured as a Hugging Face daily paper by AK (akhaliq).
We would be excited to see similar innovations applied across various LLMs and application scenarios as contributions to the open-source community. We earnestly hope for more exchanges between our teams to further advance the development of LLMs. Let’s embrace the AI revolution together!