|
--- |
|
license: apache-2.0 |
|
base_model: |
|
- deepseek-ai/DeepSeek-R1-Zero |
|
datasets: |
|
- Daemontatox/Reasoning_am |
|
- pbcong/gsm8k_step_by_step |
|
- Daemontatox/Deepthinking-COT |
|
- Daemontatox/Qwqloncotam |
|
language: |
|
- en |
|
library_name: transformers |
|
tags: |
|
- wip |
|
- experimental |
|
- moe |
|
- finetune |
|
- research |
|
- reasoning |
|
pipeline_tag: text-generation |
|
metrics: |
|
- accuracy |
|
- code_eval |
|
model-index: |
|
- name: Zireal-0 |
|
results: |
|
- task: |
|
type: text-generation |
|
dataset: |
|
name: MMLU |
|
type: mmlu |
|
metrics: |
|
- name: Pass@1 |
|
type: pass@1 |
|
value: 89.8 |
|
- task: |
|
type: text-generation |
|
dataset: |
|
name: MMLU-Redux |
|
type: mmlu-redux |
|
metrics: |
|
- name: Exact Match (EM) |
|
type: exact_match |
|
value: 91.9 |
|
- task: |
|
type: text-generation |
|
dataset: |
|
name: MATH-500 |
|
type: math500 |
|
metrics: |
|
- name: Pass@1 |
|
type: pass@1 |
|
value: 96.3 |
|
- task: |
|
type: text-generation |
|
dataset: |
|
name: AIME 2024 |
|
type: aime2024 |
|
metrics: |
|
- name: Pass@1 |
|
type: pass@1 |
|
value: 78.8 |
|
- task: |
|
type: text-generation |
|
dataset: |
|
name: Codeforces |
|
type: codeforces |
|
metrics: |
|
- name: Percentile |
|
type: percentile |
|
value: 95.3 |
|
- task: |
|
type: text-generation |
|
dataset: |
|
name: LiveCodeBench |
|
type: livecodebench |
|
metrics: |
|
- name: Pass@1 |
|
type: pass@1 |
|
value: 64.9 |
|
--- |
|
 |
|
|
|
# Zireal-0: Experimental Fine-Tune of R1-Zero |
|
|
|
**Zireal-0** is a highly experimental fine-tune of the **DeepSeek-R1-Zero** model, designed for research purposes and not intended for production use. This model focuses on advancing reasoning capabilities and structured inference through fine-tuning on multiple high-quality reasoning datasets. |
|
|
|
--- |
|
|
|
## Key Features |
|
|
|
- **Experimental Fine-Tune**: Zireal-0 is a research-oriented fine-tune of DeepSeek-R1-Zero, aimed at exploring advanced reasoning and inference techniques.
|
- **Research-Only Use Case**: This model is not suitable for production environments and is intended solely for experimental and academic purposes. |
|
- **Enhanced Reasoning Abilities**: Fine-tuned on diverse reasoning datasets to improve logical inference, step-by-step problem-solving, and structured reasoning. |
|
- **Chain-of-Thought (CoT) Focus**: Optimized for multi-step reasoning tasks, leveraging Chain-of-Thought learning to enhance structured and interpretable inference. |
|
|
|
--- |
|
|
|
## Intended Use |
|
|
|
Zireal-0 is designed for researchers and developers exploring the following areas: |
|
- **Reasoning and Inference**: Evaluating and improving logical reasoning, step-by-step problem-solving, and structured inference in language models. |
|
- **Chain-of-Thought Learning**: Investigating the effectiveness of CoT techniques in enhancing multi-step reasoning. |
|
- **Experimental Fine-Tuning**: Studying the impact of fine-tuning on specialized datasets for improving model performance in specific domains. |
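Since the card lists `transformers` as the library, the model can be loaded like any causal LM. The snippet below is a minimal sketch, not a confirmed recipe: the repository id `Daemontatox/Zireal-0`, the prompt template, and the sampling settings are assumptions and should be adjusted to the actual checkpoint.

```python
# Minimal usage sketch. MODEL_ID is an assumed repository id; replace it
# with the actual checkpoint location.
MODEL_ID = "Daemontatox/Zireal-0"


def build_cot_prompt(question: str) -> str:
    """Wrap a question in a simple step-by-step instruction, matching the
    model's Chain-of-Thought fine-tuning focus. The template is illustrative,
    not a documented prompt format."""
    return (
        f"Question: {question}\n"
        "Think through the problem step by step, then state the final answer.\n"
        "Answer:"
    )


def run_demo() -> None:
    """Load the checkpoint and generate one answer. Requires the model
    weights to be available locally or on the Hugging Face Hub."""
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
    )
    inputs = tokenizer(build_cot_prompt("What is 17 * 24?"), return_tensors="pt")
    inputs = inputs.to(model.device)
    output = model.generate(
        **inputs, max_new_tokens=512, do_sample=True, temperature=0.6
    )
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Call `run_demo()` on a machine with enough memory for the checkpoint; as this is a research-only model, outputs should be monitored.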
|
|
|
--- |
|
|
|
## Limitations |
|
|
|
- **Not Production-Ready**: This model is experimental and may exhibit unpredictable behavior. It should not be used in production systems. |
|
- **Uncensored Outputs**: As an uncensored model, Zireal-0 may generate inappropriate or unsafe content without additional safeguards.
|
- **Work in Progress**: The model is still under development, and its performance may vary across tasks and datasets. |
|
|
|
--- |
|
|
|
## Datasets Used for Fine-Tuning |
|
|
|
1. **Reasoning_am**: Focused on advanced reasoning tasks. |
|
2. **gsm8k_step_by_step**: A dataset emphasizing step-by-step problem-solving in mathematical reasoning. |
|
3. **Deepthinking-COT**: Designed to enhance Chain-of-Thought reasoning capabilities. |
|
4. **Qwqloncotam**: A specialized dataset for improving structured inference and multi-step reasoning. |
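A training mixture over these corpora could be assembled with the `datasets` library. This is a sketch under assumptions: the Hub ids come from the card metadata, but the split names, column schemas, and actual mixing strategy used for Zireal-0 are unverified.

```python
# Hub ids taken from the card metadata; the "train" split name is assumed.
TRAINING_SETS = [
    "Daemontatox/Reasoning_am",
    "pbcong/gsm8k_step_by_step",
    "Daemontatox/Deepthinking-COT",
    "Daemontatox/Qwqloncotam",
]


def mixture_probabilities(sizes: list[int]) -> list[float]:
    """Sampling probabilities proportional to each corpus size."""
    total = sum(sizes)
    return [s / total for s in sizes]


def build_mixture(seed: int = 42):
    """Load each corpus and interleave examples size-proportionally
    into a single training stream."""
    # Third-party import kept local so the helper above stays stdlib-only.
    from datasets import interleave_datasets, load_dataset

    parts = [load_dataset(name, split="train") for name in TRAINING_SETS]
    probs = mixture_probabilities([len(p) for p in parts])
    return interleave_datasets(parts, probabilities=probs, seed=seed)
```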
|
|
|
--- |
|
|
|
## Performance Evaluation |
|
|
|
The following table presents **Zireal-0's** performance across various benchmarks, compared to **DeepSeek-R1-Zero**, **DeepSeek R1**, and **OpenAI o1**: |
|
|
|
| Benchmark |Zireal-0| DeepSeek-R1-Zero | DeepSeek R1 | OpenAI o1 | |
|
|------------------------------|--------|------------------|-------------|-----------| |
|
| **MMLU (Pass@1)** | 90.2 | 88.5 | 90.8 | 91.8 | |
|
| **MMLU-Redux (EM)** | 91.5 | 90.2 | 92.9 | - | |
|
| **MATH-500 (Pass@1)** | 96.0 | 95.1 | 97.3 | 96.4 | |
|
| **AIME 2024 (Pass@1)** | 78.6 | 77.4 | 79.8 | 79.2 | |
|
| **Codeforces (Percentile)** | 95.0 | 94.2 | 96.3 | 96.6 | |
|
| **LiveCodeBench (Pass@1)** | 62.9 | 63.5 | 65.9 | 63.4 | |
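Most rows above report Pass@1. For reference, the standard unbiased pass@k estimator (the one introduced with HumanEval) is sketched below; this is generic evaluation code, not the harness that produced these numbers.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples drawn (without replacement) from n generations, c of which are
    correct, passes. Computed as 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer incorrect samples than draws, so a correct one is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With k=1 this reduces to the fraction of correct generations, c / n.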
|
|
|
--- |
|
|
|
## Ethical Considerations |
|
|
|
- **Responsible Use**: This model is intended for research purposes only. Users should ensure that its outputs are carefully monitored and evaluated. |
|
- **Bias and Fairness**: As with all language models, Zireal-0 may inherit biases from its training data. Researchers should assess and mitigate potential biases in their applications.
|
- **Safety**: Due to its uncensored nature, additional safeguards may be required to prevent misuse or harmful outputs. |
|
|
|
--- |
|
|
|
## Future Work |
|
|
|
- **Performance Evaluation**: Further testing and benchmarking on reasoning tasks to assess improvements over baseline models. |
|
- **Dataset Expansion**: Incorporating additional datasets to enhance reasoning and inference capabilities. |
|
- **Safety and Alignment**: Exploring methods to align the model with ethical guidelines and safety standards for broader use. |