---
license: apache-2.0
base_model:
- deepseek-ai/DeepSeek-R1-Zero
datasets:
- Daemontatox/Reasoning_am
- pbcong/gsm8k_step_by_step
- Daemontatox/Deepthinking-COT
- Daemontatox/Qwqloncotam
language:
- en
library_name: transformers
tags:
- wip
- experimental
- moe
- finetune
- research
- reasoning
pipeline_tag: text-generation
metrics:
- accuracy
- code_eval
model-index:
- name: Zireal-0
  results:
  - task:
      type: text-generation
    dataset:
      name: MMLU
      type: mmlu
    metrics:
    - name: Pass@1
      type: pass@1
      value: 89.8
  - task:
      type: text-generation
    dataset:
      name: MMLU-Redux
      type: mmlu-redux
    metrics:
    - name: Exact Match (EM)
      type: exact_match
      value: 91.9
  - task:
      type: text-generation
    dataset:
      name: MATH-500
      type: math500
    metrics:
    - name: Pass@1
      type: pass@1
      value: 96.3
  - task:
      type: text-generation
    dataset:
      name: AIME 2024
      type: aime2024
    metrics:
    - name: Pass@1
      type: pass@1
      value: 78.8
  - task:
      type: text-generation
    dataset:
      name: Codeforces
      type: codeforces
    metrics:
    - name: Percentile
      type: percentile
      value: 95.3
  - task:
      type: text-generation
    dataset:
      name: LiveCodeBench
      type: livecodebench
    metrics:
    - name: Pass@1
      type: pass@1
      value: 64.9
---
# Zireal-0: Experimental Fine-Tune of R1-Zero
Zireal-0 is a highly experimental fine-tune of the DeepSeek-R1-Zero model, designed for research purposes and not intended for production use. This model focuses on advancing reasoning capabilities and structured inference through fine-tuning on multiple high-quality reasoning datasets.
## Key Features
- Experimental Fine-Tune: Zireal-0 is a research-oriented fine-tune of DeepSeek-R1-Zero, aimed at exploring advanced reasoning and inference techniques.
- Research-Only Use Case: This model is not suitable for production environments and is intended solely for experimental and academic purposes.
- Enhanced Reasoning Abilities: Fine-tuned on diverse reasoning datasets to improve logical inference, step-by-step problem-solving, and structured reasoning.
- Chain-of-Thought (CoT) Focus: Optimized for multi-step reasoning tasks, leveraging Chain-of-Thought learning to enhance structured and interpretable inference.
## Intended Use
Zireal-0 is designed for researchers and developers exploring the following areas:
- Reasoning and Inference: Evaluating and improving logical reasoning, step-by-step problem-solving, and structured inference in language models.
- Chain-of-Thought Learning: Investigating the effectiveness of CoT techniques in enhancing multi-step reasoning.
- Experimental Fine-Tuning: Studying the impact of fine-tuning on specialized datasets for improving model performance in specific domains.
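Since the card lists `transformers` as the library and `text-generation` as the pipeline, the model can presumably be loaded like any causal LM on the Hub. The sketch below is hedged: the repository id `Daemontatox/Zireal-0` is an assumption (the card does not state it), the sampling parameters are illustrative defaults, and the heavy download is gated behind an environment variable so the prompt helper can be exercised on its own.

```python
import os
from textwrap import dedent

# Hypothetical Hub id -- replace with the actual repository name.
MODEL_ID = "Daemontatox/Zireal-0"

def build_cot_prompt(question: str) -> str:
    """Wrap a question in a simple chain-of-thought instruction,
    matching the model's step-by-step reasoning focus."""
    return dedent(f"""\
        Answer the following question. Think step by step and show your
        reasoning before giving the final answer.

        Question: {question}
        Reasoning:""")

# Gate the multi-GB download so importing this file stays cheap.
if os.environ.get("RUN_ZIREAL_DEMO"):
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, device_map="auto", torch_dtype="auto"
    )
    inputs = tokenizer(
        build_cot_prompt("What is 17 * 24?"), return_tensors="pt"
    ).to(model.device)
    output = model.generate(
        **inputs, max_new_tokens=512, do_sample=True, temperature=0.6
    )
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Set `RUN_ZIREAL_DEMO=1` to actually run generation; otherwise only the prompt builder is defined.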
## Limitations
- Not Production-Ready: This model is experimental and may exhibit unpredictable behavior. It should not be used in production systems.
- Uncensored Outputs: As an uncensored model, Zireal-0 may generate content that is inappropriate or unsafe without additional safeguards.
- Work in Progress: The model is still under development, and its performance may vary across tasks and datasets.
## Datasets Used for Fine-Tuning
- Reasoning_am: Focused on advanced reasoning tasks.
- gsm8k_step_by_step: A dataset emphasizing step-by-step problem-solving in mathematical reasoning.
- Deepthinking-COT: Designed to enhance Chain-of-Thought reasoning capabilities.
- Qwqloncotam: A specialized dataset for improving structured inference and multi-step reasoning.
## Performance Evaluation
The following table presents Zireal-0's performance across various benchmarks, compared to DeepSeek-R1-Zero, DeepSeek R1, and OpenAI o1:
| Benchmark | Zireal-0 | DeepSeek-R1-Zero | DeepSeek R1 | OpenAI o1 |
|---|---|---|---|---|
| MMLU (Pass@1) | 90.2 | 88.5 | 90.8 | 91.8 |
| MMLU-Redux (EM) | 91.5 | 90.2 | 92.9 | - |
| MATH-500 (Pass@1) | 96.0 | 95.1 | 97.3 | 96.4 |
| AIME 2024 (Pass@1) | 78.6 | 77.4 | 79.8 | 79.2 |
| Codeforces (Percentile) | 95.0 | 94.2 | 96.3 | 96.6 |
| LiveCodeBench (Pass@1) | 62.9 | 63.5 | 65.9 | 63.4 |
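The card does not specify the exact evaluation protocol behind the Pass@1 numbers above. A common choice is the unbiased pass@k estimator from the Codex evaluation literature, which, given `n` sampled completions of which `c` are correct, estimates the probability that at least one of `k` drawn samples passes. A minimal sketch, under that assumption:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n total samples, c correct samples.

    Returns 1 - C(n-c, k) / C(n, k); for k=1 this reduces to c/n.
    """
    if n - c < k:
        # Fewer incorrect samples than k: some draw of k must contain a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, 5 correct completions out of 10 samples gives pass@1 = 0.5, and pass@k approaches 1.0 as k grows toward n.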
## Ethical Considerations
- Responsible Use: This model is intended for research purposes only. Users should ensure that its outputs are carefully monitored and evaluated.
- Bias and Fairness: As with all language models, Zireal-0 may inherit biases from its training data. Researchers should assess and mitigate potential biases in their applications.
- Safety: Due to its uncensored nature, additional safeguards may be required to prevent misuse or harmful outputs.
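The card leaves the choice of safeguards to the user. As one illustration of where a post-generation filter would sit, here is a deliberately minimal keyword blocklist; the patterns are hypothetical placeholders, and a regex blocklist is not a real safety system, only a sketch of the hook point.

```python
import re

# Hypothetical placeholder patterns -- a production deployment would use a
# proper moderation model or service, not a static keyword list.
BLOCKLIST = [
    r"\bhow to build a bomb\b",
    r"\bcredit card numbers\b",
]

REFUSAL = "[output withheld by safety filter]"

def moderate(text: str) -> str:
    """Return the text unchanged, or a refusal string if any pattern matches."""
    for pattern in BLOCKLIST:
        if re.search(pattern, text, flags=re.IGNORECASE):
            return REFUSAL
    return text
```

Such a filter would wrap the model's `generate` output before it reaches the user.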
## Future Work
- Performance Evaluation: Further testing and benchmarking on reasoning tasks to assess improvements over baseline models.
- Dataset Expansion: Incorporating additional datasets to enhance reasoning and inference capabilities.
- Safety and Alignment: Exploring methods to align the model with ethical guidelines and safety standards for broader use.