ryanmarten committed
Update README.md
README.md CHANGED
@@ -24,12 +24,16 @@ This model is a fine-tuned version of [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)
 The dataset is derived by distilling DeepSeek-R1 using the data pipeline of Berkeley NovaSky’s Sky-T1 with some modifications. More info in the dataset card at [Bespoke-Stratos-17k](https://huggingface.co/datasets/Bespoke-Stratos-17k).
 It outperforms Qwen-2.5-7B-Instruct on math reasoning benchmarks:
 
-||Bespoke-Stratos-7B|DeepSeek-R1-Distill-Qwen-7B|
+||Bespoke-Stratos-7B|Qwen2.5-7B-Instruct|DeepSeek-R1-Distill-Qwen-7B|
 |---|---|---|---|
-|AIME2024|20.0|55.5|
-|MATH500|82.0|83.3|
-|GPQA-Diamond|37.8|49.1|
-|LiveCodeBench|
+|AIME2024|20.0|10.0|55.5|
+|MATH500|82.0|74.2|83.3|
+|GPQA-Diamond|37.8|33.3|49.1|
+|LiveCodeBench v2 Easy|71.4|65.9|81.3|
+|LiveCodeBench v2 Medium|25.5|18.9|42.2|
+|LiveCodeBench v2 Hard|1.6|3.3|2.4|
+|LiveCodeBench v2 All|36.1|31.9|46.6|
+
 
 Note that the authors of Sky-T1 had [noted](https://github.com/NovaSky-AI/SkyThought/issues/4#issuecomment-2585860004) that they saw little or no improvement in training 7B or 14B models with their data.
 However, we see an improvement, though not at the scale of DeepSeek's distilled model. The reason could be that we used 17k examples, while DeepSeek seems to have used 800k.
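
To try the model behind these numbers, here is a minimal inference sketch, not part of the commit itself. It assumes the model is published under the Hugging Face repo id `bespokelabs/Bespoke-Stratos-7B` (the id is not stated in this diff; adjust to the actual repository) and uses the standard `transformers` chat API.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bespokelabs/Bespoke-Stratos-7B"  # assumed repo id; not stated in this diff
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# The base model is Qwen2.5-7B-Instruct, so the tokenizer ships a chat template.
messages = [
    {"role": "user", "content": "What is the sum of the first 50 odd positive integers?"}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Distilled reasoning models emit a long chain of thought before the answer,
# so leave a generous generation budget.
output = model.generate(input_ids, max_new_tokens=2048)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

The expected final answer is 2500 (the sum of the first n odd numbers is n²), which makes a quick sanity check on the reasoning trace.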