This model is a fine-tuned version of [Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct).
The dataset is derived by distilling DeepSeek-R1 using the data pipeline of Berkeley NovaSky’s Sky-T1 with some modifications. More info in the dataset card at [Bespoke-Stratos-17k](https://huggingface.co/datasets/Bespoke-Stratos-17k).

It outperforms Qwen-2.5-7B-Instruct on math reasoning benchmarks:

||Bespoke-Stratos-7B|Qwen2.5-7B-Instruct|DeepSeek-R1-Distill-Qwen-7B (Ours)|DeepSeek-R1-Distill-Qwen-7B (Reported)|
|---|---|---|---|---|
|AIME2024|20.0|10.0|43.3|55.5|
|MATH500|82.0|74.2|89.4|83.3|
|GPQA-Diamond|37.8|33.3|44.9|49.1|
|LiveCodeBench v2 Easy|71.4|65.9|81.3|-|
|LiveCodeBench v2 Medium|25.5|18.9|42.2|-|
|LiveCodeBench v2 Hard|1.6|3.3|2.4|-|
|LiveCodeBench v2 All|36.1|31.9|46.6|-|
Note that the authors of Sky-T1 [noted](https://github.com/NovaSky-AI/SkyThought/issues/4#issuecomment-2585860004) that they saw little or no improvement when training 7B or 14B models with their data.
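For reference, the model can be queried like any chat model via Hugging Face `transformers`. The sketch below is a minimal, hedged example: the repo id `bespokelabs/Bespoke-Stratos-7B` and the generation budget are assumptions for illustration, not confirmed by this card; substitute the actual model id.

```python
# Minimal inference sketch using Hugging Face transformers.
# NOTE: the repo id below is an assumed placeholder; substitute the actual
# Hugging Face model id for Bespoke-Stratos-7B.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bespokelabs/Bespoke-Stratos-7B"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "user", "content": "What is the sum of the first 100 positive integers?"}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Reasoning-distilled models emit long chains of thought before the final
# answer, so allow a generous new-token budget.
output_ids = model.generate(input_ids, max_new_tokens=4096)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

Since the model is distilled from DeepSeek-R1 traces, expect the decoded output to contain an extended reasoning section before the boxed final answer.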