ryanmarten committed
Commit 54e8451 · verified · 1 Parent(s): 962647b

Update README.md

Files changed (1)
  1. README.md +9 -9
README.md CHANGED
@@ -24,15 +24,15 @@ This model is a fine-tuned version of [Qwen/Qwen2.5-7B-Instruct](https://hugging
 The dataset is derived by distilling DeepSeek-R1 using the data pipeline of Berkeley NovaSky’s Sky-T1 with some modifications. More info in the dataset card at [Bespoke-Stratos-17k](https://huggingface.co/datasets/Bespoke-Stratos-17k).
 It outperforms Qwen-2.5-7B-Instruct on math reasoning benchmarks:

-||Bespoke-Stratos-7B|Qwen2.5-7B-Instruct|DeepSeek-R1-Distill-Qwen-7B|
-|---|---|---|---|
-|AIME2024|20.0|10.0|55.5|
-|MATH500|82.0|74.2|83.3|
-|GPQA-Diamond|37.8|33.3|49.1|
-|LiveCodeBench v2 Easy|71.4|65.9|81.3|
-|LiveCodeBench v2 Medium|25.5|18.9|42.2|
-|LiveCodeBench v2 Hard|1.6|3.3|2.4|
-|LiveCodeBench v2 All|36.1|31.9|46.6|
+||Bespoke-Stratos-7B|Qwen2.5-7B-Instruct|DeepSeek-R1-Distill-Qwen-7B (Ours)|DeepSeek-R1-Distill-Qwen-7B (Reported)|
+|---|---|---|---|---|
+|AIME2024|20.0|10.0|43.3|55.5|
+|MATH500|82.0|74.2|89.4|83.3|
+|GPQA-Diamond|37.8|33.3|44.9|49.1|
+|LiveCodeBench v2 Easy|71.4|65.9|81.3|-|
+|LiveCodeBench v2 Medium|25.5|18.9|42.2|-|
+|LiveCodeBench v2 Hard|1.6|3.3|2.4|-|
+|LiveCodeBench v2 All|36.1|31.9|46.6|-|


 Note that the authors of Sky-T1 had [noted](https://github.com/NovaSky-AI/SkyThought/issues/4#issuecomment-2585860004) that they saw little or no improvement in training 7B or 14B models with their data.
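
For reference, the fine-tuned model described in this card can typically be loaded through the standard Hugging Face transformers chat API. A minimal sketch, assuming the repo id `bespokelabs/Bespoke-Stratos-7B` (the exact path is not stated in this diff):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id; replace with the actual model path if it differs.
model_id = "bespokelabs/Bespoke-Stratos-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",  # load in the checkpoint's native precision
    device_map="auto",   # requires `accelerate`; places weights on available devices
)

# Qwen2.5-based checkpoints ship a chat template, so apply_chat_template builds the prompt.
messages = [{"role": "user", "content": "What is the sum of the first 10 positive integers?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=1024)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```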