ryanmarten committed · Commit 962647b · verified · 1 Parent(s): cf1601b

Update README.md

Files changed (1)
  1. README.md +9 -5
README.md CHANGED
@@ -24,12 +24,16 @@ This model is a fine-tuned version of [Qwen/Qwen2.5-7B-Instruct](https://hugging
 The dataset is derived by distilling DeepSeek-R1 using the data pipeline of Berkeley NovaSky’s Sky-T1 with some modifications. More info in the dataset card at [Bespoke-Stratos-17k](https://huggingface.co/datasets/Bespoke-Stratos-17k).
 It outperforms Qwen-2.5-7B-Instruct on math reasoning benchmarks:
 
-||Bespoke-Stratos-7B|DeepSeek-R1-Distill-Qwen-7B|Qwen2.5-7B-Instruct|
+||Bespoke-Stratos-7B|Qwen2.5-7B-Instruct|DeepSeek-R1-Distill-Qwen-7B|
 |---|---|---|---|
-|AIME2024|20.0|55.5|10.0|
-|MATH500|82.0|83.3|74.2|
-|GPQA-Diamond|37.8|49.1|33.3|
-|LiveCodeBench|32.5|37.6|32.9|
+|AIME2024|20.0|10.0|55.5|
+|MATH500|82.0|74.2|83.3|
+|GPQA-Diamond|37.8|33.3|49.1|
+|LiveCodeBench v2 Easy|71.4|65.9|81.3|
+|LiveCodeBench v2 Medium|25.5|18.9|42.2|
+|LiveCodeBench v2 Hard|1.6|3.3|2.4|
+|LiveCodeBench v2 All|36.1|31.9|46.6|
+
 
 Note that the authors of Sky-T1 had [noted](https://github.com/NovaSky-AI/SkyThought/issues/4#issuecomment-2585860004) that they saw little or no improvement in training 7B or 14B models with their data.
 However, we see an improvement, though not at the scale of DeepSeek's distilled model. The reason could be that we used 17k examples, while DeepSeek seems to have used 800k.
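
For readers landing on this commit, here is a minimal inference sketch against the model the README describes. It is not part of the diff: the repo id `bespokelabs/Bespoke-Stratos-7B` and the example prompt are assumptions, and the call pattern is just the standard `transformers` chat-template API.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id; not stated anywhere in this diff.
model_id = "bespokelabs/Bespoke-Stratos-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# A math-reasoning prompt (illustrative, AIME-style), formatted with the
# model's chat template.
messages = [
    {
        "role": "user",
        "content": "Find all positive integers n such that n^2 + 85 is a perfect square.",
    }
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Reasoning-distilled models emit long chains of thought; leave headroom.
outputs = model.generate(inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```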