ryanmarten committed · Commit 962647b · verified · 1 Parent(s): cf1601b

Update README.md

Files changed (1)
  1. README.md +9 -5
README.md CHANGED
@@ -24,12 +24,16 @@ This model is a fine-tuned version of [Qwen/Qwen2.5-7B-Instruct](https://hugging
 The dataset is derived by distilling DeepSeek-R1 using the data pipeline of Berkeley NovaSky’s Sky-T1 with some modifications. More info in the dataset card at [Bespoke-Stratos-17k](https://huggingface.co/datasets/Bespoke-Stratos-17k).
 It outperforms Qwen-2.5-7B-Instruct on math reasoning benchmarks:
 
-||Bespoke-Stratos-7B|DeepSeek-R1-Distill-Qwen-7B|Qwen2.5-7B-Instruct|
+||Bespoke-Stratos-7B|Qwen2.5-7B-Instruct|DeepSeek-R1-Distill-Qwen-7B|
 |---|---|---|---|
-|AIME2024|20.0|55.5|10.0|
-|MATH500|82.0|83.3|74.2|
-|GPQA-Diamond|37.8|49.1|33.3|
-|LiveCodeBench|32.5|37.6|32.9|
+|AIME2024|20.0|10.0|55.5|
+|MATH500|82.0|74.2|83.3|
+|GPQA-Diamond|37.8|33.3|49.1|
+|LiveCodeBench v2 Easy|71.4|65.9|81.3|
+|LiveCodeBench v2 Medium|25.5|18.9|42.2|
+|LiveCodeBench v2 Hard|1.6|3.3|2.4|
+|LiveCodeBench v2 All|36.1|31.9|46.6|
+
 
 Note that the authors of Sky-T1 had [noted](https://github.com/NovaSky-AI/SkyThought/issues/4#issuecomment-2585860004) that they saw little or no improvement in training 7B or 14B models with their data.
 However, we see an improvement, though not at the scale of DeepSeek's distilled model. The reason could be that we used 17k examples, while DeepSeek seems to have used 800k.
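
For readers landing on this commit, here is a minimal inference sketch against the model the README describes. It is not part of the diff: the repo id `bespokelabs/Bespoke-Stratos-7B` and the example prompt are assumptions, and the call pattern is just the standard `transformers` chat-template API.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id; not stated anywhere in this diff.
model_id = "bespokelabs/Bespoke-Stratos-7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# A math-reasoning prompt (illustrative, AIME-style), formatted with the
# model's chat template.
messages = [
    {
        "role": "user",
        "content": "Find all positive integers n such that n^2 + 85 is a perfect square.",
    }
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Reasoning-distilled models emit long chains of thought; leave headroom.
outputs = model.generate(inputs, max_new_tokens=2048)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```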