yuxiang630 cassanof committed on
Commit
c2ae1a5
1 Parent(s): 914583c

Update README.md (#2)

- Update README.md (2b71c8fa83b3a780a67a69e609496cb9b4b74a59)


Co-authored-by: Federico Cassano <[email protected]>

Files changed (1)
  1. README.md +11 -0
README.md CHANGED
@@ -211,3 +211,14 @@ The model also inherits the bias, risks, and limitations from its base StarCoder
  - **Model:** [bigcode/starCoder2-15b-instruct-v0.1](https://huggingface.co/bigcode/starcoder2-instruct-15b-v0.1)
  - **Code:** [bigcode-project/starcoder2-self-align](https://github.com/bigcode-project/starcoder2-self-align)
  - **Dataset:** [bigcode/self-oss-instruct-sc2-exec-filter-50k](https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k/)
+
+ ### Full Data Pipeline
+
+ Our dataset generation pipeline has several steps. We provide intermediate datasets for every step of the pipeline:
+ 1. Original seed dataset filtered from The Stack v1: https://huggingface.co/datasets/bigcode/python-stack-v1-functions-filtered
+ 2. Seed dataset filtered using StarCoder2-15B as a judge for removing items with bad docstrings: https://huggingface.co/datasets/bigcode/python-stack-v1-functions-filtered-sc2
+ 3. seed -> concepts: https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-concepts
+ 4. concepts -> instructions: https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-instructions
+ 5. instructions -> response: https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-responses-unfiltered
+ 6. Responses filtered by executing them: https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-500k-raw
+ 7. Executed responses filtered by deduplicating them (final dataset): https://huggingface.co/datasets/bigcode/self-oss-instruct-sc2-exec-filter-50k
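
The seven intermediate datasets listed in this change can be addressed programmatically. The following is a minimal sketch, not part of the commit: the repository IDs are copied from the URLs above, while the stage labels (`seed`, `concepts`, and so on) and the helpers `dataset_url` and `load_stage` are hypothetical names introduced only for illustration. `load_stage` assumes the Hugging Face `datasets` package is installed and makes no assumption about each dataset's splits or columns.

```python
# Stage labels are hypothetical; repository IDs come from the list above.
PIPELINE = [
    ("seed", "bigcode/python-stack-v1-functions-filtered"),
    ("seed_sc2_filtered", "bigcode/python-stack-v1-functions-filtered-sc2"),
    ("concepts", "bigcode/self-oss-instruct-sc2-concepts"),
    ("instructions", "bigcode/self-oss-instruct-sc2-instructions"),
    ("responses_unfiltered", "bigcode/self-oss-instruct-sc2-responses-unfiltered"),
    ("exec_filtered", "bigcode/self-oss-instruct-sc2-exec-filter-500k-raw"),
    ("final", "bigcode/self-oss-instruct-sc2-exec-filter-50k"),
]


def dataset_url(repo_id: str) -> str:
    """Build the Hugging Face Hub URL for a dataset repository ID."""
    return f"https://huggingface.co/datasets/{repo_id}"


def load_stage(name: str):
    """Download one pipeline stage by its (hypothetical) label.

    Imported lazily so the rest of this sketch runs without `datasets`.
    Called without a split, load_dataset returns all available splits.
    """
    from datasets import load_dataset

    repo_id = dict(PIPELINE)[name]
    return load_dataset(repo_id)


if __name__ == "__main__":
    # Print the pipeline in order, one stage per line.
    for step, (name, repo_id) in enumerate(PIPELINE, start=1):
        print(f"{step}. {name}: {dataset_url(repo_id)}")
```

Calling `load_stage("final")` would fetch the deduplicated dataset from step 7; the earlier stages are larger, so inspecting their dataset cards on the Hub before downloading may be preferable.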