---
title: README
emoji: 🦜
colorFrom: yellow
colorTo: purple
sdk: static
pinned: false
---
<p>
<img src="https://huggingface.co./datasets/loubnabnl/repo-images/resolve/main/codeparrot2.png" alt="drawing" width="440"/>
</p>
<p>Check the new instruction-tuning resources (a loading sketch follows the list):</p>
<ul>
<li>
<p>
<b>InstructHumanEval: </b>a variant of the HumanEval benchmark adapted for instruction-tuned models<a
href="https://huggingface.co./datasets/codeparrot/instructhumaneval"
class="underline"> InstructHumanEval</a
>
</p></li>
<li>
<p>
<b>Full Curated CoNaLa: </b>we used UL2 to rewrite more than 590k uncurated intents in the CoNaLa dataset<a
href="https://huggingface.co./datasets/codeparrot/conala-mined-curated"
class="underline"> conala-mined-curated</a
>
</p></li>
<li>
<p>
<b>Self-Instruct with StarCoder: </b>we release a self-instruct dataset generated with StarCoder, as well as the code we used to build it<a
href="https://huggingface.co./datasets/codeparrot/self-instruct-starcoder"
class="underline"> self-instruct-starcoder</a
>
</p></li>
<li>
<p>
<b>Models trained on CoNaLa and self-instruct StarCoder: </b>we release the models we trained on the previous two datasets.
</p>
</li>
</ul>
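<p>
As a quick orientation, all of the datasets above can be loaded with the <code>datasets</code> library. The sketch below is minimal: the dataset IDs come from the links above, but split names vary per dataset, so inspect what is returned before indexing:
</p>
<pre><code class="language-python">
from datasets import load_dataset

# Dataset IDs come from the resource links above; split names vary
# per dataset, so print the DatasetDict to see what is available.
instruct_humaneval = load_dataset("codeparrot/instructhumaneval")
conala_curated = load_dataset("codeparrot/conala-mined-curated")
self_instruct = load_dataset("codeparrot/self-instruct-starcoder")

print(instruct_humaneval)  # shows the available splits and columns
</code></pre>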
<hr>
<p>
This organization is dedicated to language models for code generation. In particular, CodeParrot is a GPT-2 model trained to generate Python code. For advanced code language models and
pre-training datasets we recommend checking our work in the <a href="https://huggingface.co./bigcode">BigCode organization</a>. Here you can find:
</p>
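<p>
Since CodeParrot is a standard GPT-2 style model, it works with the usual <code>transformers</code> text-generation pipeline. A minimal sketch, using the smaller checkpoint to keep the download light; the sampling settings are illustrative, not tuned:
</p>
<pre><code class="language-python">
from transformers import pipeline

# CodeParrot is a GPT-2 style model, so the standard text-generation
# pipeline applies; sampling settings here are illustrative, not tuned.
pipe = pipeline("text-generation", model="codeparrot/codeparrot-small")

out = pipe("def fibonacci(n):", max_new_tokens=64, do_sample=True, temperature=0.2)
print(out[0]["generated_text"])
</code></pre>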
<ul>
<li>
<p>
<b>Interactive blog:</b> where we compare different code models and explain how they are trained and evaluated <a
href="https://huggingface.co./spaces/loubnabnl/code-generation-models"
class="underline">Code generation with 🤗</a
>
</p>
</li>
<li>
<p>
<b>Spaces:</b>
</p>
<ul>
<li>Code generation with <a href="https://huggingface.co./codeparrot/codeparrot" class="underline">CodeParrot (1.5B)</a>, <a href="https://huggingface.co./facebook/incoder-6B" class="underline">InCoder (6B)</a> and <a href="https://github.com/salesforce/CodeGen" class="underline">CodeGen (6B)</a></li>
<li>Spaces for some code downstream tasks: algorithmic complexity prediction (BigO), code explanation and code generation from English text.</li>
</ul>
</li>
<li><b>Models:</b> CodeParrot (1.5B) and CodeParrot-small (110M); each repository hosts ongoing experiments on different branches.</li>
<li><b>Metrics:</b> <a href="https://huggingface.co./spaces/codeparrot/apps_metric" class="underline">APPS metric</a> for the evaluation of code models on the <a href="https://huggingface.co./datasets/codeparrot/apps" class="underline">APPS</a> benchmark (see the evaluation sketch after this list).</li>
<li><b>Datasets:</b> (see the streaming sketch after this list)<ul>
<li>1- <a href="https://huggingface.co./datasets/codeparrot/codeparrot-clean" class="underline">codeparrot-clean</a>, the dataset on which we trained and evaluated CodeParrot; the splits are available under <a href="https://huggingface.co./datasets/codeparrot/codeparrot-clean-train" class="underline">codeparrot-clean-train</a> and <a href="https://huggingface.co./datasets/codeparrot/codeparrot-clean-valid" class="underline">codeparrot-clean-valid</a>.</li>
<li>2- A more filtered version of codeparrot-clean, available under <a href="https://huggingface.co./datasets/codeparrot/codeparrot-train-more-filtering" class="underline">codeparrot-train-more-filtering</a> and <a href="https://huggingface.co./datasets/codeparrot/codeparrot-valid-more-filtering" class="underline">codeparrot-valid-more-filtering</a>.</li>
<li>3- The CodeParrot dataset after near-deduplication (initially only exact-match deduplication was performed), available under <a href="https://huggingface.co./datasets/codeparrot/codeparrot-train-near-deduplication" class="underline">codeparrot-train-near-deduplication</a> and <a href="https://huggingface.co./datasets/codeparrot/codeparrot-valid-near-deduplication" class="underline">codeparrot-valid-near-deduplication</a>.</li>
<li>4- The CodeParrot dataset after both near-deduplication and the additional filtering, available under <a href="https://huggingface.co./datasets/codeparrot/codeparrot-train-v2-near-dedup" class="underline">codeparrot-train-v2-near-dedup</a> and <a href="https://huggingface.co./datasets/codeparrot/codeparrot-valid-v2-near-dedup" class="underline">codeparrot-valid-v2-near-dedup</a>.</li>
<li>5- <a href="https://huggingface.co./datasets/codeparrot/github-code" class="underline">GitHub-Code</a>, a 1TB dataset of 32 programming languages from GitHub files.</li>
<li>6- <a href="https://huggingface.co./datasets/codeparrot/github-code-clean" class="underline">GitHub-Code-Clean</a>, a cleaner version of the GitHub-Code dataset.</li>
<li>7- <a href="https://huggingface.co./datasets/codeparrot/github-jupyter" class="underline">GitHub-Jupyter</a>, a 16.3GB dataset of Jupyter Notebooks from BigQuery GitHub.</li>
<li>8- <a href="https://huggingface.co./datasets/codeparrot/github-jupyter-text-code-pairs" class="underline">github-jupyter-text-code-pairs</a>, a dataset of text and code pairs extracted from Jupyter notebooks; it is a parsed version of the github-jupyter dataset.</li>
<li>9- <a href="https://huggingface.co./datasets/codeparrot/apps" class="underline">APPS</a>, a benchmark for code generation with 10,000 problems.</li>
<li>10- <a href="https://huggingface.co./datasets/codeparrot/codecomplex" class="underline">CodeComplex</a>, an annotated dataset of 4,200 Java programs and their time complexities.</li>
<li>11- <a href="https://huggingface.co./datasets/codeparrot/xlcost-text-to-code" class="underline">XLCOST-text-to-code</a>, a subset of the XLCoST benchmark, for text-to-code generation at snippet level and program level for 7 programming languages: Python, C, C#, C++, Java, JavaScript and PHP.</li>
</ul>
</li>
</ul> |
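<p>
Several of these datasets are large (GitHub-Code alone is about 1TB), so streaming avoids downloading everything up front. A minimal sketch; the <code>languages</code> filter and column names follow the GitHub-Code dataset card and should be treated as assumptions to verify there:
</p>
<pre><code class="language-python">
from datasets import load_dataset

# GitHub-Code is ~1TB, so stream it instead of downloading in full.
# The `languages` filter and the column names follow the dataset card;
# if they differ, drop the filter and inspect one example first.
ds = load_dataset(
    "codeparrot/github-code",
    split="train",
    streaming=True,
    languages=["Python"],
)

for example in ds.take(3):
    print(example["repo_name"], example["path"])
</code></pre>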
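<p>
To evaluate model generations on APPS, the APPS metric can be loaded through the <code>evaluate</code> library. A sketch, assuming the compute signature shown on the metric card (a list of candidate solutions per problem plus a difficulty level):
</p>
<pre><code class="language-python">
from evaluate import load

# Load the APPS metric from its Space. Per the metric card, `predictions`
# is a list of lists (candidate solutions per problem) and `level`
# selects the difficulty split; both are assumptions to verify there.
apps_metric = load("codeparrot/apps_metric")

generations = [["def solve():\n    print(input())"]]  # toy single-problem input
results = apps_metric.compute(predictions=generations, level="introductory")
print(results)
</code></pre>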