---
title: README
emoji: 🦜
colorFrom: yellow
colorTo: purple
sdk: static
pinned: false
---
<p>
<img src="https://huggingface.co./datasets/loubnabnl/repo-images/resolve/main/codeparrot2.png" alt="drawing" width="440"/>
</p>
<p>Check the new instruction-tuning resources:</p>
<ul>
<li>
<p>
<b>InstructHumanEval: </b>a variant of the HumanEval benchmark adapted for instruction-tuned models<a
href="https://huggingface.co./datasets/codeparrot/instructhumaneval"
class="underline"> InstructHumanEval</a
>
</p></li>
<li>
<p>
<b>Full Curated CoNaLa: </b>we used UL2 to rewrite more than 590k uncurated intents in the CoNaLa dataset<a
href="https://huggingface.co./datasets/codeparrot/conala-mined-curated"
class="underline"> conala-mined-curated</a
>
</p></li>
<li>
<p>
<b>Self-Instruct with StarCoder: </b>we release a self-instruct dataset generated with StarCoder, as well as the code we used to build it<a
href="https://huggingface.co./datasets/codeparrot/self-instruct-starcoder"
class="underline"> self-instruct-starcoder</a
>
</p></li>
<li>
<p>
<b>Models trained on CoNaLa and self-instruct StarCoder: </b>we release the models we trained on the previous two datasets.
</p>
</li>
</ul>
<hr>
<p>
This organization is dedicated to language models for code generation. In particular, CodeParrot is a GPT-2 model trained to generate Python code. For advanced Code Language Models and
pre-training datasets we recommend checking our work in the <a href="https://huggingface.co./bigcode">BigCode organization</a>. Here you can find:
</p>
<ul>
<li>
<p>
<b>Interactive blog:</b> where we compare different code models and explain how they are trained and evaluated <a
href="https://huggingface.co./spaces/loubnabnl/code-generation-models"
class="underline">Code generation with 🤗</a
>
</p>
</li>
<li>
<p>
<b>Spaces:</b>
</p>
<ul>
<li>Code generation with: <a href="https://huggingface.co./codeparrot/codeparrot" class="underline">CodeParrot (1.5B)</a>, <a href="https://huggingface.co./facebook/incoder-6B" class="underline">InCoder (6B)</a> and <a href="https://github.com/salesforce/CodeGen" class="underline">CodeGen (6B)</a></li>
<li>Spaces for some downstream code tasks: algorithmic complexity prediction (BigO), code explanation and code generation from English text.</li>
</ul>
</li>
<li><b>Models:</b> CodeParrot (1.5B) and CodeParrot-small (110M); each repo has different ongoing experiments in its branches.</li>
<li><b>Metrics:</b> <a href="https://huggingface.co./spaces/codeparrot/apps_metric" class="underline">APPS metric</a> for the evaluation of code models on the <a href="https://huggingface.co./datasets/codeparrot/apps" class="underline">APPS</a> benchmark.</li>
<li><b>Datasets:</b><ul>
<li>1- <a href="https://huggingface.co./datasets/codeparrot/codeparrot-clean" class="underline">codeparrot-clean</a>, the dataset on which we trained and evaluated CodeParrot; the splits are available under <a href="https://huggingface.co./datasets/codeparrot/codeparrot-clean-train" class="underline">codeparrot-clean-train</a> and <a href="https://huggingface.co./datasets/codeparrot/codeparrot-clean-valid" class="underline">codeparrot-clean-valid</a>.</li>
<li>2- A more filtered version of codeparrot-clean under <a href="https://huggingface.co./datasets/codeparrot/codeparrot-train-more-filtering" class="underline">codeparrot-train-more-filtering</a> and <a href="https://huggingface.co./datasets/codeparrot/codeparrot-valid-more-filtering" class="underline">codeparrot-valid-more-filtering</a>.</li>
<li>3- CodeParrot dataset after near deduplication (initially, only exact-match deduplication was performed), available under <a href="https://huggingface.co./datasets/codeparrot/codeparrot-train-near-deduplication" class="underline">codeparrot-train-near-deduplication</a> and <a href="https://huggingface.co./datasets/codeparrot/codeparrot-valid-near-deduplication" class="underline">codeparrot-valid-near-deduplication</a>.</li>
<li>4- CodeParrot dataset after both near deduplication and the additional filtering, available under <a href="https://huggingface.co./datasets/codeparrot/codeparrot-train-v2-near-dedup" class="underline">codeparrot-train-v2-near-dedup</a> and <a href="https://huggingface.co./datasets/codeparrot/codeparrot-valid-v2-near-dedup" class="underline">codeparrot-valid-v2-near-dedup</a>.</li>
<li>5- <a href="https://huggingface.co./datasets/codeparrot/github-code" class="underline">GitHub-Code</a>, a 1TB dataset of 32 programming languages from GitHub files.</li>
<li>6- <a href="https://huggingface.co./datasets/codeparrot/github-code-clean" class="underline">GitHub-Code-Clean</a>, a cleaner version of the GitHub-Code dataset.</li>
<li>7- <a href="https://huggingface.co./datasets/codeparrot/github-jupyter" class="underline">GitHub-Jupyter</a>, a 16.3GB dataset of Jupyter Notebooks from BigQuery GitHub.</li>
<li>8- <a href="https://huggingface.co./datasets/codeparrot/github-jupyter-text-code-pairs" class="underline">github-jupyter-text-code-pairs</a>, a dataset of text and code pairs extracted from Jupyter notebooks; it is a parsed version of the github-jupyter dataset.</li>
<li>9- <a href="https://huggingface.co./datasets/codeparrot/apps" class="underline">APPS</a>, a benchmark for code generation with 10,000 problems.</li>
<li>10- <a href="https://huggingface.co./datasets/codeparrot/codecomplex" class="underline">CodeComplex</a>, an annotated dataset of 4,200 Java programs and their time complexity.</li>
<li>11- <a href="https://huggingface.co./datasets/codeparrot/xlcost-text-to-code" class="underline">XLCOST-text-to-code</a>, a subset of the XLCoST benchmark, for text-to-code generation at snippet level and program level for 7 programming languages: Python, C, C#, C++, Java, JavaScript and PHP.</li>
</ul>
</li>
</ul>