Update README.md
README.md
CHANGED
@@ -11,4 +11,17 @@ pinned: false
 </p>
 
 
-This organization is dedicated to language models for code generation. In particular CodeParrot is a GPT-2 model trained to generate Python code.
+This organization is dedicated to language models for code generation. In particular CodeParrot is a GPT-2 model trained to generate Python code.
+
+## Table of contents:
+
+* Interactive blog: [Code generation with 🤗](https://huggingface.co/spaces/loubnabnl/code-generation-models), where we compare different code models and explain how they are trained and evaluated.
+* Spaces: code generation with [CodeParrot](https://huggingface.co/codeparrot/codeparrot) (1.5B), [InCoder](https://huggingface.co/facebook/incoder-6B) (6B), and [CodeGen](https://github.com/salesforce/CodeGen) (6B).
+* Models: CodeParrot (1.5B) and CodeParrot-small (110M); each repo has ongoing experiments in its branches.
+* Datasets:
+    * [codeparrot-clean](https://huggingface.co/datasets/codeparrot/codeparrot-clean), the dataset on which we trained and evaluated CodeParrot; the splits are available under [codeparrot-clean-train](https://huggingface.co/datasets/codeparrot/codeparrot-clean-train) and [codeparrot-clean-valid](https://huggingface.co/datasets/codeparrot/codeparrot-clean-valid).
+    * A more filtered version of codeparrot-clean, available under [codeparrot-train-more-filtering](https://huggingface.co/datasets/codeparrot/codeparrot-train-more-filtering) and [codeparrot-valid-more-filtering](https://huggingface.co/datasets/codeparrot/codeparrot-valid-more-filtering).
+    * The CodeParrot dataset after near deduplication (initially only exact-match deduplication was performed), available under [codeparrot-train-near-deduplication](https://huggingface.co/datasets/codeparrot/codeparrot-train-near-deduplication) and [codeparrot-valid-near-deduplication](https://huggingface.co/datasets/codeparrot/codeparrot-valid-near-deduplication).
+    * [GitHub-Code](https://huggingface.co/datasets/codeparrot/github-code), a 1TB dataset of GitHub files covering 32 programming languages and 60 file extensions.
+    * [GitHub-Jupyter](https://huggingface.co/datasets/codeparrot/github-jupyter), a 16.3GB dataset of Jupyter notebooks from the BigQuery GitHub dataset.
+    * [APPS](https://huggingface.co/datasets/codeparrot/apps), a benchmark for code generation with 10,000 problems.
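
The datasets list distinguishes exact-match deduplication from near deduplication. The README does not say how the near-deduplicated splits were built, so the following is only a minimal sketch of the difference, assuming Jaccard similarity over token sets and a hypothetical threshold of 0.85. The function names (`tokens`, `jaccard`, `exact_dedup`, `near_dedup`) are illustrative and not part of any CodeParrot code.

```python
from __future__ import annotations

import hashlib
import re


def tokens(source: str) -> set[str]:
    """Reduce a source file to its set of identifier and number tokens."""
    return set(re.findall(r"[A-Za-z_]\w*|\d+", source))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def exact_dedup(files: list[str]) -> list[str]:
    """Keep the first copy of each byte-identical file."""
    seen: set[str] = set()
    kept: list[str] = []
    for src in files:
        digest = hashlib.sha256(src.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(src)
    return kept


def near_dedup(files: list[str], threshold: float = 0.85) -> list[str]:
    """Keep a file only if it is below the similarity threshold
    (an assumed value) against every file already kept."""
    kept: list[str] = []
    kept_tokens: list[set[str]] = []
    for src in files:
        toks = tokens(src)
        if all(jaccard(toks, other) < threshold for other in kept_tokens):
            kept.append(src)
            kept_tokens.append(toks)
    return kept


original = "def add(a, b):\n    return a + b\n"
reformatted = original + "\n"  # same code, one extra trailing newline
different = "def mul(x, y):\n    return x * y\n"
corpus = [original, reformatted, different, original]

print(len(exact_dedup(corpus)))  # 3: the reformatted copy survives
print(len(near_dedup(corpus)))   # 2: the reformatted copy is dropped
```

The pairwise comparison here is quadratic in the number of files, so it only suits small examples; large corpora typically rely on approximate methods such as MinHash with locality-sensitive hashing.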