metadata
title: README
emoji: π
colorFrom: yellow
colorTo: purple
sdk: static
pinned: false
This organization is dedicated to language models for code generation. In particular, CodeParrot is a GPT-2 model trained to generate Python code. Here you can find:
- Interactive blog: Code generation with 🤗, where we compare different code models and explain how they are trained and evaluated.
- Spaces:
  - Code generation with CodeParrot (1.5B), InCoder (6B) and CodeGen (6B).
  - Spaces for some downstream code tasks: algorithmic complexity prediction (BigO), code explanation, and code generation from English text.
- Models: CodeParrot (1.5B) and CodeParrot-small (110M). Each repo has different ongoing experiments in its branches.
- Metrics: APPS metric.
- Datasets:
  1. codeparrot-clean, the dataset on which we trained and evaluated CodeParrot. The splits are available under codeparrot-clean-train and codeparrot-clean-valid.
  2. A more heavily filtered version of codeparrot-clean, available under codeparrot-train-more-filtering and codeparrot-valid-more-filtering.
  3. The CodeParrot dataset after near deduplication (initially only exact-match deduplication was performed), available under codeparrot-train-near-deduplication and codeparrot-valid-near-deduplication.
  4. The CodeParrot dataset after both near deduplication and the additional filtering, available under codeparrot-train-v2-near-dedup and codeparrot-valid-v2-near-dedup.
  5. GitHub-Code, a 1TB dataset of 32 programming languages from GitHub files.
  6. GitHub-Code-Clean, a cleaner version of the GitHub-Code dataset.
  7. GitHub-Jupyter, a 16.3GB dataset of Jupyter Notebooks from BigQuery GitHub.
  8. APPS, a benchmark for code generation with 10,000 problems.
  9. CodeComplex, an annotated dataset of 4,200 Java programs and their time complexity.
  10. XLCOST-text-to-code, a subset of the XLCoST benchmark for text-to-code generation at snippet level and program level for 7 programming languages: Python, C, C#, C++, Java, JavaScript and PHP.
  11. github-jupyter-text-code-pairs, a dataset of text and code pairs extracted from Jupyter notebooks.
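
As a quick orientation, here is a minimal sketch of how the models and datasets listed above might be used from Python. It assumes the repos live under the `codeparrot` namespace on the Hub, that `transformers`, `torch` and `datasets` are installed, and that the validation dataset exposes a single `train` split. The helper names `repo_id`, `complete` and `peek` are hypothetical and exist only for this example.

```python
from itertools import islice

ORG = "codeparrot"  # assumption: Hub namespace of this organization


def repo_id(name: str) -> str:
    """Build a Hub repo id such as 'codeparrot/codeparrot-small'."""
    return f"{ORG}/{name}"


def complete(prompt: str, model_name: str = "codeparrot-small",
             max_new_tokens: int = 32) -> str:
    """Generate a completion for `prompt` with one of the listed models."""
    # Imported lazily so the module loads even without transformers installed.
    from transformers import pipeline

    generator = pipeline("text-generation", model=repo_id(model_name))
    return generator(prompt, max_new_tokens=max_new_tokens)[0]["generated_text"]


def peek(dataset_name: str = "codeparrot-clean-valid", n: int = 2) -> list:
    """Stream the first `n` examples without downloading the full dataset."""
    from datasets import load_dataset

    # Assumption: the repo stores its data under a split named "train".
    stream = load_dataset(repo_id(dataset_name), split="train", streaming=True)
    return list(islice(stream, n))


if __name__ == "__main__":
    print(complete("def fibonacci(n):"))
    for example in peek():
        print(sorted(example.keys()))
```

The smaller 110M model is used by default so the sketch stays light to run; passing `model_name="codeparrot"` would select the 1.5B model instead, at a much higher memory cost. Streaming avoids pulling the whole dataset to disk just to inspect a few rows.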