Loubna ben allal
add files
c9e8e4a
raw
history blame
704 Bytes
[CodeParrot](https://huggingface.co./lvwerra/codeparrot) was trained on **50GB** of Python data from Github repositories: [CodeParrot dataset](https://huggingface.co./datasets/lvwerra/codeparrot-clean). The original dataset contains a lot of duplicated and noisy data. Therefore, the dataset was cleaned with the following steps:
- Exact match deduplication
- Filtering
- Average line length < 100
- Maximum line length < 1000
- Alpha numeric characters fraction > 0.25
- Remove auto-generated files (keyword search)
For more details see the preprocessing script in the transformers repository [here](https://github.com/huggingface/transformers/tree/master/examples/research_projects/codeparrot).