loubnabnl HF staff committed on
Commit
c3ea8fa
1 Parent(s): 67b7d8f

update datasets

Files changed (1)
  1. datasets/incoder.txt +3 -3
datasets/incoder.txt CHANGED
@@ -1,8 +1,8 @@
- [InCoder](https://huggingface.co/facebook/incoder-6B) was trained on **216 GB** of data from Github and Stackoverflow from 28 programming languages. 52 GB is in Python, 107GB in other programming languages and 57GB is content from Stackoverflow that isn't code.
+ [InCoder](https://huggingface.co/facebook/incoder-6B) was trained on **216 GB** of data, after preprocessing, from Github and Stackoverflow from 28 programming languages. 52 GB is in Python, 107GB in other programming languages and 57GB is content from Stackoverflow that isn't code.
 
  The Github data used the following filtering:
- - Average line length < 100
- - Maximum line length < 3000
+ - Average line length < 100 tokens
+ - Maximum line length < 3000 MB
  - Alphanumeric characters fraction > 0.4
  - Remove auto-generated files (keyword search)
 
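For illustration only, the four filtering rules listed in the changed file could be sketched as a single predicate. This is not InCoder's actual preprocessing code. The function name `passes_filters`, the keyword list, and the choice to count both line-length limits in whitespace-separated tokens are assumptions made for the sketch; the file itself gives the units as "tokens" for the average and "MB" for the maximum, and it does not say which keywords were searched.

```python
# Hypothetical keyword list; the source only says "keyword search".
AUTO_GENERATED_KEYWORDS = ("auto-generated", "autogenerated", "automatically generated")


def passes_filters(text, max_avg_line_len=100, max_line_len=3000, min_alnum_frac=0.4):
    """Return True if a file's text passes all four heuristic filters."""
    lines = text.splitlines()
    if not lines:
        return False  # Assumption: empty files are dropped.

    # Assumption: line length is counted in whitespace-separated tokens.
    lengths = [len(line.split()) for line in lines]
    if sum(lengths) / len(lengths) >= max_avg_line_len:
        return False
    if max(lengths) >= max_line_len:
        return False

    # Fraction of alphanumeric characters over all characters, whitespace included.
    alnum_frac = sum(ch.isalnum() for ch in text) / len(text)
    if alnum_frac <= min_alnum_frac:
        return False

    # Case-insensitive keyword search for auto-generated markers.
    lowered = text.lower()
    return not any(keyword in lowered for keyword in AUTO_GENERATED_KEYWORDS)
```

The thresholds are keyword arguments so that a different unit or limit can be swapped in without touching the logic.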