loubnabnl HF staff committed on
Commit
f4022e4
1 Parent(s): 8a737b0
Files changed (1)
  1. datasets/codegen.txt +1 -1
datasets/codegen.txt CHANGED
@@ -1,6 +1,6 @@
  [Codegen](https://huggingface.co/Salesforce/codegen-16B-mono) is a model for conversational program synthesis, where each problem is interactively solved in multiple steps, each consisting of a natural language specification from the user and a synthesized subprogram from the system.
 
- It was was sequentially trained on three datasets:
+ It was sequentially trained on three datasets:
  - [The Pile](https://huggingface.co/datasets/the_pile)
  - A 341GB subset of Google’s [BigQuery dataset](https://cloud.google.com/blog/topics/public-datasets/github-on-bigquery-analyze-all-the-open-source-code) of code files from multiple programming languages, keeping only 6: C, C++, Go, Java, JavaScript, and Python
  - 217GB of Python data from Github repositories