Loubna ben allal committed
Commit • c9e8e4a
1 Parent(s): d490108
add files
app.py
ADDED
@@ -0,0 +1,82 @@
import streamlit as st
from transformers import AutoTokenizer, AutoModelForCausalLM, set_seed
from transformers import pipeline
import torch
import json


# Cache tokenizer/model loading so Streamlit reruns don't reload the weights.
@st.cache(allow_output_mutation=True)
def load_tokenizer(model_ckpt):
    return AutoTokenizer.from_pretrained(model_ckpt)

@st.cache(allow_output_mutation=True)
def load_model(model_ckpt):
    model = AutoModelForCausalLM.from_pretrained(model_ckpt, low_cpu_mem_usage=True)
    return model

@st.cache()
def load_examples():
    with open("examples.json", "r") as f:
        examples = json.load(f)
    return examples

st.set_page_config(page_icon=':parrot:', layout="wide")

tokenizer1 = load_tokenizer("lvwerra/codeparrot")
model1 = load_model("lvwerra/codeparrot")

tokenizer2 = load_tokenizer("facebook/opt-1.3b")
model2 = load_model("facebook/opt-1.3b")

tokenizer3 = load_tokenizer("facebook/incoder-1B")
model3 = load_model("facebook/incoder-1B")

# Sidebar controls: which models to compare and which task to display.
st.sidebar.header("Models:")
models = ["CodeParrot", "OPT", "InCoder"]
selected_models = st.multiselect('Select code generation models to compare',
                                 models,
                                 default=["CodeParrot"])
st.sidebar.header("Tasks:")
tasks = ["Model architecture", "Model evaluation", "Pretraining dataset", "Prompting"]
selected_task = st.sidebar.selectbox("Select a task:", tasks, index=0)

st.title("Code Generation Models 👩‍💻")

architectures = {}
datasets = {}
pipelines = {}
if selected_task == "Model architecture":
    st.markdown("## Model architectures")
    for model in selected_models:
        with open(f"datasets/{model.lower()}.txt", "r") as f:
            text = f.read()
        # architectures[model] = text
        st.markdown(f"### {model}:")
        st.markdown(text)

elif selected_task == "Pretraining dataset":
    st.markdown("## Pretraining Datasets")
    for model in selected_models:
        with open(f"datasets/{model.lower()}.txt", "r") as f:
            text = f.read()
        # datasets[model] = text
        st.markdown(f"### {model}:")
        st.markdown(text)

elif selected_task == "Prompting":
    # Build a text-generation pipeline per selected model, keyed by model name.
    for model in selected_models:
        if model == "CodeParrot":
            tokenizer = load_tokenizer("lvwerra/codeparrot")
            loaded_model = load_model("lvwerra/codeparrot")
            pipe = pipeline("text-generation", model=loaded_model, tokenizer=tokenizer)
            pipelines[model] = pipe
        elif model == "InCoder":
            tokenizer = load_tokenizer("facebook/incoder-1B")
            loaded_model = load_model("facebook/incoder-1B")
            pipe = pipeline("text-generation", model=loaded_model, tokenizer=tokenizer)
            pipelines[model] = pipe
        else:
            tokenizer = load_tokenizer("facebook/opt-1.3b")
            loaded_model = load_model("facebook/opt-1.3b")
            pipe = pipeline("text-generation", model=loaded_model, tokenizer=tokenizer)
            pipelines[model] = pipe
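The Prompting branch above only builds the pipelines and then stops. A minimal sketch of how they could be queried from the app, assuming a free-text prompt box; the widget, default prompt, and generation parameters below are placeholders, not part of the committed file:

```python
# Hypothetical continuation of the Prompting branch: generate one completion
# per selected model for a user-provided prompt.
prompt = st.text_area("Enter a prompt:", value="def fibonacci(n):")
if st.button("Generate") and prompt:
    set_seed(42)  # reproducible sampling across reruns
    for name, pipe in pipelines.items():
        output = pipe(prompt, max_new_tokens=64, do_sample=True, temperature=0.2)
        st.markdown(f"### {name}:")
        st.code(output[0]["generated_text"], language="python")
```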
datasets/.ipynb_checkpoints/codeparrot-checkpoint.txt
ADDED
@@ -0,0 +1,9 @@
[CodeParrot](https://huggingface.co/lvwerra/codeparrot) was trained on **50GB** of Python data from GitHub repositories: the [CodeParrot dataset](https://huggingface.co/datasets/lvwerra/codeparrot-clean). The original dataset contains a lot of duplicated and noisy data; therefore, it was cleaned with the following steps:
- Exact match deduplication
- Filtering:
    - Average line length < 100
    - Maximum line length < 1000
    - Alphanumeric characters fraction > 0.25
    - Remove auto-generated files (keyword search)

For more details see the preprocessing script in the transformers repository [here](https://github.com/huggingface/transformers/tree/master/examples/research_projects/codeparrot).
datasets/.ipynb_checkpoints/incoder-checkpoint.txt
ADDED
File without changes
datasets/.ipynb_checkpoints/opt-checkpoint.txt
ADDED
@@ -0,0 +1,2 @@
[OPT](https://huggingface.co/facebook/opt-30b) was trained on 5 filtered datasets of textual documents, one of which includes code: [The Pile](https://arxiv.org/pdf/2101.00027v1.pdf), from which the *Pile-CC, OpenWebText2, USPTO, Project Gutenberg, OpenSubtitles, Wikipedia, DM Mathematics and HackerNews* subsets were used.
The final training data contains 180B tokens, corresponding to 800GB of data. For more details please refer to this [paper](https://arxiv.org/abs/2205.01068).
datasets/codeparrot.txt
ADDED
@@ -0,0 +1,9 @@
[CodeParrot](https://huggingface.co/lvwerra/codeparrot) was trained on **50GB** of Python data from GitHub repositories: the [CodeParrot dataset](https://huggingface.co/datasets/lvwerra/codeparrot-clean). The original dataset contains a lot of duplicated and noisy data; therefore, it was cleaned with the following steps:
- Exact match deduplication
- Filtering:
    - Average line length < 100
    - Maximum line length < 1000
    - Alphanumeric characters fraction > 0.25
    - Remove auto-generated files (keyword search)

For more details see the preprocessing script in the transformers repository [here](https://github.com/huggingface/transformers/tree/master/examples/research_projects/codeparrot).
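As a rough illustration of the per-file heuristics listed above, a minimal sketch is shown below; the function name and the keyword list are made up for the example, and the linked preprocessing script remains the reference implementation:

```python
# Sketch of CodeParrot-style file filters (illustrative only).
def passes_filters(source_code: str) -> bool:
    lines = source_code.splitlines()
    if not source_code or not lines:
        return False
    line_lengths = [len(line) for line in lines]
    if sum(line_lengths) / len(line_lengths) >= 100:   # average line length < 100
        return False
    if max(line_lengths) >= 1000:                      # maximum line length < 1000
        return False
    alnum_fraction = sum(ch.isalnum() for ch in source_code) / len(source_code)
    if alnum_fraction <= 0.25:                         # alphanumeric fraction > 0.25
        return False
    # crude keyword search for auto-generated files
    lowered = source_code.lower()
    if "auto-generated" in lowered or "autogenerated" in lowered:
        return False
    return True
```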
datasets/incoder.txt
ADDED
@@ -0,0 +1,16 @@
[InCoder](https://huggingface.co/facebook/incoder-6B) was trained on 216 GB of data from GitHub and StackOverflow, covering 28 programming languages: 52 GB are in Python, 107 GB in other programming languages, and 57 GB are StackOverflow content that isn't code.

The GitHub data used the following filtering:
- Average line length < 100
- Maximum line length < 3000
- Alphanumeric characters fraction > 0.4
- Remove auto-generated files (keyword search)

The second component of the data consists of questions, answers, and comments from StackOverflow. It includes:
- all questions that have at least one answer
- up to ten answers with a non-negative score (sorted by score) per question
- up to five comments per question/answer

Exact match deduplication was performed on code files.

For more details please refer to this [paper](https://arxiv.org/pdf/2204.05999.pdf).
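A small sketch of the StackOverflow selection rules above, assuming each question is a dict with `body`, `comments`, and `answers` fields (each answer carrying a `score`, `body`, and `comments`); this structure is hypothetical and only mirrors the bullet points, not the authors' actual pipeline:

```python
from typing import Optional

def select_stackoverflow_content(question: dict) -> Optional[dict]:
    answers = question.get("answers", [])
    # keep only questions that have at least one answer
    if not answers:
        return None
    # up to ten answers with a non-negative score, sorted by score
    kept_answers = sorted(
        (a for a in answers if a["score"] >= 0),
        key=lambda a: a["score"],
        reverse=True,
    )[:10]
    return {
        "question": question["body"],
        "question_comments": question.get("comments", [])[:5],  # up to five comments
        "answers": [
            {"body": a["body"], "comments": a.get("comments", [])[:5]}
            for a in kept_answers
        ],
    }
```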
datasets/opt.txt
ADDED
@@ -0,0 +1,2 @@
[OPT](https://huggingface.co/facebook/opt-30b) was trained on 5 filtered datasets of textual documents, one of which includes code: [The Pile](https://arxiv.org/pdf/2101.00027v1.pdf), from which the *Pile-CC, OpenWebText2, USPTO, Project Gutenberg, OpenSubtitles, Wikipedia, DM Mathematics and HackerNews* subsets were used.
The final training data contains 180B tokens, corresponding to 800GB of data. For more details please refer to this [paper](https://arxiv.org/abs/2205.01068).
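A quick back-of-envelope check relating the two figures above (180B tokens vs. 800GB of text):

```python
tokens = 180e9        # 180B training tokens
data_bytes = 800e9    # 800GB of training data
print(f"~{data_bytes / tokens:.1f} bytes of text per token")  # ~4.4
```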