update
evaluation/intro.txt +19 -1
evaluation/intro.txt
CHANGED
@@ -16,7 +16,25 @@ In most papers, 200 candidate program completions are sampled, and pass@1, pass@
|GPT-neo (1.5B)| 4.79% | 7.47% | 16.30% |
|GPT-J (6B)| 11.62% | 15.74% | 27.74% |

We can load the HumanEval dataset and the pass@k metric from the Hugging Face Hub:

```python
from datasets import load_dataset, load_metric

human_eval = load_dataset("openai_humaneval")
code_eval_metric = load_metric("code_eval")
```
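To get a feel for what is being evaluated, each HumanEval record bundles the prompt to complete, a reference solution, and the unit tests. Below is a quick, optional peek at the first of the 164 problems (field names as exposed by the `openai_humaneval` dataset, which ships a single `test` split):

```python
# Inspect the first HumanEval problem.
first_problem = human_eval["test"][0]

print(first_problem["task_id"])      # "HumanEval/0"
print(first_problem["prompt"])       # function signature + docstring the model must complete
print(first_problem["entry_point"])  # name of the function exercised by the unit tests
print(first_problem["test"])         # the unit tests, later passed to code_eval as references
```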

We can easily compute pass@k for a problem that asks for the implementation of a function that sums two integers:

```python
test_cases = ["assert add(2,3)==5"]
candidates = [["def add(a,b): return a*b", "def add(a, b): return a+b"]]
pass_at_k, results = code_eval_metric.compute(references=test_cases, predictions=candidates, k=[1, 2])
print(pass_at_k)
{'pass@1': 0.5, 'pass@2': 1.0}
```
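The values above follow from the unbiased pass@k estimator introduced in the Codex paper: if n candidates are sampled and c of them pass the unit tests, then pass@k = 1 - C(n-c, k) / C(n, k). The sketch below is just that arithmetic (not the metric's own implementation) and reproduces the numbers for n=2, c=1. One practical note: since `code_eval` executes model-generated code, it expects the opt-in environment variable `HF_ALLOW_CODE_EVAL="1"` to be set before calling `compute`.

```python
from math import comb

def estimate_pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn (without replacement)
    from n candidates, c of which are correct, passes the unit tests."""
    if n - c < k:  # fewer incorrect candidates than draws: a correct one is guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(estimate_pass_at_k(n=2, c=1, k=1))  # 0.5
print(estimate_pass_at_k(n=2, c=1, k=2))  # 1.0
```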

To better understand how the pass@k metric works, we will illustrate it with some concrete examples. We select two problems from the HumanEval dataset and see how CodeParrot 🦜 (110M) performs and which of its code completions pass the unit tests of the two problems below:
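Before turning to those problems, here is a rough end-to-end sketch of how such candidate completions can be generated and scored. The checkpoint name `codeparrot/codeparrot-small` and the sampling settings are illustrative assumptions, not the exact setup behind the completions shown below:

```python
import os

from transformers import pipeline

os.environ["HF_ALLOW_CODE_EVAL"] = "1"  # opt in: code_eval will execute the generated programs

# Assumed checkpoint for the small (~110M) CodeParrot model; swap in the model you actually use.
generator = pipeline("text-generation", model="codeparrot/codeparrot-small")

problem = human_eval["test"][0]
completions = generator(
    problem["prompt"],
    num_return_sequences=5,  # several samples per problem are needed to estimate pass@k
    do_sample=True,
    temperature=0.6,         # illustrative sampling settings
    max_new_tokens=128,
)

# code_eval runs each prediction together with its reference, so the prediction is the
# full program (the pipeline returns prompt + completion) and the reference is the
# problem's unit tests plus a call to its entry point.
candidates = [[c["generated_text"] for c in completions]]
test_case = problem["test"] + f'\ncheck({problem["entry_point"]})'

pass_at_k, results = code_eval_metric.compute(references=[test_case], predictions=candidates, k=[1, 5])
print(pass_at_k)
```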
#### Problem 1:

```python