A popular evaluation framework for code generation models is the [pass@k](https://huggingface.co./metrics/code_eval) metric on the [HumanEval](https://huggingface.co./datasets/openai_humaneval) dataset, which was introduced in the [Codex paper](https://arxiv.org/pdf/2107.03374v2.pdf). The dataset includes 164 handwritten programming problems. For pass@k, k code samples are generated per problem; a problem is considered solved if any sample passes the unit tests, and the total fraction of problems solved is reported. Below are the results for some selected models. For most models, we sample 200 candidate program completions and compute pass@1, pass@10, and pass@100 using an unbiased sampling estimator. The table below shows the HumanEval scores of CodeParrot, InCoder, the GPT-neo models, GPT-J and Codex (not open-source).

<div align="center">

| Model | pass@1 | pass@10 | pass@100 |
|-------|--------|---------|----------|
| CodeParrot (1.5B) | 3.58% | 8.03% | 14.96% |
|||||
| InCoder (6.7B) | 15.2% | 27.8% | 47.00% |
|||||
| Codex (25M) | 3.21% | 7.1% | 12.89% |
| Codex (300M) | 13.17% | 20.37% | 36.27% |
| Codex (12B) | 28.81% | 46.81% | 72.31% |
|||||
| GPT-neo (1.5B) | 4.79% | 7.47% | 16.30% |
| GPT-J (6B) | 11.62% | 15.74% | 27.74% |

</div>

To better understand how the pass@k metric works, we will illustrate it with some examples. We select 3 problems from the HumanEval dataset and see how the model performs and which code completions pass the unit tests. We will use CodeParrot 🦜 with the three problems below:

```python
from typing import List


def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """ Check if in given list of numbers, are any two numbers closer to each other than
    given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
```

```python
from typing import List


def separate_paren_groups(paren_string: str) -> List[str]:
    """ Input to this function is a string containing multiple groups of nested parentheses.
    Your goal is to separate those group into separate strings and return the list of those.
    Separate groups are balanced (each open brace is properly closed) and not nested within each other
    Ignore any spaces in the input string.
    >>> separate_paren_groups('( ) (( )) (( )( ))')
    ['()', '(())', '(()())']
    """
```

```python
def truncate_number(number: float) -> float:
    """ Given a positive floating point number, it can be decomposed into
    and integer part (largest integer smaller than given number) and decimals
    (leftover part always smaller than 1).

    Return the decimal part of the number.
    >>> truncate_number(3.5)
    0.5
    """
```

For each problem, instead of 200 candidate solutions, we will only generate 20 samples for illustration purposes. We use nucleus sampling with `top-p=0.95` and `temperature=0.2`. For more details about decoding strategies for language generation, we recommend this [blog post](https://huggingface.co./blog/how-to-generate). We will compute pass@1, pass@5 and pass@10, each corresponding to the unit test pass rate when selecting respectively 1, 5 and 10 samples from the candidate solutions.

If we take a closer look at the unit test results for each candidate solution in the three tasks, we find that only 3 of the 60 generated candidates pass the unit tests, which corresponds to a pass@1 of `3/60 = 0.05`. The pass@5 and pass@10 scores are higher, because the more samples we select from the candidate solutions, the more likely we are to include a correct one.
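This intuition is exactly what the unbiased estimator from the Codex paper captures. Below is a minimal sketch of that estimator; the per-problem counts of passing samples are made-up numbers for illustration, not the actual CodeParrot results:

```python
import numpy as np


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: total number of generated samples
    c: number of samples that pass the unit tests
    k: number of samples we get to pick
    """
    if n - c < k:
        # fewer than k incorrect samples: any selection of k contains a correct one
        return 1.0
    # 1 - probability that all k selected samples are incorrect
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))


# hypothetical distribution of the passing samples across the 3 problems (20 samples each)
n, correct_per_problem = 20, [2, 1, 0]
for k in [1, 5, 10]:
    score = np.mean([pass_at_k(n, c, k) for c in correct_per_problem])
    print(f"pass@{k} = {score:.3f}")
```

With these made-up counts, pass@1 works out to `3/60 = 0.05`, matching the computation above, while pass@5 and pass@10 come out higher.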
Without surprise, pass@10 is the highest of the three scores: if we select all the candidates, two tasks out of three get solved, so pass@k approaches `2/3 ≈ 0.67` as k grows toward the number of samples.
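To run this kind of evaluation without reimplementing the estimator, the [code_eval](https://huggingface.co./metrics/code_eval) metric linked above executes candidate solutions against their unit tests and applies the same pass@k formula. A minimal usage sketch, assuming the `evaluate` library and a toy problem that is not taken from HumanEval:

```python
import os

from evaluate import load

# executing model-generated code must be explicitly allowed
os.environ["HF_ALLOW_CODE_EVAL"] = "1"

code_eval = load("code_eval")

# one toy problem: a unit test and two candidate completions (made up for illustration)
test_cases = ["assert add(2, 3) == 5"]
candidates = [[
    "def add(a, b):\n    return a * b",   # fails the test
    "def add(a, b):\n    return a + b",   # passes the test
]]

pass_at_k, results = code_eval.compute(references=test_cases, predictions=candidates, k=[1, 2])
print(pass_at_k)  # one of the two candidates is correct, so pass@1 = 0.5 and pass@2 = 1.0
```

For HumanEval itself, `test_cases` would hold the unit tests of the 164 problems and `candidates` the sampled completions per problem, with `k=[1, 10, 100]`.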