A popular evaluation framework for code generation models is the [pass@k](https://huggingface.co./metrics/code_eval) metric on the [HumanEval](https://huggingface.co./datasets/openai_humaneval) dataset, which was introduced in the [Codex paper](https://arxiv.org/pdf/2107.03374v2.pdf). The dataset includes 164 handwritten programming problems. In the pass@k metric, k code samples are generated per problem, a problem is considered solved if any sample passes the unit tests, and the total fraction of problems solved is reported. In most papers, 200 candidate program completions are sampled per problem, and pass@1, pass@10, and pass@100 are computed using an unbiased sampling estimator. Table 1 below shows the HumanEval scores of CodeParrot, InCoder, PolyCoder, CodeGen and Codex (not open-source).

<div align="center">

|Model | pass@1 | pass@10 | pass@100|
|-------|--------|---------|---------|
|CodeParrot (110M) | 3.80% | 6.57% | 12.78% |
|CodeParrot (1.5B) | 3.58% | 8.03% | 14.96% |
|||||
|InCoder (6.7B) | 15.2% | 27.8% | 47.00% |
|||||
|PolyCoder (160M)| 2.13% | 3.35% | 4.88% |
|PolyCoder (400M)| 2.96% | 5.29% | 11.59% |
|PolyCoder (2.7B)| 5.59% | 9.84% | 17.68% |
|||||
|CodeGen-Mono (350M)| 12.76% | 23.11% | 35.19% |
|CodeGen-Mono (2.7B)| 23.70% | 36.64% | 57.01% |
|CodeGen-Mono (16.1B)| **29.28%** | **49.86%** | **75.00%** |
|||||
|Codex (25M)| 3.21% | 7.1% | 12.89%|
|Codex (300M)| 13.17%| 20.37% | 36.27% |
|Codex (12B)| 28.81%| 46.81% | 72.31% |

</div>
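To make the unbiased sampling estimator mentioned above concrete, here is a minimal sketch of the formula given in the Codex paper: for a problem with n generated samples, of which c pass the unit tests, pass@k is estimated as 1 - C(n-c, k) / C(n, k), and the reported score is the mean over all problems. The function names `pass_at_k` and `mean_pass_at_k` and the per-problem counts in the usage lines are hypothetical and chosen only for illustration; this is not the code behind the `code_eval` metric.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: number of samples generated for the problem
    c: number of those samples that pass the unit tests
    k: the k in pass@k (must satisfy 1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # With fewer than k failing samples, every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-problem estimates over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts of passing samples for three problems, n = 200 each.
example_counts = [0, 50, 200]
example_pass_at_1 = mean_pass_at_k(example_counts, n=200, k=1)
example_pass_at_10 = mean_pass_at_k(example_counts, n=200, k=10)
```

For k = 1 the estimate reduces to c / n for each problem, so the first usage line averages 0, 0.25, and 1. Sampling more completions than k (for example n = 200 for pass@100) reduces the variance of the estimate compared with drawing exactly k samples.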