Compatibilized CodeParrot 🦜 (small)

This is the compatibilized version of CodeParrot 🦜, a GPT-2 model (110M parameters) trained to generate Python code.

The compatibilization is based on the sequential-rationales process formulated by Vafa et al.

Usage

You can load the CodeParrot model and tokenizer directly with transformers and use the Galeras dataset to sample prompts for the model:

from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("semeru/compatible-codeparrot-small")
model = AutoModelWithLMHead.from_pretrained("semeru/compatible-codeparrot-small")

# df_sampled_code is a pandas DataFrame sampled from the Galeras dataset,
# with 'prompt' and 'ground_truth' columns.
df_sampled_code['size'] = df_sampled_code['ground_truth'].map(lambda code: len(tokenizer(code)['input_ids']))
df_sampled_code['input_ids'] = tokenizer(df_sampled_code['prompt'].tolist())['input_ids']
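
Once the prompts are tokenized, completions can be sampled with the standard transformers generation API. The snippet below is a minimal sketch: the sampling parameters and the use of the first Galeras prompt are illustrative assumptions, not part of the released evaluation setup.

import torch

# Hypothetical example: generate a completion for the first sampled prompt.
prompt_ids = torch.tensor([df_sampled_code['input_ids'].iloc[0]])
with torch.no_grad():
    output = model.generate(
        prompt_ids,
        max_new_tokens=128,               # length of the generated completion
        do_sample=True,                   # sample instead of greedy decoding
        top_p=0.95,
        temperature=0.2,
        pad_token_id=tokenizer.eos_token_id,
    )
# Strip the prompt tokens and decode only the newly generated code.
completion = tokenizer.decode(output[0][prompt_ids.shape[1]:])
print(completion)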

Training

The model was trained on the cleaned CodeParrot 🦜 dataset with the following settings:

| Config | Value |
|---|---|
| Batch size | 192 |
| Context size | 1024 |
| Training steps | 150,000 |
| Gradient accumulation | 1 |
| Gradient checkpointing | False |
| Learning rate | 5e-4 |
| Weight decay | 0.1 |
| Warmup steps | 2000 |
| Schedule | Cosine |

The training was executed on 16 x A100 (40GB) GPUs. This setting amounts to roughly 29 billion training tokens (192 sequences × 1024 tokens × 150,000 steps).
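
For illustration, the settings above map onto a transformers TrainingArguments configuration roughly as sketched below. This is a hedged sketch, not the actual training script: the per-device batch size assumes the global batch of 192 is split evenly across the 16 GPUs, and the output directory is a placeholder.

from transformers import TrainingArguments

# Illustrative mapping of the reported hyperparameters; the original
# training setup may differ, this only mirrors the values in the table.
training_args = TrainingArguments(
    output_dir="codeparrot-small",    # placeholder
    per_device_train_batch_size=12,   # 192 global batch / 16 GPUs
    gradient_accumulation_steps=1,
    gradient_checkpointing=False,
    max_steps=150_000,
    learning_rate=5e-4,
    weight_decay=0.1,
    warmup_steps=2_000,
    lr_scheduler_type="cosine",
)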

Performance

We evaluated the model on OpenAI's HumanEval benchmark, which consists of programming challenges:

| Metric | Value |
|---|---|
| pass@1 | 3.80% |
| pass@10 | 6.57% |
| pass@100 | 12.78% |

The pass@k metric estimates the probability that at least one out of k generated samples passes the unit tests.
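
In practice, pass@k is usually computed per problem with the unbiased estimator from the HumanEval paper: generate n ≥ k samples, count the number c that pass, and estimate pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch follows; the value of n used for this evaluation is not stated here, so the numbers in the example are purely illustrative.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for a single problem.

    n: total samples generated, c: samples that passed, k: evaluation budget.
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative example: 200 samples generated, 10 pass the tests.
print(pass_at_k(n=200, c=10, k=10))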

Resources

Dataset used to train semeru/compatible-codeparrot-small: the cleaned CodeParrot 🦜 dataset (see the Training section above).