GPT2 PyCode

This model is a fine-tuned version of the GPT-2 124M model, adapted for Python code generation and intended for testing purposes. It was trained on a small corpus of 25,000 Python code samples.

Model Description

This project features a GPT-2 (Generative Pre-trained Transformer) language model with 124 million parameters, the smallest GPT-2 variant, fine-tuned for Python code generation. Unlike the larger GPT-2 variants or GPT-3, it is a small-scale model designed primarily for testing and experimentation.

  • Developed by: Maharnab Saikia
  • Model type: Language model
  • Language(s) (NLP): English
  • License: MIT
  • Finetuned from model: GPT2 124M

Uses

  • Research: Studying the behavior of small-scale language models in code generation tasks
  • Benchmarking: Providing a baseline for comparing different model architectures or training strategies
  • Rapid Prototyping: Quick tests of code generation ideas without the overhead of larger models
  • Education: Demonstrating the principles of fine-tuning language models for specific tasks

Bias, Risks, and Limitations

It's crucial to understand the limitations of this model:

  • Limited knowledge base due to the small training corpus
  • May struggle with complex or specialized Python code
  • Not suitable for production-level code generation tasks
  • Performance will likely be significantly lower than that of larger, more comprehensively trained models
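
Because a model this small may produce text that is not valid Python, one way to guard against that is to parse each generation before using it. The following is a minimal sketch: `is_valid_python` is a hypothetical helper name, not part of this model or of `transformers`, and a successful parse only shows that the text is syntactically valid, not that it is correct or safe to run.

```python
import ast


def is_valid_python(source: str) -> bool:
    """Return True if `source` parses as Python. The code is never executed."""
    try:
        ast.parse(source)
    except (SyntaxError, ValueError):
        return False
    return True


# Two hand-written stand-ins for model output (not real generations).
samples = [
    "def reverse(s):\n    return s[::-1]\n",  # well-formed
    "def reverse(s) return s[::-1]",           # missing colon and newline
]
results = [is_valid_python(code) for code in samples]
print(results)
```

A check like this could be used to filter generations when benchmarking, for example by reporting the fraction of outputs that parse.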

How to Get Started with the Model

Use the code below to get started with the model.

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch
import re


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = GPT2Tokenizer.from_pretrained('maharnab/gpt2_pycode')
model = GPT2LMHeadModel.from_pretrained('maharnab/gpt2_pycode')
model.to(device)

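prompt = "How to reverse a string in Python."
# Note: truncation at max_length=20 tokens cuts from the right, so a longer prompt loses the trailing <assistant> tag; raise max_length if needed.
encoded_input = tokenizer.encode_plus(f"<sos><user>{prompt}</user><assistant>", max_length=20, truncation=True, return_tensors="pt").to(device)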

input_ids = encoded_input['input_ids']
attention_mask = encoded_input['attention_mask']

output = model.generate(
    input_ids, 
    max_length=512, 
    num_return_sequences=1, 
    no_repeat_ngram_size=2,
    temperature=0.7,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    attention_mask=attention_mask,
    pad_token_id=tokenizer.pad_token_id
)

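generated_code = tokenizer.decode(output[0])
# Generation can stop before emitting </assistant>; fall back to the full text instead of failing on a None match.
match = re.search(r'<assistant>(.*?)</assistant>', generated_code, re.DOTALL)
generated_code = match.group(1) if match else generated_code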

print(f"Prompt: {prompt}\nGenerated Code:\n{generated_code}")
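
The prompt format and the output parsing in the snippet above can be separated from the model call so they can be checked without downloading any weights. This is a sketch only: `build_prompt` and `extract_assistant` are hypothetical helper names, and the tag layout is taken from the example above rather than from any documented specification of the model.

```python
import re
from typing import Optional


def build_prompt(question: str) -> str:
    """Wrap a question in the tags used by the example above."""
    return f"<sos><user>{question}</user><assistant>"


def extract_assistant(decoded: str) -> Optional[str]:
    """Return the text after <assistant>, up to </assistant> or the end of the string.

    Returns None when no <assistant> tag is present at all.
    """
    match = re.search(r"<assistant>(.*?)(?:</assistant>|$)", decoded, re.DOTALL)
    return match.group(1).strip() if match else None


# Hand-written stand-in for a decoded generation (not real model output).
decoded = build_prompt("How to reverse a string in Python.") + "s[::-1]</assistant>"
print(extract_assistant(decoded))
```

Accepting a missing closing tag is a design choice here: a generation that hits `max_length` before closing the tag still yields usable text instead of raising an error.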

Training Details

Training Data

  • Model: GPT-2 with 124 million parameters
  • Training Data: 25,000 Python code samples
  • Fine-tuning: Adapted specifically for Python code generation tasks

Training Hyperparameters

  • Epochs: 5
  • Batch Size: 8
  • Learning Rate: 5e-5
  • Context Window: 512
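
The hyperparameters above imply a rough number of optimizer steps. The sketch below works through that arithmetic, assuming all 25,000 samples were used for training with no held-out split, no gradient accumulation, and a single device; none of these are stated in this card.

```python
import math

# Values taken from the card.
num_samples = 25_000
batch_size = 8
epochs = 5
context_window = 512

# Assumption: every sample is seen once per epoch and each batch is one optimizer step.
steps_per_epoch = math.ceil(num_samples / batch_size)
total_steps = steps_per_epoch * epochs

# Upper bound on tokens per batch if every sample filled the context window.
max_tokens_per_batch = batch_size * context_window

print(steps_per_epoch, total_steps, max_tokens_per_batch)  # 3125 15625 4096
```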

Environmental Impact

Carbon emissions were estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: P100 GPU
  • Hours used: 5
  • Cloud Provider: Kaggle
  • Compute Region: South Asia
  • Carbon Emitted: 1.15 kg CO2eq
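
Estimates of this kind are typically the product of hardware power draw, running time, and the carbon intensity of the local grid. The sketch below illustrates that calculation under stated assumptions: a 250 W power draw for the P100 and kg CO2eq as the unit of the reported figure are both assumptions not given in this card, and the grid intensity is back-solved from the reported number rather than taken from any source.

```python
def estimate_emissions_kg(power_watts: float, hours: float,
                          intensity_kg_per_kwh: float) -> float:
    """Energy used (kWh) times grid carbon intensity (kg CO2eq per kWh)."""
    energy_kwh = power_watts / 1000 * hours
    return energy_kwh * intensity_kg_per_kwh


power_watts = 250.0    # assumption: P100 board power, not stated in the card
hours = 5.0            # from the card
reported_kg = 1.15     # from the card; the kg CO2eq unit is an assumption

energy_kwh = power_watts / 1000 * hours
# Grid intensity that would reproduce the reported figure under these assumptions.
implied_intensity = reported_kg / energy_kwh

print(f"{energy_kwh:.2f} kWh, implied intensity {implied_intensity:.2f} kg CO2eq/kWh")
```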

Acknowledgements

This project builds upon the GPT-2 model developed by OpenAI. We acknowledge their groundbreaking work in the field of natural language processing.
