---
language:
  - multilingual
tags:
  - code
  - sentence embedding
license: mit
datasets:
  - CodeSearchNet
pipeline_tag: feature-extraction
---

# Model Card for CodeCSE

A simple pretrained model for code and comment sentence embeddings, trained with contrastive learning. This model was pretrained on the CodeSearchNet dataset.

To use this pretrained model, please clone the [CodeCSE repository](https://github.com/emu-se/CodeCSE), which provides GraphCodeBERTForCL and the other dependencies.

Detailed instructions are in the repository's README.md. Overall, you will need:

  1. GraphCodeBERT (CodeCSE uses GraphCodeBERT's input format for code)
  2. GraphCodeBERTForCL defined in codecse/codecse

## Inference example

NL input example (`example_nl.json`):

```json
{
    "original_string": "",
    "docstring_tokens": ["Save", "model", "to", "a", "pickle", "located", "at", "path"],
    "url": "https://github.com/openai/baselines/blob/3301089b48c42b87b396e246ea3f56fa4bfc9678/baselines/deepq/deepq.py#L55-L72"
}
```
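The `docstring_tokens` field is a pre-tokenized docstring. A common GraphCodeBERT-style convention (an assumption here, not stated on this card) is to join the tokens back into a single string before feeding them to the tokenizer:

```python
# Hypothetical preprocessing sketch: rebuild the NL query string from
# the pre-tokenized docstring_tokens field of example_nl.json.
example = {
    "docstring_tokens": ["Save", "model", "to", "a", "pickle",
                         "located", "at", "path"],
}
nl_text = " ".join(example["docstring_tokens"])
print(nl_text)  # Save model to a pickle located at path
```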

Code snippet to get the embedding of an NL document (link to complete code):

```python
import torch

# load_example, prepare_inputs, tokenizer, args, and model are all
# provided by the CodeCSE repository (see its README.md for setup).
nl_json = load_example("example_nl.json")
batch = prepare_inputs(nl_json, tokenizer, args)
nl_inputs = batch[3]
with torch.no_grad():
    # sent_emb="nl" selects the natural-language embedding head;
    # index [1] is the sentence embedding of the input.
    nl_vec = model(input_ids=nl_inputs, sent_emb="nl")[1]
```
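Once code snippets are embedded the same way (the repository's README shows the analogous code-side call), code search reduces to a cosine-similarity ranking. A minimal sketch with random placeholder vectors standing in for real model outputs (the 768 dimension is an assumption based on GraphCodeBERT's hidden size):

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings: nl_vec would come from the NL forward pass
# above, code_vecs from embedding each candidate code snippet.
nl_vec = torch.randn(1, 768)     # one query embedding
code_vecs = torch.randn(5, 768)  # five candidate code embeddings

# Cosine similarity between the query and every candidate,
# then rank candidates from most to least similar.
scores = F.cosine_similarity(nl_vec, code_vecs, dim=-1)
ranking = torch.argsort(scores, descending=True)
print(ranking.tolist())
```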