based-CodeBERTa-language-id-llm-module_uniVienna

This model is a fine-tuned version of malteklaes/based-CodeBERTa-language-id-llm-module.

Model description and Framework version

RobertaTokenizerFast(name_or_path='malteklaes/based-CodeBERTa-language-id-llm-module_uniVienna', vocab_size=52000, model_max_length=512, is_fast=True, padding_side='right', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'sep_token': '</s>', 'pad_token': '<pad>', 'cls_token': '<s>', 'mask_token': '<mask>'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
    0: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
    1: AddedToken("<pad>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
    2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
    3: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
    4: AddedToken("<mask>", rstrip=False, lstrip=True, single_word=False, normalized=False, special=True),
}
  • complete model-config:
RobertaConfig {
  "_name_or_path": "malteklaes/based-CodeBERTa-language-id-llm-module_uniVienna",
  "_num_labels": 7,
  "architectures": [
    "RobertaForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "go",
    "1": "java",
    "2": "javascript",
    "3": "php",
    "4": "python",
    "5": "ruby",
    "6": "cpp"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "cpp": 6,
    "go": 0,
    "java": 1,
    "javascript": 2,
    "php": 3,
    "python": 4,
    "ruby": 5
  },
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "problem_type": "single_label_classification",
  "torch_dtype": "float32",
  "transformers_version": "4.39.3",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 52000
}

Intended uses & limitations

For a given code, the following programming language can be determined:

  • Go
  • Java
  • Javascript
  • PHP
  • Python
  • Ruby
  • C++

Usage

checkpoint = "malteklaes/based-CodeBERTa-language-id-llm-module_uniVienna"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
modelPOST = AutoTokenizer.from_pretrained(checkpoint)

myPipeline = TextClassificationPipeline(
    model=AutoModelForSequenceClassification.from_pretrained(checkpoint, ignore_mismatched_sizes=True),
    tokenizer=AutoTokenizer.from_pretrained(checkpoint)
)

CODE_TO_IDENTIFY_py = """
def is_prime(n):
    if n <= 1:
        return False
    if n == 2 or n == 3:
        return True
    if n % 2 == 0:
        return False
    max_divisor = int(n ** 0.5)
    for i in range(3, max_divisor + 1, 2):
        if n % i == 0:
            return False
    return True

number = 17
if is_prime(number):
    print(f"{number} is a prime number.")
else:
    print(f"{number} is not a prime number.")

"""

myPipeline(CODE_TO_IDENTIFY_py) # output: [{'label': 'python', 'score': 0.9999967813491821}]

Training and evaluation data

Training-Datasets used

Training procedure

  • machine: GPU T4 (Google Colab)
    • system-RAM: 4.7/12.7 GB (during training)
    • GPU-RAM: 2.8/15.0GB
    • Drive: 69.5/78.5 GB (during training due to complete )
  • trainer.train(): [x/24136 xx:xx < 31:12, 12.92 it/s, Epoch 0.01/1]
    • total 24136 iterations

Training note

  • Although this model is based on the predecessors mentioned above, this model had to be trained from scratch because the config.json and labels of the original model were changed from 6 to 7 programming languages.

Training hyperparameters

The following hyperparameters were used during training (training args):

training_args = TrainingArguments(
    output_dir="./based-CodeBERTa-language-id-llm-module_uniVienna",
    overwrite_output_dir=True,
    num_train_epochs=0.1,
    per_device_train_batch_size=8,
    save_steps=500,
    save_total_limit=2,
)

Training results

  • output:
TrainOutput(global_step=24136, training_loss=0.005988701689750161, metrics={'train_runtime': 1936.0586, 'train_samples_per_second': 99.731, 'train_steps_per_second': 12.467, 'total_flos': 3197518224531456.0, 'train_loss': 0.005988701689750161, 'epoch': 0.1})
Downloads last month
10
Safetensors
Model size
83.5M params
Tensor type
F32
·
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Model tree for malteklaes/based-CodeBERTa-language-id-llm-module_uniVienna

Finetuned
(1)
this model

Dataset used to train malteklaes/based-CodeBERTa-language-id-llm-module_uniVienna