This is a RoBERTa model trained on program code: WolfSSL plus examples of cybersecurity vulnerabilities related to input validation, mixed with Linux kernel code. The model is pre-trained to understand the concept of a singleton in the code.

The training code is written in C/C++, but inference can also be run on code in other languages.

The model can be used to fill in masked tokens in the following way:

from transformers import pipeline
unmasker = pipeline('fill-mask', model='mstaron/CyBERTa')
unmasker("Hello I'm a <mask> model.")

The embeddings for a downstream task can be obtained in the following way:

# import the model via the huggingface library
from transformers import AutoTokenizer, AutoModelForMaskedLM

# load the tokenizer for the pretrained CyBERTa
tokenizer = AutoTokenizer.from_pretrained('mstaron/CyBERTa')

# load the model
model = AutoModelForMaskedLM.from_pretrained("mstaron/CyBERTa")

# import the feature extraction pipeline
from transformers import pipeline

# create the pipeline, which will extract the embedding vectors
# the model is already pre-trained, so we do not need to train anything here
features = pipeline(
    "feature-extraction",
    model=model,
    tokenizer=tokenizer,
    return_tensors=False
)

# extract the features == embeddings
lstFeatures = features('Class HTTP::X1')

# show the embedding of the first token, <s> (RoBERTa's counterpart of BERT's [CLS]),
# which is often used as an approximation of the whole-sentence embedding;
# an alternative is mean pooling over all tokens: np.mean(lstFeatures[0], axis=0)
lstFeatures[0][0]
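
The comments above mention two ways of turning per-token vectors into a single vector: taking the first token's vector and averaging over all tokens. These generally give different results. The following is a minimal sketch in plain Python of both options; `toy_features` is a made-up stand-in that only mimics the nested-list shape of the pipeline output (texts, tokens, dimensions), and the helper names are hypothetical.

```python
# Toy stand-in for the pipeline output: 1 input text, 3 tokens, 4 dimensions.
# The numbers are invented for illustration; real vectors are much longer.
toy_features = [[
    [1.0, 0.0, 2.0, 4.0],  # first token (<s>)
    [3.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 1.0, 2.0],  # last token
]]


def first_token_embedding(features):
    """Return the vector of the first token of the first input text."""
    return features[0][0]


def mean_pooled_embedding(features):
    """Average the token vectors of the first input text, one dimension at a time."""
    token_vectors = features[0]
    n_tokens = len(token_vectors)
    # zip(*token_vectors) groups the values of each dimension across all tokens
    return [sum(column) / n_tokens for column in zip(*token_vectors)]


print(first_token_embedding(toy_features))  # [1.0, 0.0, 2.0, 4.0]
print(mean_pooled_embedding(toy_features))  # [2.0, 2.0, 1.0, 2.0]
```

With the real pipeline output, the same helpers could be called on `lstFeatures` instead of `toy_features`.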

To use the model for a specific purpose, it first needs to be trained on the downstream task.
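
The card does not say which downstream task or training procedure is intended. As one possible illustration, the sketch below trains a very simple nearest-centroid classifier on top of fixed embedding vectors, which is a different (and much lighter) technique than fine-tuning the model itself. The three-dimensional vectors, the labels `"vulnerable"` and `"safe"`, and all function names are assumptions made up for this example; in practice the vectors would come from the feature-extraction step.

```python
import math


def centroid(vectors):
    """Return the dimension-wise mean of a list of equally long vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def train(examples):
    """Compute one centroid per label from (vector, label) pairs."""
    by_label = {}
    for vector, label in examples:
        by_label.setdefault(label, []).append(vector)
    return {label: centroid(vectors) for label, vectors in by_label.items()}


def predict(centroids, vector):
    """Return the label whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda label: math.dist(centroids[label], vector))


# Hypothetical labelled embeddings (invented numbers, not model output).
training_examples = [
    ([0.9, 0.1, 0.0], "vulnerable"),
    ([0.8, 0.2, 0.1], "vulnerable"),
    ([0.1, 0.9, 0.8], "safe"),
    ([0.2, 0.8, 0.9], "safe"),
]

classifier = train(training_examples)
print(predict(classifier, [0.85, 0.15, 0.05]))  # vulnerable
```

A classifier like this keeps the pre-trained model frozen; whether that is sufficient, or whether the model weights themselves should be updated, depends on the task and the amount of labelled data.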
