CodeMorph-ModernBERT

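Overview

CodeMorph-ModernBERT is a pretrained model trained from scratch for code search and code understanding tasks. It was trained on the code-search-net/code_search_net dataset to strengthen semantic understanding of code. It supports a maximum sequence length of 2048 tokens (Microsoft's earlier models support 512) and performs particularly well on Python code search.

Architecture: ModernBERT-based
Purpose: Code search / Code understanding / Code completion
Training data: CodeSearchNet (all languages)
License: Apache 2.0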

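Key Features

Long sequence support
Handles sequences of up to 2048 tokens, so it can process long code and complex functions.
High code search performance
Uses a SentencePiece tokenizer built for six languages, including Python, and achieves substantially higher search accuracy than earlier models.
Purpose-trained model
Trained from scratch on the CodeSearchNet dataset, so it learns code-specific syntax and the relationship between code and comments.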

Model Parameters

The model is configured with the following parameters.

Parameter	Value
vocab_size	50000
hidden_size	768
num_hidden_layers	12
num_attention_heads	12
intermediate_size	3072
max_position_embeddings	2048
type_vocab_size	2
hidden_dropout_prob	0.1
attention_probs_dropout_prob	0.1
local_attention_window	128
rope_theta	160000
local_attention_rope_theta	10000
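
As a quick sanity check on the table above, the following sketch derives two quantities that follow directly from the listed values: the per-head dimension and the size of the token-embedding matrix. The `ModelParams` dataclass is a hypothetical helper written only for this illustration; it is not part of the model's code or of the Transformers library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelParams:
    """Hypothetical container for a few values from the parameter table."""
    vocab_size: int = 50000
    hidden_size: int = 768
    num_hidden_layers: int = 12
    num_attention_heads: int = 12
    max_position_embeddings: int = 2048

    def head_dim(self) -> int:
        # Each attention head works on an equal slice of the hidden vector.
        if self.hidden_size % self.num_attention_heads != 0:
            raise ValueError("hidden_size must be divisible by num_attention_heads")
        return self.hidden_size // self.num_attention_heads

    def token_embedding_params(self) -> int:
        # One hidden_size-dimensional vector per vocabulary entry.
        return self.vocab_size * self.hidden_size


params = ModelParams()
print(params.head_dim())                # 64
print(params.token_embedding_params())  # 38400000
```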

How to Use the Model

The model can be loaded with the Hugging Face Transformers library. (Note: Transformers 4.48.0 or later is required.)

A simple working example is available here.

Load the Model

from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeMorph-ModernBERT")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeMorph-ModernBERT")

Fill-Mask

from transformers import pipeline

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
# Use the tokenizer's own mask token instead of hard-coding "[MASK]".
print(fill_mask(f"def add_numbers(a, b): return a + {tokenizer.mask_token}"))

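Obtain Code Embeddings

import torch

def get_embedding(text, model, tokenizer, device=None):
    # Fall back to CPU when no GPU is available.
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    # max_length can be raised up to 2048, the model's maximum sequence length.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    # Drop token_type_ids if the tokenizer produces them.
    inputs.pop("token_type_ids", None)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        # model.model is the base encoder underneath the masked-LM head.
        outputs = model.model(**inputs)
    # Use the hidden state of the first token as the embedding.
    return outputs.last_hidden_state[:, 0, :]

embedding = get_embedding("def my_function(): pass", model, tokenizer)
print(embedding.shape)  # torch.Size([1, 768]), since hidden_size is 768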
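
Once embeddings are available, a code search step can be sketched as ranking candidate snippets by cosine similarity to the query embedding. The example below uses plain Python lists and toy 3-dimensional vectors in place of real 768-dimensional embeddings. Using cosine similarity on the first-token embedding is an assumption made for illustration; the scoring the author used in the evaluations is not described here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_candidates(query_vec, candidate_vecs):
    """Return candidate indices sorted from most to least similar."""
    scores = [cosine_similarity(query_vec, c) for c in candidate_vecs]
    return sorted(range(len(candidate_vecs)), key=lambda i: scores[i], reverse=True)


# Toy vectors standing in for real embeddings.
query = [1.0, 0.0, 1.0]
candidates = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # partially aligned
]
print(rank_candidates(query, candidates))  # [1, 2, 0]
```

A tensor returned by `get_embedding` can be converted to such a list with `embedding[0].detach().cpu().tolist()`.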

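Dataset

The model was trained on the code-search-net/code_search_net dataset. The dataset contains code snippets in multiple programming languages (Python, Java, JavaScript, and others), which makes it well suited to code search tasks.

Evaluation Results

The model was evaluated on the Python subset of the code_x_glue_ct_code_to_text dataset. The main evaluation metrics are listed below. For details of the experiment, see here.

Metric	Score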
MRR (Mean Reciprocal Rank)	0.8172
MAP (Mean Average Precision)	0.8172
R-Precision	0.7501
Recall@10	0.9389
Precision@10	0.8143
NDCG@10	0.8445
F1@10	0.8423
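
The MRR and MAP values reported on this page are identical in every row. That is the expected outcome when each query has exactly one relevant candidate, because average precision then reduces to the reciprocal of the rank at which that candidate appears. The sketch below illustrates this with made-up ranks. That the evaluation uses one relevant snippet per query is an assumption inferred from the numbers, not something the author states.

```python
def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the relevant item for each query."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def average_precision(ranked_relevance):
    """ranked_relevance: 0/1 flags in ranked order for one query."""
    hits = 0
    precision_sum = 0.0
    for position, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / position
    return precision_sum / hits if hits else 0.0


# Hypothetical ranks of the single relevant snippet for three queries.
ranks = [1, 2, 4]
pool_size = 5
relevance_lists = [
    [1 if pos == r else 0 for pos in range(1, pool_size + 1)] for r in ranks
]

mrr = mean_reciprocal_rank(ranks)
map_score = sum(average_precision(rel) for rel in relevance_lists) / len(relevance_lists)
print(round(mrr, 4), round(map_score, 4))  # 0.5833 0.5833
```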

Comparison with Other Models

The table below compares CodeMorph-ModernBERT with other widely used code search models.

Model	MRR	MAP	R-Precision
CodeMorph-ModernBERT	0.8172	0.8172	0.7501
microsoft/graphcodebert-base	0.5482	0.5482	0.4458
microsoft/codebert-base-mlm	0.5243	0.5243	0.4378
Salesforce/codet5p-220m-py	0.7512	0.7512	0.6617
Salesforce/codet5-large-ntp-py	0.7846	0.7846	0.7067
Shuu12121/CodeMorph-BERT	0.6851	0.6851	0.5934
Shuu12121/CodeMorph-BERTv2	0.6535	0.6535	0.5543

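Code Search Model Evaluation Results (google/code_x_glue_tc_nl_code_search_adv Dataset, Test)

The table below summarizes the results of various code search models on the test split of the google/code_x_glue_tc_nl_code_search_adv dataset. The candidate pool size is 100 in all cases. The code for this additional experiment is available here.

Model	MRR	MAP	R-Precision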
Shuu12121/CodeMorph-ModernBERT	0.6107	0.6107	0.5038
Salesforce/codet5p-220m-py	0.5037	0.5037	0.3805
Salesforce/codet5-large-ntp-py	0.4872	0.4872	0.3658
microsoft/graphcodebert-base	0.3844	0.3844	0.2764
microsoft/codebert-base-mlm	0.3766	0.3766	0.2683
Shuu12121/CodeMorph-BERTv2	0.3142	0.3142	0.2166
Shuu12121/CodeMorph-BERT	0.2978	0.2978	0.1992

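CodeMorph-ModernBERT achieves higher search accuracy than the CodeBERT and CodeT5 models compared here.

Evaluation Results Across Multiple Languages

CodeMorph-ModernBERT shows solid code search performance in several languages. The table below summarizes the main metrics (MRR, MAP, R-Precision) for each language. This experiment used a sample of 1,000 examples rather than the full data; see this notebook for details.

Language	MRR	MAP	R-Precision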
Python	0.8098	0.8098	0.7520
Java	0.6437	0.6437	0.5480
JavaScript	0.5928	0.5928	0.4880
PHP	0.7512	0.7512	0.6710
Ruby	0.7188	0.7188	0.6310
Go	0.5358	0.5358	0.4320

Although the scores vary by language, CodeMorph-ModernBERT maintains high search accuracy overall, with its strongest results in Python and PHP.

Salesforce/codet5p-220m-bimodal, by contrast, scores higher than CodeMorph-ModernBERT in every language, as shown below.

Language	MRR	MAP	R-Precision
Python	0.8322	0.8322	0.7660
Java	0.8886	0.8886	0.8390
JavaScript	0.7611	0.7611	0.6710
PHP	0.8985	0.8985	0.8530
Ruby	0.7635	0.7635	0.6740
Go	0.8127	0.8127	0.7260

However, on a different dataset, the test split of google/code_x_glue_tc_nl_code_search_adv, CodeMorph-ModernBERT scores higher, as shown below. This suggests that CodeMorph-ModernBERT may have an advantage on harder tasks and in generalization on Python.

Model	MRR	MAP	R-Precision
Shuu12121/CodeMorph-ModernBERT	0.6107	0.6107	0.5038
Salesforce/codet5p-220m-bimodal	0.5326	0.5326	0.4208

License

This model is released under the Apache-2.0 license.

Contact

If you have any questions about this model, please send them to this email address: [email protected]

CodeMorph-ModernBERT-English-ver

Overview

CodeMorph-ModernBERT is a pretrained model trained from scratch for code search and code understanding tasks. It was trained on the code-search-net/code_search_net dataset to strengthen semantic understanding of code.
It supports a maximum sequence length of 2048 tokens (Microsoft's earlier models support 512) and performs particularly well on Python code search.

Architecture: ModernBERT-based
Purpose: Code search / Code understanding / Code completion
Training Data: CodeSearchNet (all languages)
License: Apache 2.0

Key Features

Long Sequence Support
Handles sequences of up to 2048 tokens, making it suitable for long and complex functions.
High Code Search Performance
Uses a SentencePiece-based tokenizer trained on six programming languages, achieving significantly improved search accuracy over previous models.
Specifically Trained Model
Trained from scratch using the CodeSearchNet dataset, enabling deep understanding of programming syntax and comments.

Model Parameters

The model is designed with the following parameters:

Parameter Name	Value
vocab_size	50000
hidden_size	768
num_hidden_layers	12
num_attention_heads	12
intermediate_size	3072
max_position_embeddings	2048
type_vocab_size	2
hidden_dropout_prob	0.1
attention_probs_dropout_prob	0.1
local_attention_window	128
rope_theta	160000
local_attention_rope_theta	10000

How to Use the Model

The model can be easily loaded using the Hugging Face Transformers library.
(Note: Requires Transformers version 4.48.0 or later.)

Example usage is available here

Load the Model

from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Shuu12121/CodeMorph-ModernBERT")
model = AutoModelForMaskedLM.from_pretrained("Shuu12121/CodeMorph-ModernBERT")

Fill-Mask (Code Completion)

from transformers import pipeline

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
# Use the tokenizer's own mask token instead of hard-coding "[MASK]".
print(fill_mask(f"def add_numbers(a, b): return a + {tokenizer.mask_token}"))
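
The fill-mask pipeline returns a list of dictionaries, one per candidate token, each with `score`, `token`, `token_str`, and `sequence` keys. The sketch below shows one way to pull out the top predictions. `top_predictions` is a hypothetical helper, and the sample scores and token IDs are invented for illustration; they are not real output from this model.

```python
def top_predictions(results, k=3):
    """Return (token, score) pairs for the k highest-scoring fills."""
    ordered = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"].strip(), round(r["score"], 3)) for r in ordered[:k]]


# Invented sample in the shape the pipeline returns; not real model output.
sample_results = [
    {"score": 0.12, "token": 11, "token_str": " a",
     "sequence": "def add_numbers(a, b): return a + a"},
    {"score": 0.81, "token": 12, "token_str": " b",
     "sequence": "def add_numbers(a, b): return a + b"},
    {"score": 0.03, "token": 13, "token_str": " 1",
     "sequence": "def add_numbers(a, b): return a + 1"},
]
print(top_predictions(sample_results, k=2))  # [('b', 0.81), ('a', 0.12)]
```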

Obtain Code Embeddings

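import torch

def get_embedding(text, model, tokenizer, device=None):
    # Fall back to CPU when no GPU is available.
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    model = model.to(device).eval()
    # max_length can be raised up to 2048, the model's maximum sequence length.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=256)
    # Drop token_type_ids if the tokenizer produces them.
    inputs.pop("token_type_ids", None)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        # model.model is the base encoder underneath the masked-LM head.
        outputs = model.model(**inputs)
    # Use the hidden state of the first token as the embedding.
    return outputs.last_hidden_state[:, 0, :]

embedding = get_embedding("def my_function(): pass", model, tokenizer)
print(embedding.shape)  # torch.Size([1, 768]), since hidden_size is 768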

Dataset

This model has been trained using the code-search-net/code_search_net dataset.
The dataset contains code snippets from multiple programming languages (Python, Java, JavaScript, etc.), making it well-suited for code search tasks.

Evaluation Results

The model was evaluated using the code_x_glue_ct_code_to_text dataset, specifically the Python subset.
Key evaluation metrics are shown below.
For further details, refer to this link.

Metric	Score
MRR (Mean Reciprocal Rank)	0.8172
MAP (Mean Average Precision)	0.8172
R-Precision	0.7501
Recall@10	0.9389
Precision@10	0.8143
NDCG@10	0.8445
F1@10	0.8423

Comparison with Other Models

Below is a comparison of CodeMorph-ModernBERT with other leading code search models.

Model	MRR	MAP	R-Precision
CodeMorph-ModernBERT	0.8172	0.8172	0.7501
microsoft/graphcodebert-base	0.5482	0.5482	0.4458
microsoft/codebert-base-mlm	0.5243	0.5243	0.4378
Salesforce/codet5p-220m-py	0.7512	0.7512	0.6617
Salesforce/codet5-large-ntp-py	0.7846	0.7846	0.7067
Shuu12121/CodeMorph-BERT	0.6851	0.6851	0.5934
Shuu12121/CodeMorph-BERTv2	0.6535	0.6535	0.5543

Code Search Model Evaluation Results (google/code_x_glue_tc_nl_code_search_adv Dataset Test)

The following table summarizes the evaluation results of various code search models using the google/code_x_glue_tc_nl_code_search_adv dataset (Test).
The candidate pool size for all evaluations was set to 100.
For additional experiment details, see this link.

Model	MRR	MAP	R-Precision
Shuu12121/CodeMorph-ModernBERT	0.6107	0.6107	0.5038
Salesforce/codet5p-220m-py	0.5037	0.5037	0.3805
Salesforce/codet5-large-ntp-py	0.4872	0.4872	0.3658
microsoft/graphcodebert-base	0.3844	0.3844	0.2764
microsoft/codebert-base-mlm	0.3766	0.3766	0.2683
Shuu12121/CodeMorph-BERTv2	0.3142	0.3142	0.2166
Shuu12121/CodeMorph-BERT	0.2978	0.2978	0.1992

CodeMorph-ModernBERT achieves higher search accuracy than the CodeBERT and CodeT5 models compared here.
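
To make the candidate pool size of 100 concrete, here is a minimal sketch of one common way to run a pooled retrieval evaluation: each query is scored against its correct snippet plus 99 distractors, and the rank of the correct snippet feeds into MRR. The function names and the toy scorer are hypothetical, and this is not necessarily the protocol the author used; the linked experiment code is the authority on that.

```python
import random


def rank_of_target(query_idx, similarity, num_items, pool_size, rng):
    """1-based rank of the correct snippet among pool_size candidates."""
    others = [j for j in range(num_items) if j != query_idx]
    distractors = rng.sample(others, pool_size - 1)
    target_score = similarity(query_idx, query_idx)
    # Rank is one plus the number of distractors that score strictly higher.
    return 1 + sum(similarity(query_idx, j) > target_score for j in distractors)


def pooled_mrr(similarity, num_items, pool_size=100, seed=0):
    rng = random.Random(seed)
    ranks = [
        rank_of_target(i, similarity, num_items, pool_size, rng)
        for i in range(num_items)
    ]
    return sum(1.0 / r for r in ranks) / len(ranks)


def toy_similarity(query_idx, code_idx):
    # Hypothetical scorer: the "next" snippet always beats the correct one.
    if code_idx == (query_idx + 1) % 100:
        return 1.0
    return 0.5 if code_idx == query_idx else 0.0


# With 100 items and a pool of 100, every other snippet is a distractor,
# so the correct snippet is always ranked second.
print(pooled_mrr(toy_similarity, num_items=100))  # 0.5
```

In a real run, `similarity` would compare a query embedding with a code embedding, for example by cosine similarity.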

Evaluation Results Across Multiple Languages

CodeMorph-ModernBERT demonstrates high code search performance across multiple programming languages.
The table below summarizes key evaluation metrics (MRR, MAP, R-Precision) for each language.
(Evaluations were conducted using a sample of 1,000 data points. See this notebook for details.)

Language	MRR	MAP	R-Precision
Python	0.8098	0.8098	0.7520
Java	0.6437	0.6437	0.5480
JavaScript	0.5928	0.5928	0.4880
PHP	0.7512	0.7512	0.6710
Ruby	0.7188	0.7188	0.6310
Go	0.5358	0.5358	0.4320

Salesforce/codet5p-220m-bimodal, by contrast, scores higher than CodeMorph-ModernBERT in every language, as shown below.

Language	MRR	MAP	R-Precision
Python	0.8322	0.8322	0.7660
Java	0.8886	0.8886	0.8390
JavaScript	0.7611	0.7611	0.6710
PHP	0.8985	0.8985	0.8530
Ruby	0.7635	0.7635	0.6740
Go	0.8127	0.8127	0.7260

However, on a different dataset, the test split of google/code_x_glue_tc_nl_code_search_adv, CodeMorph-ModernBERT scores higher, as shown below.
This suggests that CodeMorph-ModernBERT may have an advantage on harder tasks and in generalization on Python.

Model	MRR	MAP	R-Precision
Shuu12121/CodeMorph-ModernBERT	0.6107	0.6107	0.5038
Salesforce/codet5p-220m-bimodal	0.5326	0.5326	0.4208

License

This model is released under the Apache-2.0 license.

Contact Information

If you have any questions about this model, please contact us at the following email address: [email protected]
