Long Bert Chinese



Long Bert: a long-text similarity model supporting input lengths of up to 8192 tokens. It is based on bert-base-chinese, with the original BERT position embeddings replaced by ALiBi position encoding, which allows BERT to handle sequence lengths of 8192.
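
For readers unfamiliar with ALiBi (Attention with Linear Biases): instead of learned position embeddings, a head-specific linear penalty on token distance is added directly to the attention scores, and because that penalty is defined for any distance, the model can extrapolate well past the 512 positions BERT was pretrained on. Below is a minimal sketch of the idea, using symmetric distances as is common for bidirectional encoders; the repository's custom modeling code may differ in detail.

import torch

def alibi_bias(num_heads: int, seq_len: int) -> torch.Tensor:
    # Standard ALiBi slopes for a power-of-two head count:
    # 2^(-8/num_heads), 2^(-16/num_heads), ..., one slope per head.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])
    # Symmetric token-distance matrix |i - j| (bidirectional encoder variant).
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs().float()
    # bias[h, i, j] = -slope_h * |i - j|, added to the attention logits
    # (q @ k.T / sqrt(d)) before the softmax in every layer.
    return -slopes[:, None, None] * dist[None, :, :]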

News

  • Added support for CoSENT fine-tuning
  • GitHub repository: github

Usage

from numpy.linalg import norm
from transformers import AutoModel

model_path = "OctopusMind/longbert-embedding-8k-zh"
# trust_remote_code is required because the model ships custom ALiBi modeling code
model = AutoModel.from_pretrained(model_path, trust_remote_code=True)

sentences = ['我是问蚂蚁借呗为什么不能提前结清欠款', "为什么借呗不能选择提前还款"]
embeddings = model.encode(sentences)

# cosine similarity between the two sentence embeddings
cos_sim = lambda a, b: (a @ b.T) / (norm(a) * norm(b))
print(cos_sim(embeddings[0], embeddings[1]))
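
Because the context window is 8192 tokens, documents far beyond vanilla BERT's 512-token limit can be embedded in a single pass. A quick sanity check along these lines, assuming encode handles arbitrary-length input as in the snippet above (the repeated sentence is only a placeholder for a real long document):

# Embed a document much longer than BERT's usual 512-token limit.
long_text = "为什么借呗不能选择提前还款。" * 400  # placeholder long document
long_embedding = model.encode([long_text])[0]
print(long_embedding.shape)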

Fine-tuning

Data format

[
    {
        "sentence1": "一个男人在吹一支大笛子。",
        "sentence2": "一个人在吹长笛。",
        "label": 3
    },
    {
        "sentence1": "三个人在下棋。",
        "sentence2": "两个人在下棋。",
        "label": 2
    },
    {
        "sentence1": "一个女人在写作。",
        "sentence2": "一个女人在游泳。",
        "label": 0
    }
]
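
Each entry is a sentence pair with a graded similarity label, higher meaning more similar (STS-style annotation; the excerpt above uses 0–3). A quick sketch for loading and inspecting the file, assuming the ../data/train_data.json path used by the training command below:

import json

# Load the sentence-pair data in the format shown above.
with open("../data/train_data.json", encoding="utf-8") as f:
    pairs = json.load(f)

for ex in pairs[:3]:
    print(ex["label"], ex["sentence1"], "<->", ex["sentence2"])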

CoSENT fine-tuning

Change into the train/ directory:

cd train/

Run CoSENT fine-tuning:

python cosent_finetune.py \
        --data_dir ../data/train_data.json \
        --output_dir ./outputs/my-model \
        --max_seq_length 1024 \
        --num_epochs 10 \
        --batch_size 64 \
        --learning_rate 2e-5
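
For reference, CoSENT is a ranking objective over cosine similarities: for every two pairs whose gold labels differ, it penalizes the case where the pair with the lower label receives the higher cosine score, via loss = log(1 + Σ exp(scale · (cos_i − cos_j))) over all (i, j) with label_i < label_j. A minimal PyTorch sketch of this loss follows; the function name and scale default are illustrative, and cosent_finetune.py may implement it differently.

import torch

def cosent_loss(cos_sims: torch.Tensor, labels: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    # cos_sims: cosine similarity of each sentence pair, shape (batch,)
    # labels:   graded gold similarity of each pair, shape (batch,)
    sims = cos_sims * scale
    # diff[i, j] = sims[i] - sims[j]; a positive value is a ranking error
    # whenever label_i < label_j.
    diff = sims[:, None] - sims[None, :]
    # Mask out pairs where label_i >= label_j so they contribute exp(-inf) = 0.
    mask = labels[:, None] >= labels[None, :]
    diff = diff.masked_fill(mask, -1e12)
    # Prepend 0 so logsumexp computes log(1 + sum(exp(diff))).
    diff = torch.cat([diff.new_zeros(1), diff.flatten()])
    return torch.logsumexp(diff, dim=0)

# Toy check with the three graded pairs from the data-format example above.
print(cosent_loss(torch.tensor([0.9, 0.6, 0.1]), torch.tensor([3, 2, 0])))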

Contributing

Contributions to this module are welcome: submit a pull request or open an issue in the repository.

License

This project is licensed under the Apache-2.0 License.
