Base BERT for Semantic Text Similarity (STS) on CPU

A base BERT model for computing compact sentence embeddings for Russian. The model is based on cointegrated/rubert-tiny2: it has the same context size (2048) and embedding size (312), while the number of layers is increased from 3 to 7.

Using the model with the transformers library:

# pip install transformers sentencepiece
import torch
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("sergeyzh/rubert-mini-sts")
model = AutoModel.from_pretrained("sergeyzh/rubert-mini-sts")
# model.cuda()  # uncomment it if you have a GPU

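def embed_bert_cls(text, model, tokenizer):
    # Tokenize the input; long texts are truncated to the model's maximum length
    t = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        # Move the input tensors to the same device as the model
        model_output = model(**{k: v.to(model.device) for k, v in t.items()})
    # Use the hidden state of the first ([CLS]) token as the sentence embedding
    embeddings = model_output.last_hidden_state[:, 0, :]
    # L2-normalize so that a dot product of two embeddings equals their cosine similarity
    embeddings = torch.nn.functional.normalize(embeddings)
    return embeddings[0].cpu().numpy()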

print(embed_bert_cls('привет мир', model, tokenizer).shape)
# (312,)
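
Because the function above L2-normalizes the [CLS] vector, comparing two embeddings with a plain dot product gives their cosine similarity. The following is a minimal, dependency-free sketch of that property. It uses toy 3-dimensional vectors rather than real model output, and the helper names `l2_normalize`, `dot`, and `cosine_similarity` are hypothetical, introduced only for this illustration.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (the idea behind torch.nn.functional.normalize)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)  # leave an all-zero vector unchanged
    return [x / norm for x in vec]

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine similarity computed directly from unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Toy vectors standing in for 312-dimensional embeddings (assumption: illustration only).
u = [3.0, 4.0, 0.0]
v = [4.0, 3.0, 0.0]

u_n, v_n = l2_normalize(u), l2_normalize(v)
print(dot(u_n, v_n))            # dot product of the normalized vectors
print(cosine_similarity(u, v))  # cosine similarity of the raw vectors
```

With real embeddings, the same comparison would presumably be `dot(embed_bert_cls(a, model, tokenizer), embed_bert_cls(b, model, tokenizer))`, since the returned vectors are already normalized.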

Using the model with sentence_transformers:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sergeyzh/rubert-mini-sts')

sentences = ["привет мир", "hello world", "здравствуй вселенная"]
embeddings = model.encode(sentences)
print(util.dot_score(embeddings, embeddings))
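
`util.dot_score` returns a square matrix of pairwise scores, one row per sentence. As a rough sketch of how such a matrix might be read, the snippet below picks the closest other sentence for each row. The score values are made up for illustration and are not real model output, and `most_similar` is a hypothetical helper, not part of sentence_transformers.

```python
def most_similar(scores, sentences):
    """For each sentence, return (sentence, closest other sentence, score), skipping the diagonal."""
    result = []
    for i, row in enumerate(scores):
        # Pair every score with its column index, excluding the sentence itself
        candidates = [(score, j) for j, score in enumerate(row) if j != i]
        best_score, best_j = max(candidates)
        result.append((sentences[i], sentences[best_j], best_score))
    return result

sentences = ["привет мир", "hello world", "здравствуй вселенная"]

# Made-up scores for illustration only (assumption); with the real model one could
# pass util.dot_score(embeddings, embeddings).tolist() instead.
scores = [
    [1.00, 0.80, 0.70],
    [0.80, 1.00, 0.60],
    [0.70, 0.60, 1.00],
]

for sentence, match, score in most_similar(scores, sentences):
    print(f"{sentence!r} -> {match!r} ({score:.2f})")
```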

Metrics

Model scores on the encodechka benchmark:

Model                             STS    PI     NLI    SA     TI
intfloat/multilingual-e5-large    0.862  0.727  0.473  0.810  0.979
sergeyzh/LaBSE-ru-sts             0.845  0.737  0.481  0.805  0.957
sergeyzh/rubert-mini-sts          0.815  0.723  0.477  0.791  0.949
sergeyzh/rubert-tiny-sts          0.797  0.702  0.453  0.778  0.946
Tochka-AI/ruRoPEBert-e5-base-512  0.793  0.704  0.457  0.803  0.970
cointegrated/LaBSE-en-ru          0.794  0.659  0.431  0.761  0.946
cointegrated/rubert-tiny2         0.750  0.651  0.417  0.737  0.937

Tasks:

  • Semantic text similarity (STS);
  • Paraphrase identification (PI);
  • Natural language inference (NLI);
  • Sentiment analysis (SA);
  • Toxicity identification (TI).

Performance and sizes

On the encodechka benchmark (CPU and GPU are inference times per sentence in milliseconds; size is the model size in MB):

Model                             CPU      GPU     size  dim   n_ctx  n_vocab
intfloat/multilingual-e5-large    149.026  15.629  2136  1024  514    250002
sergeyzh/LaBSE-ru-sts             42.835   8.561   490   768   512    55083
sergeyzh/rubert-mini-sts          6.417    5.517   123   312   2048   83828
sergeyzh/rubert-tiny-sts          3.208    3.379   111   312   2048   83828
Tochka-AI/ruRoPEBert-e5-base-512  43.314   9.338   532   768   512    69382
cointegrated/LaBSE-en-ru          42.867   8.549   490   768   512    55083
cointegrated/rubert-tiny2         3.212    3.384   111   312   2048   83828
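
To make the speed and quality trade-off easier to read, the CPU timings can be turned into ratios relative to sergeyzh/rubert-mini-sts. The sketch below only does arithmetic on numbers copied from the two tables in this card (a subset of the rows). The helper name `cpu_time_ratio` is hypothetical, and the ratios are unit-free, so they do not depend on how the timings were measured.

```python
# CPU timings and STS scores copied from the tables in this card (subset of rows).
cpu_time = {
    "intfloat/multilingual-e5-large": 149.026,
    "sergeyzh/LaBSE-ru-sts": 42.835,
    "sergeyzh/rubert-mini-sts": 6.417,
    "cointegrated/rubert-tiny2": 3.212,
}
sts = {
    "intfloat/multilingual-e5-large": 0.862,
    "sergeyzh/LaBSE-ru-sts": 0.845,
    "sergeyzh/rubert-mini-sts": 0.815,
    "cointegrated/rubert-tiny2": 0.750,
}

def cpu_time_ratio(other, target):
    """How many times more CPU time `other` needs than `target` (below 1 means `other` is faster)."""
    return cpu_time[other] / cpu_time[target]

target = "sergeyzh/rubert-mini-sts"
for name in cpu_time:
    if name == target:
        continue
    ratio = cpu_time_ratio(name, target)
    sts_diff = sts[name] - sts[target]
    print(f"{name}: {ratio:.2f}x the CPU time, STS difference {sts_diff:+.3f}")
```

Read this way, the larger models score higher on STS but need several times more CPU time, while cointegrated/rubert-tiny2 takes about half the CPU time and scores lower.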

Model size: 32.4M params, tensor type F32 (Safetensors).