---
pipeline_tag: sentence-similarity
tags:
  - sentence-transformers
  - feature-extraction
  - sentence-similarity
  - transformers
---

albert-small-kor-sbert-v1

This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for tasks like clustering or semantic search.

Usage (Sentence-Transformers)

Using this model is straightforward once you have sentence-transformers installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer
sentences = ["This is an example sentence", "Each sentence is converted"]

model = SentenceTransformer('bongsoo/albert-small-kor-sbert-v1')
embeddings = model.encode(sentences)
print(embeddings)
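Beyond printing raw embeddings, a typical use is scoring semantic similarity between sentences. The following is a minimal sketch, not part of the original card: the Korean sentences are illustrative, and util.cos_sim is the standard sentence-transformers cosine-similarity helper.

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('bongsoo/albert-small-kor-sbert-v1')

# Illustrative Korean sentences (placeholders, not from the original card)
sentences = ['오늘 날씨가 정말 좋다', '날씨가 참 화창하다', '주가가 급락했다']
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarities: entries near 1 indicate similar meaning
scores = util.cos_sim(embeddings, embeddings)
print(scores)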

Usage (HuggingFace Transformers)

Without sentence-transformers, you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch


def cls_pooling(model_output, attention_mask):
    # CLS pooling: use the embedding of the first token ([CLS]) as the sentence
    # embedding; attention_mask is unused here but kept for interface parity
    return model_output[0][:, 0]


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('bongsoo/albert-small-kor-sbert-v1')
model = AutoModel.from_pretrained('bongsoo/albert-small-kor-sbert-v1')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, cls pooling.
sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
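To turn these raw embeddings into similarity scores, you can L2-normalize them and take dot products; a minimal sketch continuing from the snippet above:

import torch.nn.functional as F

# Cosine similarity between the example sentences (continues from the code above)
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
print(normalized @ normalized.T)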

Evaluation Results

  • ์„ฑ๋Šฅ ์ธก์ •์„ ์œ„ํ•œ ๋ง๋ญ‰์น˜๋Š”, ์•„๋ž˜ ํ•œ๊ตญ์–ด (kor), ์˜์–ด(en) ํ‰๊ฐ€ ๋ง๋ญ‰์น˜๋ฅผ ์ด์šฉํ•จ
    ํ•œ๊ตญ์–ด : korsts(1,379์Œ๋ฌธ์žฅ) ์™€ klue-sts(519์Œ๋ฌธ์žฅ)
    ์˜์–ด : stsb_multi_mt(1,376์Œ๋ฌธ์žฅ) ์™€ glue:stsb (1,500์Œ๋ฌธ์žฅ)
  • ์„ฑ๋Šฅ ์ง€ํ‘œ๋Š” cosin.spearman
  • ํ‰๊ฐ€ ์ธก์ • ์ฝ”๋“œ๋Š” ์—ฌ๊ธฐ ์ฐธ์กฐ
  • ๋ชจ๋ธ korsts klue-sts glue(stsb) stsb_multi_mt(en)
    distiluse-base-multilingual-cased-v2 0.7475 0.7855 0.8193 0.8075
    paraphrase-multilingual-mpnet-base-v2 0.8201 0.7993 0.8907 0.8682
    bongsoo/moco-sentencedistilbertV2.1 0.8390 0.8767 0.8805 0.8548
    bongsoo/albert-small-kor-sbert-v1 0.8305 0.8588 0.8419 0.7965
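As noted above, the reported numbers are Spearman rank correlations between each model's cosine similarities and the gold labels. A minimal sketch of that computation, with hypothetical placeholder pairs and labels (the author's actual evaluation code is referenced above):

from scipy.stats import spearmanr
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('bongsoo/albert-small-kor-sbert-v1')

# Hypothetical (sentence1, sentence2, gold similarity) triples, STS-B style (0-5 scale)
pairs = [('문장 A', '문장 B', 4.2), ('문장 C', '문장 D', 1.0), ('문장 E', '문장 F', 3.1)]
s1, s2, gold = zip(*pairs)

emb1 = model.encode(list(s1), convert_to_tensor=True)
emb2 = model.encode(list(s2), convert_to_tensor=True)
cos_scores = util.cos_sim(emb1, emb2).diagonal().cpu().numpy()

# Spearman rank correlation between predicted and gold similarities
print(spearmanr(cos_scores, gold).correlation)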

For an automated evaluation of this model, see the Sentence Embeddings Benchmark: https://seb.sbert.net

Training

  • sts(10)-distil(10)-nli(3)-sts(10) (training proceeded in these four stages in order; the number in parentheses is the epoch count for each stage)

The model was trained with the following parameters:

๊ณตํ†ต

  • do_lower_case=1, correct_bios=0, polling_mode=cls

1. STS

  • Corpus: korsts (5,749) + kluestsV1.1 (11,668) + stsb_multi_mt (5,749) + mteb/sickr-sts (9,927) + glue stsb (5,749) (total: 38,842)
  • Params: lr: 1e-4, eps: 1e-6, warm_step=10%, epochs: 10, train_batch: 32, eval_batch: 64, max_token_len: 72
  • See here for the training code; a training sketch follows this list.

2. Distillation

  • Teacher model: paraphrase-multilingual-mpnet-base-v2 (max_token_len: 128)
  • Corpus: news_talk_en_ko_train.tsv (English-Korean dialogue/news parallel corpus: 1.38M pairs)
  • Params: lr: 5e-5, epochs: 10, train_batch: 128, eval/test_batch: 64, max_token_len: 128 (matched to the teacher model's 128)
  • See here for the training code; a distillation sketch follows this list.

3. NLI

  • Corpus: training (967,852): kornli (550,152), kluenli (24,998), glue-mnli (392,702) / evaluation (3,519): korsts (1,500), kluests (519), gluests (1,500)
  • Hyperparameters: lr: 3e-5, eps: 1e-8, warm_step=10%, epochs: 3, train/eval_batch: 64, max_token_len: 128
  • See here for the training code; an NLI training sketch follows this list.

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': True}) with Transformer model: AlbertModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
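Equivalently, the architecture printed above can be assembled by hand from sentence-transformers modules; a sketch that loads the published checkpoint as the transformer weights:

from sentence_transformers import SentenceTransformer, models

# Transformer encoder with the settings shown above (max_seq_length 256, lowercasing)
word_embedding_model = models.Transformer(
    'bongsoo/albert-small-kor-sbert-v1', max_seq_length=256, do_lower_case=True
)
# CLS-token pooling over the 768-dim contextual embeddings
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode='cls',
)
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])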

Citing & Authors

bongsoo