language:
- ja
- en
license: mit
tags:
- sentence-transformers
- sentence-similarity
- loss:MatryoshkaLoss
- loss:MultipleNegativesRankingLoss
datasets:
- hotchpotch/sentence_transformer_japanese
- sentence-transformers/msmarco-co-condenser-margin-mse-sym-mnrl-mean-v1
- sentence-transformers/squad
- sentence-transformers/all-nli
- sentence-transformers/trivia-qa
- nthakur/swim-ir-monolingual
- sentence-transformers/miracl
- sentence-transformers/mr-tydi
library_name: sentence-transformers
The text below is reposted from the article "Releasing a Japanese StaticEmbedding model that builds practical sentence vectors 100x faster" (「100倍速で実用的な文章ベクトルを作れる、日本語 StaticEmbedding モデルを公開」).
static-embedding-japanese
Dense sentence vectors are useful for many tasks, including information retrieval, text classification, and similar-sentence extraction. However, state-of-the-art Transformer models, even small ones, are often too slow to be practical, especially on CPU.
A new approach to this problem is the recently released StaticEmbedding model, which is "not" a Transformer. In a benchmark comparison against intfloat/multilingual-e5-small (hereafter mE5-small), it reaches 85% of the score, which is adequate for many uses, and above all it produces sentence vectors 126x faster when running on CPU.
So I promptly trained a model on Japanese (and English) data and released it as static-embedding-japanese.
The JMTEB results, which evaluate the quality of Japanese sentence vectors, are shown below. The overall score falls slightly short of mE5-small, but the model wins on some tasks and sometimes scores higher than other Japanese base-size BERT models, so it looks usable in practice. I was skeptical that it would perform this well until I actually trained it, and the result surprised me.
Model | Avg(micro) | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
---|---|---|---|---|---|---|---|
text-embedding-3-small | 69.18 | 66.39 | 79.46 | 73.06 | 92.92 | 51.06 | 62.27 |
multilingual-e5-small | 67.71 | 67.27 | 80.07 | 67.62 | 93.03 | 46.91 | 62.19 |
static-embedding-japanese | 67.17 | 67.92 | 80.16 | 67.96 | 91.87 | 40.39 | 62.37 |
Technical details, such as how the Japanese StaticEmbedding model was trained, are covered in the second half of this article for those interested.
Usage
Usage is simple: create sentence vectors the usual way with SentenceTransformer. This time, let's run on CPU rather than GPU. The examples were tested with SentenceTransformer 3.3.1.
pip install "sentence-transformers>=3.3.1"
from sentence_transformers import SentenceTransformer
model_name = "hotchpotch/static-embedding-japanese"
model = SentenceTransformer(model_name, device="cpu")
# Query: "I want to go to a good ramen shop"
query = "美味しいラーメン屋に行きたい"
docs = [
    # A lovely cafe nearby with a calm atmosphere and a view of the park
    "素敵なカフェが近所にあるよ。落ち着いた雰囲気でゆっくりできるし、窓際の席からは公園の景色も見えるんだ。",
    # A seafood restaurant that buys directly from local fishermen
    "新鮮な魚介を提供する店です。地元の漁師から直接仕入れているので鮮度は抜群ですし、料理人の腕も確かです。",
    # A hard-to-reach hidden gem for tonkotsu, with great soup and noodles
    "あそこは行きにくいけど、隠れた豚骨の名店だよ。スープが最高だし、麺の硬さも好み。",
    # A recommended chuka soba shop with handmade, tender, juicy chashu
    "おすすめの中華そばの店を教えてあげる。とりわけチャーシューが手作りで柔らかくてジューシーなんだ。",
]
embeddings = model.encode([query] + docs)
print(embeddings.shape)
similarities = model.similarity(embeddings[0], embeddings[1:])
for i, similarity in enumerate(similarities[0].tolist()):
    print(f"{similarity:.04f}: {docs[i]}")
(5, 1024)
0.1040: 素敵なカフェが近所にあるよ。落ち着いた雰囲気でゆっくりできるし、窓際の席からは公園の景色も見えるんだ。
0.2521: 新鮮な魚介を提供する店です。地元の漁師から直接仕入れているので鮮度は抜群ですし、料理人の腕も確かです。
0.4835: あそこは行きにくいけど、隠れた豚骨の名店だよ。スープが最高だし、麺の硬さも好み。
0.3199: おすすめの中華そばの店を教えてあげる。とりわけチャーシューが手作りで柔らかくてジューシーなんだ。
As shown, the documents that match the query receive higher scores. In this example, BM25 would struggle to find a match because a direct keyword from the query, such as 「ラーメン」 (ramen), does not appear in any document.
Next is an example of a sentence-similarity task.
sentences = [
    # "It looks like it will rain from tomorrow afternoon."
    "明日の午後から雨が降るみたいです。",
    # "I hear the weather will be nice next Sunday."
    "来週の日曜日は天気が良いそうだ。",
    # "Looks like I'll need an umbrella from around midday tomorrow."
    "あしたの昼過ぎから傘が必要になりそう。",
    # "The forecast says the weekend will be sunny."
    "週末は晴れるという予報が出ています。",
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# Show the similarity between the first sentence and every sentence
for i, similarity in enumerate(similarities[0].tolist()):
    print(f"{similarity:.04f}: {sentences[i]}")
tensor([[1.0000, 0.2814, 0.3620, 0.2818],
        [0.2814, 1.0000, 0.2007, 0.5372],
        [0.3620, 0.2007, 1.0000, 0.1299],
        [0.2818, 0.5372, 0.1299, 1.0000]])
1.0000: 明日の午後から雨が降るみたいです。
0.2814: 来週の日曜日は天気が良いそうだ。
0.3620: あしたの昼過ぎから傘が必要になりそう。
0.2818: 週末は晴れるという予報が出ています。
Here too, similar sentences receive high scores.
Many readers have probably experienced how long it takes to create sentence vectors with a Transformer model on CPU, even for a small amount of text. With the StaticEmbedding model, it should finish in an instant on a reasonably fast CPU. That is the 100x speedup at work.
Reducing the output dimension
The default sentence vector has 1024 dimensions, but it can be truncated to fewer. For example, let's specify 128.
# truncate_dim can be one of 32, 64, 128, 256, 512, 1024
model = SentenceTransformer(model_name, device="cpu", truncate_dim=128)
# Same query and documents as in the first example
query = "美味しいラーメン屋に行きたい"
docs = [
    "素敵なカフェが近所にあるよ。落ち着いた雰囲気でゆっくりできるし、窓際の席からは公園の景色も見えるんだ。",
    "新鮮な魚介を提供する店です。地元の漁師から直接仕入れているので鮮度は抜群ですし、料理人の腕も確かです。",
    "あそこは行きにくいけど、隠れた豚骨の名店だよ。スープが最高だし、麺の硬さも好み。",
    "おすすめの中華そばの店を教えてあげる。とりわけチャーシューが手作りで柔らかくてジューシーなんだ。",
]
embeddings = model.encode([query] + docs)
print(embeddings.shape)
similarities = model.similarity(embeddings[0], embeddings[1:])
for i, similarity in enumerate(similarities[0].tolist()):
    print(f"{similarity:.04f}: {docs[i]}")
(5, 128)
0.1464: 素敵なカフェが近所にあるよ。落ち着いた雰囲気でゆっくりできるし、窓際の席からは公園の景色も見えるんだ。
0.3094: 新鮮な魚介を提供する店です。地元の漁師から直接仕入れているので鮮度は抜群ですし、料理人の腕も確かです。
0.5923: あそこは行きにくいけど、隠れた豚骨の名店だよ。スープが最高だし、麺の硬さも好み。
0.3405: おすすめの中華そばの店を教えてあげる。とりわけチャーシューが手作りで柔らかくてジューシーなんだ。
The vectors now have 128 dimensions, and the scores changed slightly. Quality degrades a little with fewer dimensions (benchmarks appear later in this article). On the other hand, going from 1024 to 128 dimensions shrinks the storage needed and makes similarity computation, for example at search time, roughly 8x faster, so smaller dimensions are often preferable depending on the use case.
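To make the storage and compute savings concrete, here is a rough back-of-the-envelope calculation. It is only a sketch: it assumes float32 vectors and a dot-product similarity whose cost grows linearly with the dimension, and the corpus size of one million documents is an arbitrary illustration, not a figure from this article.

```python
# Back-of-the-envelope cost comparison between 1024- and 128-dimensional vectors.
# Assumptions (not from the article): float32 storage, 1,000,000 documents.
BYTES_PER_FLOAT32 = 4
NUM_DOCS = 1_000_000  # hypothetical corpus size


def storage_bytes(dim: int, num_docs: int = NUM_DOCS) -> int:
    """Raw bytes needed to store num_docs vectors of the given dimension."""
    return dim * BYTES_PER_FLOAT32 * num_docs


def dot_product_multiplies(dim: int, num_docs: int = NUM_DOCS) -> int:
    """Multiplications needed to score one query against every document."""
    return dim * num_docs


full = storage_bytes(1024)
small = storage_bytes(128)
print(f"1024 dims: {full / 1e9:.3f} GB")
print(f" 128 dims: {small / 1e9:.3f} GB")
print(f"storage ratio: {full // small}x")
print(f"compute ratio: {dot_product_multiplies(1024) // dot_product_multiplies(128)}x")
```

Both ratios come out to 8 simply because 1024 / 128 = 8; real-world speedups also depend on memory layout and the index being used.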
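Why is CPU inference so fast?
StaticEmbedding is not a Transformer model. In other words, it performs none of the attention computation that characterizes Transformers ("Attention Is All You Need"). It stores a 1024-dimensional vector for each token in a lookup table and, when building a sentence vector, simply averages the vectors of the tokens in the sentence. Because there is no attention, the model does not capture context.
Internally, the implementation uses PyTorch's nn.EmbeddingBag, passing all tokens concatenated into one sequence together with offsets, which appears to let PyTorch's optimizations provide fast parallel CPU processing and memory access.
According to the speed evaluation in the original article, it is reportedly 126x faster than mE5-small on CPU.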
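The lookup-and-average step described above can be sketched in plain Python. This is a minimal illustration of what a mean-mode embedding bag computes from concatenated token IDs plus offsets, not the actual PyTorch or sentence-transformers code. The tiny table, the token IDs, and the function name `embedding_bag_mean` are made up for the example.

```python
# Pure-Python sketch of mean pooling over a token-embedding lookup table.
# The table values, token IDs, and 4-dimensional size are invented for
# illustration; the real model uses a much larger table with 1024 dimensions.
from typing import List

EMBEDDING_TABLE: List[List[float]] = [
    [0.0, 0.0, 0.0, 0.0],  # token 0
    [1.0, 0.0, 2.0, 0.0],  # token 1
    [0.0, 1.0, 0.0, 2.0],  # token 2
    [3.0, 3.0, 0.0, 0.0],  # token 3
]


def embedding_bag_mean(token_ids: List[int], offsets: List[int]) -> List[List[float]]:
    """Average the token vectors of each sentence.

    token_ids holds all sentences concatenated into one flat list;
    offsets[i] is the index in token_ids where sentence i starts.
    """
    dim = len(EMBEDDING_TABLE[0])
    results = []
    for i, start in enumerate(offsets):
        end = offsets[i + 1] if i + 1 < len(offsets) else len(token_ids)
        bag = token_ids[start:end]
        total = [0.0] * dim
        for token_id in bag:
            for d in range(dim):
                total[d] += EMBEDDING_TABLE[token_id][d]
        # An empty bag yields a zero vector instead of dividing by zero.
        count = max(len(bag), 1)
        results.append([x / count for x in total])
    return results


# Two sentences, [1, 2] and [3, 1, 1], flattened into one list.
flat_ids = [1, 2, 3, 1, 1]
offsets = [0, 2]
sentence_vectors = embedding_bag_mean(flat_ids, offsets)
print(sentence_vectors)
```

There is no matrix multiplication and no dependence between tokens, only table lookups and additions, which is why this kind of model is cheap to run on CPU.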
Evaluation results
All JMTEB evaluation results are recorded in this JSON file. Comparing against other models on the JMTEB Leaderboard shows the relative differences. Considering the model size, the overall JMTEB results are very good. Note that the JMTEB mr-tydi task vectorizes 7 million documents, so it takes considerable time (roughly 1 to 4 hours on an RTX4090, depending on the model). StaticEmbedding is very fast here too and finished in about 4 minutes on an RTX4090.
Can it replace BM25 for information retrieval?
Let's look at the Retrieval results among the JMTEB tasks. StaticEmbedding scores markedly worse on mr-tydi. mr-tydi has far more documents than the other tasks (7 million), so results may be poor on tasks that search a very large collection. Because the vector is a simple token average that ignores context, the more documents there are, the more of them end up with similar averages, which could explain this result.
So for very large collections, it may perform considerably worse than BM25. For smaller collections where exact keyword matches are rare, however, it will likely often do better than BM25.
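For readers unfamiliar with BM25, the following is a small sketch of Okapi BM25 scoring that shows why it depends on exact term overlap. It is an illustration under assumptions: the function name `bm25_scores`, the parameter values `k1=1.5` and `b=0.75` (common defaults), and the pre-tokenized English toy documents are all choices made for this example. Real Japanese text would first need a morphological tokenizer.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for d in docs:
        doc_freq.update(set(d))
    scores = []
    for d in docs:
        term_freq = Counter(d)
        score = 0.0
        for term in query:
            if term not in term_freq:
                continue  # no lexical overlap, no contribution
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            length_norm = k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * term_freq[term] * (k1 + 1.0) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


query = ["ramen", "shop"]
docs = [
    ["hidden", "tonkotsu", "noodle", "place"],  # relevant, but shares no query term
    ["ramen", "shop", "downtown"],              # shares both query terms
    ["cafe", "near", "park"],                   # unrelated
]
scores = bm25_scores(query, docs)
print(scores)
```

The first document is about ramen but scores zero because none of its tokens appear in the query, which is the situation where an embedding model can help. Conversely, noisy words in a document never add to a BM25 score unless they also occur in the query.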
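Incidentally, the jaqket retrieval score is unusually high compared with other models. The model was trained on JQaRA (dev, unused), which contains jaqket questions, but even so the score seems too high, and I cannot explain it. I do not believe test data leaked, though...
Poor clustering results
I have not investigated the details yet, but the clustering score is considerably worse than that of other models. This is puzzling because the classification results are not bad. Perhaps it is an effect of the embedding space being built with Matryoshka representation learning.
Reranking evaluation on JQaRA and JaCWIR
JQaRA results: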
model_names | ndcg@10 | mrr@10 |
---|---|---|
static-embedding-japanese | 0.4704 | 0.6814 |
bm25 | 0.458 | 0.702 |
multilingual-e5-small | 0.4917 | 0.7291 |
JaCWIR results:
model_names | map@10 | hits@10 |
---|---|---|
static-embedding-japanese | 0.7642 | 0.9266 |
bm25 | 0.8408 | 0.9528 |
multilingual-e5-small | 0.869 | 0.97 |
On JQaRA, the model scores slightly above BM25 on nDCG@10 (though below it on MRR@10) and slightly below mE5-small. On JaCWIR, it scores considerably below both BM25 and mE5-small.
In JaCWIR, the documents to be retrieved are titles and summaries of web pages, so many of them are not "clean" text. Transformer models are robust to noise, so it makes sense that StaticEmbedding, a simple token average, falls behind. BM25 matches documents that contain distinctive terms, and noisy words in a document simply do not match the query, so it achieves fairly good results that are competitive with Transformer models even on JaCWIR.
These results suggest that StaticEmbedding may score worse than Transformer models and BM25 on text that contains a lot of noise.
Reducing the output dimension
The output dimension of StaticEmbedding depends on how it is trained; the model created here uses 1024, which is fairly large. More dimensions mean higher compute cost in downstream tasks after inference, such as clustering and information retrieval. However, because the model was trained with Matryoshka Representation Learning (MRL), the 1024 dimensions can easily be reduced to fewer.
MRL trains the model so that the most important information sits in the leading dimensions. As a result, even with a 1024-dimensional vector, using only the first 32, 64, 128, 256, ... dimensions and discarding the rest still gives reasonably good results.
According to the original StaticEmbedding article (the source of the graph), the model retains 91.87% of its performance at 128 dimensions, 95.79% at 256, and 98.53% at 512. When accuracy is not critical but downstream compute cost matters, aggressively truncating the vectors looks like a practical option.
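The truncation described above amounts to keeping the leading components of a vector. The sketch below shows that step in plain Python, followed by re-normalization so that a dot product equals cosine similarity. The helper names and the toy 8-dimensional vectors are invented for illustration, and the re-normalization is a design choice of this sketch rather than a statement about what any library does internally.

```python
import math
from typing import List


def truncate_and_normalize(vec: List[float], dim: int) -> List[float]:
    """Keep the first `dim` components and rescale the result to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # leave an all-zero vector unchanged
    return [x / norm for x in head]


def cosine_of_unit_vectors(a: List[float], b: List[float]) -> float:
    """For unit-length inputs, the dot product is the cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


# Toy 8-dimensional "embeddings" standing in for 1024-dimensional ones.
vec_a = [0.9, 0.1, 0.3, 0.2, 0.05, 0.02, 0.01, 0.03]
vec_b = [0.8, 0.2, 0.4, 0.1, 0.01, 0.04, 0.02, 0.01]

for dim in (8, 4, 2):
    a = truncate_and_normalize(vec_a, dim)
    b = truncate_and_normalize(vec_b, dim)
    print(dim, round(cosine_of_unit_vectors(a, b), 4))
```

With an MRL-trained model, the similarity computed from the leading dimensions is intended to stay close to the full-dimension value, which is what makes this simple slicing usable.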
Dimension reduction results for the Japanese StaticEmbedding model
JMTEB lets you control model parameters at output time, so passing the truncate_dim option makes it easy to benchmark reduced-dimension results. That is very convenient. So I also benchmarked the Japanese StaticEmbedding model with reduced dimensions.
Dimensions | Avg(micro) | Score ratio (%) | Retrieval | STS | Classification | Reranking | Clustering | PairClassification |
---|---|---|---|---|---|---|---|---|
1024 | 67.17 | 100.00 | 67.92 | 80.16 | 67.96 | 91.87 | 40.39 | 62.37 |
512 | 66.57 | 99.10 | 67.63 | 80.11 | 65.66 | 91.54 | 41.25 | 62.37 |
256 | 65.94 | 98.17 | 66.99 | 79.93 | 63.53 | 91.73 | 42.55 | 62.37 |
128 | 64.25 | 95.65 | 64.87 | 79.56 | 60.52 | 91.62 | 41.81 | 62.33 |
64 | 61.79 | 91.98 | 61.15 | 78.34 | 58.23 | 91.50 | 39.11 | 62.35 |
32 | 57.93 | 86.24 | 53.35 | 76.51 | 55.95 | 91.15 | 38.20 | 62.37 |
An earlier version of this section reported that reducing to 512 dimensions inexplicably hurt Retrieval, Classification, and Reranking more than reducing to 256 did. The 512-dimension measurement was wrong and has been corrected. With the corrected numbers, Matryoshka representation learning works as intended: scores drop slightly as dimensions are removed, while the smaller vectors keep downstream costs down.
On the clustering task, even 128 dimensions score higher than 1024. One would expect better scores when no information is cut away, yet clustering alone improves with truncation, which is an interesting result. In Matryoshka representation learning, the leading dimensions capture the overall characteristics, so for clustering (depending on the clustering algorithm) using only the distinctive leading dimensions and dropping the trailing ones may give better results.
So when reducing dimensions with static-embedding-japanese, 512, 256, or 128 dimensions seem to offer a good balance between quality and size.
Thoughts after building a StaticEmbedding model
Honestly, I was skeptical that a simple average of token embeddings could perform this well, but after training it myself I was surprised by how strong such a simple architecture is. In this age of Transformers, it is remarkable to see a model based on good old word embeddings that looks usable in the real world.
A sentence-vector model with fast CPU inference is useful not only for converting large volumes of text in a local CPU environment but also on edge devices and in environments with slow networks (where a remote inference server cannot be called).
Technical notes on training the Japanese StaticEmbedding model
Why does training work so well?
StaticEmbedding is very simple: it tokenizes the sentence, uses the token IDs to look up N-dimensional vectors (1024 here) in an EmbeddingBag table that stores the word embeddings, and takes their mean.
Traditional word embeddings, such as word2vec (Skip-gram, CBOW) and GloVe, learn from the local context around each word. StaticEmbedding, in contrast, is trained on whole sentences. It also uses contrastive learning over a large and varied set of sentences with huge batches, which succeeds in learning good word embeddings.
Contrastive learning essentially treats everything other than the positive as a negative. With a batch size of 2048, for example, each of the 2048 anchors has 1 positive and 2047 negatives, so a single batch trains on 2048 x 2047, roughly 4 million, comparisons. This lets training proceed while making appropriate weight updates to the original word embedding space.
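The in-batch negatives idea described above can be sketched as a small loss function. This is an illustrative pure-Python version of a contrastive loss in which each anchor's own positive is the correct class and every other positive in the batch serves as a negative. The function names, the toy 2-dimensional vectors, and the similarity scale of 20 are assumptions for this example, not the training code used for this model.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def in_batch_negatives_loss(anchors: List[List[float]],
                            positives: List[List[float]],
                            scale: float = 20.0) -> float:
    """Mean cross-entropy where positives[i] is the only correct match for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


def num_negative_comparisons(batch_size: int) -> int:
    """Each of the batch_size anchors is contrasted with batch_size - 1 negatives."""
    return batch_size * (batch_size - 1)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each positive matches its anchor
swapped = [[0.0, 1.0], [1.0, 0.0]]   # positives paired with the wrong anchors
print(in_batch_negatives_loss(anchors, aligned))
print(in_batch_negatives_loss(anchors, swapped))
print(num_negative_comparisons(2048))
```

The loss is near zero when every anchor is closest to its own positive and large when the pairs are mismatched, and the number of negative comparisons grows quadratically with batch size, which is why very large batches give the model so much signal per step.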
Training datasets
For training the Japanese model, I created and used the following dataset, which is suitable for contrastive learning.
- hotchpotch/sentence_transformer_japanese
- The data is arranged with column names and structures that are easy to train with SentenceTransformer: (anchor, positive), (anchor, positive, negative), and (anchor, positive, negative_1, ..., negative_n).
- hotchpotch/sentence_transformer_japanese was built from the datasets below. As always, many thanks to the dataset authors, and especially to hpprc.
- https://huggingface.co./datasets/hpprc/emb
- Filtered using the reranker scores from https://huggingface.co./datasets/hotchpotch/hpprc_emb-scores: positive (>=0.7) / negative (<=0.3).
- https://huggingface.co./datasets/hpprc/llmjp-kaken
- https://huggingface.co./datasets/hpprc/msmarco-ja
- Filtered using the reranker scores from https://huggingface.co./datasets/hotchpotch/msmarco-ja-hard-negatives: positive (>=0.7) / negative (<=0.3).
- https://huggingface.co./datasets/hpprc/mqa-ja
- https://huggingface.co./datasets/hpprc/llmjp-warp-html
- Of the datasets created above, the following subsets were used. Because I wanted to strengthen information retrieval, datasets suited to retrieval were augmented so that more of their examples were seen during training.
- httprc_auto-wiki-nli-triplet
- httprc_auto-wiki-qa
- httprc_auto-wiki-qa-nemotron
- httprc_auto-wiki-qa-pair
- httprc_baobab-wiki-retrieval
- httprc_janli-triplet
- httprc_jaquad
- httprc_jqara
- httprc_jsnli-triplet
- httprc_jsquad
- httprc_miracl
- httprc_mkqa
- httprc_mkqa-triplet
- httprc_mr-tydi
- httprc_nu-mnli-triplet
- httprc_nu-snli-triplet
- httprc_quiz-no-mori
- httprc_quiz-works
- httprc_snow-triplet
- httprc_llmjp-kaken
- httprc_llmjp_warp_html
- httprc_mqa_ja
- httprc_msmarco_ja
- For English, the following datasets were used.
Japanese tokenizer
Training StaticEmbedding seemed easiest with a tokenizer that can be handled in the tokenizer.json format of the HuggingFace tokenizers library, so I created one: hotchpotch/xlm-roberta-japanese-tokenizer. The vocabulary size is 32,768.
This tokenizer was trained with sentencepiece unigram on Japanese Wikipedia text segmented with unidic. (Correction: this text originally also listed sampled English Wikipedia and sampled Japanese cc-100, but a check of the creation code showed that only Japanese Wikipedia was used.) It also works as an XLM-Roberta-format Japanese tokenizer. This is the tokenizer used here.
Hyperparameters
Changes from the upstream training code, and some notes, are as follows.
- batch_size: changed from the upstream 2048 to 6072.
- When contrastive learning uses huge batches, training can suffer if a batch contains duplicate texts, so that the same text acts as both a positive and a negative. The BatchSamplers.NO_DUPLICATES option prevents this. With a huge batch size, however, the sampling needed to keep duplicates out of a batch can take a long time.
- Here I specified BatchSamplers.NO_DUPLICATES and set the batch size to 6072, which fits in the RTX4090's 24GB. An even larger batch size might give better results.
- Number of epochs: changed from 1 to 2.
- 2 gave better results than 1. With a much larger dataset, however, 1 might be better.
- Scheduler
- Changed from the default linear to cosine, which in my experience tends to work better.
- Optimizer
- Kept the default AdamW. Switching to adafactor made convergence worse.
- learning_rate
- Kept at 2e-1. I wondered whether this value was far too large, but lowering it made the results worse.
- dataloader_prefetch_factor=4
- dataloader_num_workers=15
- Set on the large side because tokenization and the batch sampler's sampling take time.
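As a compact summary of the list above, the settings can be collected into a plain dictionary. This is only a sketch: the key names are an assumed mapping onto Hugging Face-style training-argument names and are not taken from the actual training script, and the dataset size passed to `steps_per_epoch` is hypothetical.

```python
import math

# Values come from the hyperparameter notes in this section; the key names are
# an assumed mapping, not the author's actual configuration.
training_config = {
    "per_device_train_batch_size": 6072,  # upstream script used 2048
    "num_train_epochs": 2,                # upstream used 1
    "lr_scheduler_type": "cosine",        # upstream default: linear
    "optimizer": "AdamW",                 # unchanged from upstream
    "learning_rate": 2e-1,                # unchanged from upstream
    "batch_sampler": "NO_DUPLICATES",
    "dataloader_prefetch_factor": 4,
    "dataloader_num_workers": 15,
}


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Number of optimizer steps per epoch, counting a final partial batch."""
    return math.ceil(num_examples / batch_size)


# Hypothetical dataset size, only to show how few steps a huge batch implies.
hypothetical_examples = 10_000_000
print(steps_per_epoch(hypothetical_examples, training_config["per_device_train_batch_size"]))
```

With batches this large, an epoch consists of relatively few optimizer steps, which may be part of why an unusually high learning rate works in this setup.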
Training resources
- CPU
- Ryzen9 7950X
- GPU
- RTX4090
- memory
- 64GB
With these machine resources, training from scratch took about 4 hours. GPU core load was very low: while other Transformer models sit at around 90% during training, StaticEmbedding stayed near 0%. This is probably because most of the time is spent transferring the huge batches to GPU memory. Faster GPU memory bandwidth might therefore speed up training further.
Toward further improvements
The tokenizer used here was not specialized for StaticEmbedding, so a better-suited tokenizer might improve performance. Making the batch size even larger might also stabilize training and improve performance.
In addition, incorporating a wider range of text sources, for example data from various domains and synthetic datasets, can be expected to improve performance further.
Upstream training code
The code used for training is published below under the MIT license. Running the script should reproduce the results... hopefully!
License
The model weights and training code of static-embedding-japanese are released under the MIT license.