|
--- |
|
language: |
|
- ja |
|
pipeline_tag: sentence-similarity |
|
tags: |
|
- sentence-transformers |
|
- feature-extraction |
|
- sentence-similarity |
|
- transformers |
|
inference: false |
|
datasets: |
|
- shunk031/JGLUE |
|
- shunk031/jsnli |
|
- hpprc/jsick |
|
- miracl/miracl |
|
- castorini/mr-tydi |
|
- unicamp-dl/mmarco |
|
library_name: sentence-transformers |
|
--- |
|
|
|
# fio-base-japanese-v0.1 |
|
|
|
A Japanese version of this model card will be published soon. (I'm still studying Japanese, so please forgive any mistakes!)
|
|
|
fio-base-japanese-v0.1 is a proof of concept, and the first release of the Fio family of Japanese embeddings. It is based on [cl-tohoku/bert-base-japanese-v3](https://huggingface.co./cl-tohoku/bert-base-japanese-v3) and trained on a limited volume of data with a single GPU.
|
|
|
For more information, please refer to [my notes on Fio](https://ben.clavie.eu/fio). |
|
|
|
#### Datasets |
|
|
|
Similarity/Entailment: |
|
- JSTS (train) |
|
- JSNLI (train) |
|
- JNLI (train) |
|
- JSICK (train) |
|
|
|
Retrieval: |
|
- MMARCO (multilingual MS MARCO) (train, 124k sentence pairs, <1% of the full data)
|
- Mr. TyDi (train)
|
- MIRACL (train, 50% sample) |
|
- ~~JSQuAD (train, 50% sample, no LLM enhancement)~~ JSQuAD is not used in the released version, so that it can serve as an unseen test set.
|
|
|
#### Results |
|
|
|
This table is adapted and truncated (to keep only the most popular models) from [oshizo's benchmarking GitHub repo](https://github.com/oshizo/JapaneseEmbeddingEval). Please check it out for more information, and give it a star, as it was very useful!
|
|
|
Italics denote the best model for its size class (base/large, i.e. 768/1024 dimensions) when a smaller model outperforms a larger one; bold denotes the best overall.
|
|
|
| Model | JSTS valid-v1.1 | JSICK test | MIRACL dev | Average | |
|
|-------------------------------------------------|-----------------|------------|------------|---------| |
|
| bclavie/fio-base-japanese-v0.1 | **_0.863_** | **_0.894_** | 0.718 | _0.825_ | |
|
| cl-nagoya/sup-simcse-ja-base | 0.809 | 0.827 | 0.527 | 0.721 | |
|
| cl-nagoya/sup-simcse-ja-large | _0.831_ | _0.831_ | 0.507 | 0.723 | |
|
| colorfulscoop/sbert-base-ja | 0.742 | 0.657 | 0.254 | 0.551 | |
|
| intfloat/multilingual-e5-base | 0.796 | 0.806 | **0.845** | 0.816 |
|
| intfloat/multilingual-e5-large | 0.819 | 0.794 | **0.883** | **_0.832_** | |
|
| pkshatech/GLuCoSE-base-ja | 0.818 | 0.757 | 0.692 | 0.755 | |
|
| text-embedding-ada-002 | 0.790 | 0.789 | 0.723 | 0.768 |
|
|
|
|
|
|
|
## Usage |
|
|
|
This model requires both `fugashi` and `unidic-lite` for Japanese tokenization:
|
|
|
``` |
|
pip install -U fugashi unidic-lite |
|
``` |
|
|
|
If using this model for a retrieval task, you must prefix your query with `"関連記事を取得するために使用できるこの文の表現を生成します: "` (see the retrieval sketch after the Sentence-Transformers example below).
|
|
|
### Usage (Sentence-Transformers) |
|
|
|
This model is best used through [sentence-transformers](https://www.SBERT.net). If you don't have it, it's easy to install: |
|
|
|
``` |
|
pip install -U sentence-transformers |
|
``` |
|
|
|
Then you can use the model like this: |
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
sentences = ["こんにちは、世界!", "文埋め込み最高!文埋め込み最高と叫びなさい", "極度乾燥しなさい"] |
|
|
|
model = SentenceTransformer('bclavie/fio-base-japanese-v0.1') |
|
embeddings = model.encode(sentences) |
|
print(embeddings) |
|
``` |
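For retrieval, the prefix mentioned above is prepended to the query only, not to the documents. Below is a minimal sketch of this; the query and document strings are purely illustrative, and scoring uses sentence-transformers' `util.cos_sim` helper:

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('bclavie/fio-base-japanese-v0.1')

# The retrieval prefix is added to the query only, never to the documents.
prefix = "関連記事を取得するために使用できるこの文の表現を生成します: "
query = "日本語の文埋め込みモデル"  # illustrative query
documents = [
    "fio-base-japanese-v0.1は日本語の文埋め込みモデルです。",
    "今日はいい天気ですね。",
]  # illustrative corpus

query_embedding = model.encode(prefix + query, convert_to_tensor=True)
document_embeddings = model.encode(documents, convert_to_tensor=True)

# Rank documents by cosine similarity to the query.
scores = util.cos_sim(query_embedding, document_embeddings)[0]
for doc, score in sorted(zip(documents, scores.tolist()), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.3f}\t{doc}")
```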
|
|
|
|
|
### Usage (HuggingFace Transformers) |
|
Without [sentence-transformers](https://www.SBERT.net), you can use the model as follows: first, pass your input through the transformer model, then apply the right pooling operation (here, CLS pooling) on top of the contextualized token embeddings.
|
|
|
```python |
|
from transformers import AutoTokenizer, AutoModel |
|
import torch |
|
|
|
|
|
def cls_pooling(model_output, attention_mask): |
|
return model_output[0][:,0] |
|
|
|
|
|
# Sentences we want sentence embeddings for |
|
sentences = ["こんにちは、世界!", "文埋め込み最高!文埋め込み最高と叫びなさい", "極度乾燥しなさい"]
|
|
|
# Load model from HuggingFace Hub |
|
tokenizer = AutoTokenizer.from_pretrained('bclavie/fio-base-japanese-v0.1')
|
model = AutoModel.from_pretrained('bclavie/fio-base-japanese-v0.1')
|
|
|
# Tokenize sentences |
|
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt') |
|
|
|
# Compute token embeddings |
|
with torch.no_grad(): |
|
model_output = model(**encoded_input) |
|
|
|
# Perform pooling. In this case, cls pooling. |
|
sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask']) |
|
|
|
print("Sentence embeddings:") |
|
print(sentence_embeddings) |
|
``` |
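To turn these embeddings into sentence similarities, you can compute cosine similarity directly; a short sketch continuing from the snippet above:

```python
import torch.nn.functional as F

# L2-normalize the embeddings; the dot product then gives pairwise cosine similarities.
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
similarities = normalized @ normalized.T
print(similarities)
```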
|
|
|
## Citing & Authors |
|
|
|
```bibtex
@misc{bclavie-fio-embeddings,
  author = {Benjamin Clavié},
  title = {Fio Japanese Embeddings},
  year = {2023},
  howpublished = {\url{https://ben.clavie.eu/fio}}
}
```