---
language:
- ja
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
inference: false
datasets:
- shunk031/JGLUE
- shunk031/jsnli
- hpprc/jsick
- miracl/miracl
- castorini/mr-tydi
- unicamp-dl/mmarco
library_name: sentence-transformers
---

# fio-base-japanese-v0.1

A Japanese version of this card is coming soon (I'm still learning Japanese, so please forgive any mistakes!)

fio-base-japanese-v0.1 is a proof of concept and the first release of the Fio family of Japanese embeddings. It is based on [cl-tohoku/bert-base-japanese-v3](https://huggingface.co/cl-tohoku/bert-base-japanese-v3) and was trained on limited volumes of data on a single GPU.

For more information, please refer to [my notes on Fio](https://ben.clavie.eu/fio).

#### Datasets

Similarity/Entailment:
- JSTS (train)
- JSNLI (train)
- JNLI (train)
- JSICK (train)

Retrieval:
- MMARCO (Multilingual Marco) (train, 124k sentence pairs, <1% of the full data)
- Mr.TyDI (train)
- MIRACL (train, 50% sample)
- ~~JSQuAD (train, 50% sample, no LLM enhancement)~~ JSQuAD is not used in the released version, so that it can serve as an unseen test set.

#### Results

This table is adapted and truncated (to keep only the most popular models) from [oshizo's benchmarking GitHub repo](https://github.com/oshizo/JapaneseEmbeddingEval). Please check it out for more information, and give it a star, as it was very useful!

Italics denote the best model for its size (base/large, 768/1024 dimensions) when a smaller model outperforms a bigger one; bold denotes the best overall.

| Model                                           | JSTS valid-v1.1 | JSICK test | MIRACL dev | Average |
|-------------------------------------------------|-----------------|------------|------------|---------|
| bclavie/fio-base-japanese-v0.1                  | **_0.863_**           | **_0.894_**     | 0.718        | _0.825_     |
| cl-nagoya/sup-simcse-ja-base                    | 0.809           | 0.827      | 0.527      | 0.721   |
| cl-nagoya/sup-simcse-ja-large                   | _0.831_           | _0.831_      | 0.507      | 0.723   |
| colorfulscoop/sbert-base-ja                     | 0.742           | 0.657      | 0.254      | 0.551   |
| intfloat/multilingual-e5-base                   | 0.796           | 0.806      | _0.845_      | 0.816   |
| intfloat/multilingual-e5-large                  | 0.819           | 0.794      | **0.883**      | **_0.832_**   |
| pkshatech/GLuCoSE-base-ja                       | 0.818           | 0.757      | 0.692      | 0.755   |
| text-embedding-ada-002                          | 0.790           | 0.789      | 0.723      | 0.768   |



## Usage

This model requires both `fugashi` and `unidic-lite`:

```
pip install -U fugashi unidic-lite
```

If using for a retrieval task, you must prefix your query with `"関連記事を取得するために使用できるこの文の表現を生成します: "`.
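
As a minimal sketch of the prefixing rule above: the helper names `add_query_prefix` and `encode_for_retrieval` are hypothetical and not part of any library, and encoding documents without the prefix is an assumption, since this card only specifies the prefix for queries. `model` is assumed to be an already loaded `SentenceTransformer`.

```python
# Query prefix quoted from this model card.
QUERY_PREFIX = "関連記事を取得するために使用できるこの文の表現を生成します: "


def add_query_prefix(query: str) -> str:
    """Prepend the retrieval prefix unless the query already carries it."""
    if query.startswith(QUERY_PREFIX):
        return query
    return QUERY_PREFIX + query


def encode_for_retrieval(model, queries, documents):
    """Encode prefixed queries and unprefixed documents.

    `model` is assumed to expose `encode(list_of_str)`, as a
    SentenceTransformer does. Leaving documents unprefixed is an
    assumption; the card only states the rule for queries.
    """
    query_embeddings = model.encode([add_query_prefix(q) for q in queries])
    document_embeddings = model.encode(list(documents))
    return query_embeddings, document_embeddings
```

Keeping the prefix in a single constant avoids subtle mismatches (for example, a missing trailing space) between the string the model expects and the one actually sent.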

### Usage (Sentence-Transformers)

This model is best used through [sentence-transformers](https://www.SBERT.net). If you don't have it, it's easy to install:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer
sentences = ["こんにちは、世界！", "文埋め込み最高！文埋め込み最高と叫びなさい", "極度乾燥しなさい"]

model = SentenceTransformer('bclavie/fio-base-japanese-v0.1')
embeddings = model.encode(sentences)
print(embeddings)
```
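
Once you have embeddings, a common next step is to compare them. The sketch below ranks candidates by cosine similarity; using cosine similarity is an assumption here, as this card does not name a recommended metric. The helpers `cosine_similarity`, `rank_by_similarity`, and `demo` are hypothetical names written in plain Python so the logic is easy to follow. In practice, `sentence_transformers.util.cos_sim` does the same job on whole batches.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen in this sketch for zero vectors
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_embedding, candidate_embeddings):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [cosine_similarity(query_embedding, c) for c in candidate_embeddings]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


def demo():
    """Hypothetical usage; needs sentence-transformers and downloads the model."""
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("bclavie/fio-base-japanese-v0.1")
    sentences = ["こんにちは、世界！", "文埋め込み最高！文埋め込み最高と叫びなさい", "極度乾燥しなさい"]
    embeddings = model.encode(sentences)
    # Compare the first sentence against the remaining ones.
    for index, score in rank_by_similarity(embeddings[0], embeddings[1:]):
        print(sentences[index + 1], round(score, 3))
```

Calling `demo()` runs the comparison on the example sentences from the snippet above.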


### Usage (HuggingFace Transformers)

Without [sentence-transformers](https://www.SBERT.net), you can use the model directly through Transformers: first pass your input through the transformer model, then apply the appropriate pooling operation (here, CLS pooling) on top of the contextualized word embeddings.

```python
from transformers import AutoTokenizer, AutoModel
import torch


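def cls_pooling(model_output, attention_mask):
    # model_output[0] is the last hidden state (batch, tokens, hidden);
    # index 0 along the token axis selects the [CLS] token embedding.
    # attention_mask is not needed for CLS pooling.
    return model_output[0][:, 0]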


# Sentences we want sentence embeddings for
sentences = ["こんにちは、世界！", "文埋め込み最高！文埋め込み最高と叫びなさい"]

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('bclavie/fio-base-japanese-v0.1')
model = AutoModel.from_pretrained('bclavie/fio-base-japanese-v0.1')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, cls pooling.
sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```

## Citing & Authors

```bibtex
@misc{bclavie-fio-embeddings,
  author = {Benjamin Clavié},
  title = {Fio Japanese Embeddings},
  year = {2023},
  howpublished = {\url{https://ben.clavie.eu/fio}}
}
```