---
language:
- ja
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
inference: false
datasets:
- shunk031/JGLUE
- shunk031/jsnli
- hpprc/jsick
- miracl/miracl
- castorini/mr-tydi
- unicamp-dl/mmarco
library_name: sentence-transformers
---
# fio-base-japanese-v0.1
A Japanese version is coming soon (I'm still studying Japanese, so please forgive any mistakes!).
fio-base-japanese-v0.1 is a proof of concept, and the first release of the Fio family of Japanese embeddings. It is based on [cl-tohoku/bert-base-japanese-v3](https://huggingface.co./cl-tohoku/bert-base-japanese-v3) and trained on limited volumes of data on a single GPU.
For more information, please refer to [my notes on Fio](https://ben.clavie.eu/fio).
#### Datasets
Similarity/Entailment:
- JSTS (train)
- JSNLI (train)
- JNLI (train)
- JSICK (train)
Retrieval:
- mMARCO (multilingual MS MARCO) (train, 124k sentence pairs, <1% of the full data)
- Mr. TyDi (train)
- MIRACL (train, 50% sample)
- ~~JSQuAD (train, 50% sample, no LLM enhancement)~~ JSQuAD is not used in the released version, to serve as an unseen test set.
#### Results
This table is adapted and truncated (keeping only the most popular models) from [oshizo's benchmarking GitHub repo](https://github.com/oshizo/JapaneseEmbeddingEval); please check it out for more information and give it a star, as it was very useful!
Italics denote the best model for its size (base/large | 768/1024 dimensions) when a smaller model outperforms a larger one; bold denotes the best overall.
| Model | JSTS valid-v1.1 | JSICK test | MIRACL dev | Average |
|-------------------------------------------------|-----------------|------------|------------|---------|
| bclavie/fio-base-japanese-v0.1 | **_0.863_** | **_0.894_** | 0.718 | _0.825_ |
| cl-nagoya/sup-simcse-ja-base | 0.809 | 0.827 | 0.527 | 0.721 |
| cl-nagoya/sup-simcse-ja-large | _0.831_ | _0.831_ | 0.507 | 0.723 |
| colorfulscoop/sbert-base-ja | 0.742 | 0.657 | 0.254 | 0.551 |
| intfloat/multilingual-e5-base                    | 0.796           | 0.806      | **0.845**  | 0.816   |
| intfloat/multilingual-e5-large | 0.819 | 0.794 | **0.883** | **_0.832_** |
| pkshatech/GLuCoSE-base-ja | 0.818 | 0.757 | 0.692 | 0.755 |
| text-embedding-ada-002                           | 0.790           | 0.789      | 0.723      | 0.768   |
## Usage
This model requires both `fugashi` and `unidic-lite`:
```
pip install -U fugashi unidic-lite
```
If using the model for a retrieval task, you must prefix your query with `"関連記事を取得するために使用できるこの文の表現を生成します: "` (see the retrieval example after the Sentence-Transformers snippet below).
### Usage (Sentence-Transformers)
This model is best used through [sentence-transformers](https://www.SBERT.net). If you don't have it, it's easy to install:
```
pip install -U sentence-transformers
```
Then you can use the model like this:
```python
from sentence_transformers import SentenceTransformer
sentences = ["こんにちは、世界!", "文埋め込み最高!文埋め込み最高と叫びなさい", "極度乾燥しなさい"]
model = SentenceTransformer('bclavie/fio-base-japanese-v0.1')
embeddings = model.encode(sentences)
print(embeddings)
```
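If you're using the model for retrieval, remember to prepend the query prefix mentioned above. The following is a minimal sketch (the query and documents are made up for illustration): it encodes one prefixed query and a couple of documents, then scores the documents by cosine similarity with `sentence_transformers.util.cos_sim`:
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('bclavie/fio-base-japanese-v0.1')

# Queries must carry the retrieval prefix; documents are encoded as-is.
prefix = "関連記事を取得するために使用できるこの文の表現を生成します: "
query = "日本の首都はどこですか?"
documents = ["東京は日本の首都です。", "大阪は食文化で有名です。"]

query_embedding = model.encode(prefix + query)
document_embeddings = model.encode(documents)

# Cosine similarity between the query and each document (higher = more relevant).
scores = util.cos_sim(query_embedding, document_embeddings)
print(scores)
```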
### Usage (HuggingFace Transformers)
Without [sentence-transformers](https://www.SBERT.net), you can use the model as follows: first, pass your input through the transformer model, then apply the appropriate pooling operation (here, CLS pooling) on top of the contextualized word embeddings.
```python
from transformers import AutoTokenizer, AutoModel
import torch


def cls_pooling(model_output, attention_mask):
    # CLS pooling: use the embedding of the first ([CLS]) token as the sentence embedding
    return model_output[0][:, 0]


# Sentences we want sentence embeddings for
sentences = ['こんにちは、世界!', '文埋め込み最高!文埋め込み最高と叫びなさい']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('bclavie/fio-base-japanese-v0.1')
model = AutoModel.from_pretrained('bclavie/fio-base-japanese-v0.1')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, CLS pooling.
sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
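To compare sentences from the raw Transformers output, a common approach (not specific to this model) is to L2-normalise the pooled embeddings so that their dot products are cosine similarities. A minimal sketch, reusing `sentence_embeddings` from the snippet above:
```python
import torch.nn.functional as F

# L2-normalise the embeddings so dot products equal cosine similarities.
normalized_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
similarity_matrix = normalized_embeddings @ normalized_embeddings.T
print(similarity_matrix)
```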
## Citing & Authors
```
@misc{bclavie-fio-embeddings,
  author = {Benjamin Clavié},
  title = {Fio Japanese Embeddings},
  year = {2023},
  howpublished = {\url{https://ben.clavie.eu/fio}}
}
```