---
language:
- ja
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
---

# fio-base-japanese-v0.1

A Japanese version of this document is coming soon. (I am still studying Japanese, so please forgive any mistakes!)

fio-base-japanese-v0.1 is a proof of concept, and the first release of the Fio family of Japanese embeddings. It is based on [cl-tohoku/bert-base-japanese-v3](https://huggingface.co./cl-tohoku/bert-base-japanese-v3) and trained on a limited volume of data on a single GPU.

For more information, please refer to [my notes on Fio](https://ben.clavie.eu/fio).

#### Datasets

Similarity/Entailment:
- JSTS (train)
- JSNLI (train)
- JNLI (train)
- JSICK (train)

Retrieval:
- MMARCO (multilingual MS MARCO) (train, 124k sentence pairs, <1% of the full data)
- Mr. TyDi (train)
- MIRACL (train, 50% sample)
- ~~JSQuAD (train, 50% sample, no LLM enhancement)~~ JSQuAD is not used in the released version, so that it can serve as an unseen test set.

#### Results

This table is adapted and truncated (to keep only the most popular models) from [oshizo's benchmarking GitHub repo](https://github.com/oshizo/JapaneseEmbeddingEval); please check it out for more information, and give it a star, as it was very useful!

Italics denote the best model for its size when a smaller model outperforms a bigger one (base/large | 768/1024); bold denotes the best score overall.

| Model | JSTS valid-v1.1 | JSICK test | MIRACL dev | Average |
|-------------------------------------------------|-----------------|------------|------------|---------|
| bclavie/fio-base-japanese-v0.1 | **_0.863_** | **_0.894_** | 0.718 | _0.825_ |
| cl-nagoya/sup-simcse-ja-base | 0.809 | 0.827 | 0.527 | 0.721 |
| cl-nagoya/sup-simcse-ja-large | _0.831_ | _0.831_ | 0.507 | 0.723 |
| colorfulscoop/sbert-base-ja | 0.742 | 0.657 | 0.254 | 0.551 |
| intfloat/multilingual-e5-base | 0.796 | 0.806 | __0.845__ | 0.816 |
| intfloat/multilingual-e5-large | 0.819 | 0.794 | **0.883** | **_0.832_** |
| pkshatech/GLuCoSE-base-ja | 0.818 | 0.757 | 0.692 | 0.755 |
| text-embedding-ada-002 | 0.790 | 0.789 | 0.7232 | 0.768 |

## Usage

This model requires both `fugashi` and `unidic-lite`:

```
pip install -U fugashi unidic-lite
```

If using the model for a retrieval task, you must prefix your query with `"関連記事を取得するために使用できるこの文の表現を生成します: "` (see the sketch after the Sentence-Transformers example below).

### Usage (Sentence-Transformers)

This model is best used through [sentence-transformers](https://www.SBERT.net). If you don't have it, it's easy to install:

```
pip install -U sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

sentences = ["こんにちは、世界!", "文埋め込み最高!文埋め込み最高と叫びなさい", "極度乾燥しなさい"]

model = SentenceTransformer('bclavie/fio-base-japanese-v0.1')
embeddings = model.encode(sentences)
print(embeddings)
```
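For retrieval, the snippet below is a minimal sketch of applying the query prefix described above and ranking documents by cosine similarity with `sentence_transformers.util.cos_sim`. The query and documents are made-up examples for illustration only; per the instructions above, only the query is prefixed.

```python
from sentence_transformers import SentenceTransformer, util

# Query prefix required by this model for retrieval tasks (see the Usage section above).
RETRIEVAL_PREFIX = "関連記事を取得するために使用できるこの文の表現を生成します: "

model = SentenceTransformer('bclavie/fio-base-japanese-v0.1')

# Hypothetical query and documents, purely for illustration.
query = "日本で一番高い山は?"
documents = [
    "富士山は日本で最も高い山です。",
    "東京は日本の首都です。",
]

# Only the query gets the prefix; documents are encoded as-is.
query_embedding = model.encode(RETRIEVAL_PREFIX + query)
document_embeddings = model.encode(documents)

# Rank documents by cosine similarity to the query.
scores = util.cos_sim(query_embedding, document_embeddings)[0].tolist()
for doc, score in sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True):
    print(f"{score:.3f}\t{doc}")
```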
### Usage (HuggingFace Transformers)

Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation (here, CLS pooling) on top of the contextualized word embeddings.

```python
from transformers import AutoTokenizer, AutoModel
import torch


# CLS pooling: use the embedding of the first ([CLS]) token as the sentence embedding.
def cls_pooling(model_output, attention_mask):
    return model_output[0][:, 0]


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('bclavie/fio-base-japanese-v0.1')
model = AutoModel.from_pretrained('bclavie/fio-base-japanese-v0.1')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, CLS pooling.
sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```

## Citing & Authors

```bibtex
@misc{bclavie-fio-embeddings,
  author = {Benjamin Clavié},
  title = {Fio Japanese Embeddings},
  year = {2023},
  howpublished = {\url{https://ben.clavie.eu/fio}}
}
```