---
language:
- zh
base_model: junnyu/roformer_chinese_base
tags:
- transformers
---

## INF Word-level Sparse Embedding v2 (INF-WSE-v2)

**INF-WSE-v2** is the latest version of the word-level sparse embedding model developed by [INF TECH](https://www.infly.cn/en). Compared to [INF-WSE-v1](https://huggingface.co./infly/inf-wse-v1-base-zh), INF-WSE-v2 is further pretrained on the [Wudao](https://huggingface.co./datasets/p208p2002/wudao) corpus (from [roformer_chinese_base](https://huggingface.co./junnyu/roformer_chinese_base)) and introduces enhanced token rewriting capabilities. These changes help the model generate more accurate, adaptable, and contextually relevant text embeddings, with a particular focus on Chinese language processing.

### Key Features:

- **Optimized for Retrieval**: INF-WSE-v2 is specifically designed for information retrieval tasks. By leveraging sparse embeddings, the model matches queries and documents efficiently, making it well suited to semantic search, ranking, and other retrieval scenarios where both speed and accuracy are essential.

- **Token Rewriting Capability**: A new token rewriting feature allows INF-WSE-v2 to dynamically modify tokens during the embedding process. This improves the model’s ability to produce accurate and contextually relevant representations, especially when dealing with complex linguistic structures and nuances in Chinese text.

- **Sparse Representation for Efficiency**: Unlike traditional dense embeddings, in which every dimension carries a value, INF-WSE-v2 produces sparse embeddings in which most dimensions are zero and only the most significant ones are non-zero. This reduces computational load while maintaining high accuracy for retrieval tasks.

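
To make the efficiency argument above concrete, here is a minimal sketch of how a sparse vector can be stored and scored. It is an illustration only, not the model's or any index's actual implementation: the helper names `to_sparse` and `sparse_dot` are hypothetical, and the toy vectors are made-up values rather than model output.

```python
from typing import Dict, List


def to_sparse(dense: List[float]) -> Dict[int, float]:
    """Keep only the non-zero dimensions as {dimension index: weight}."""
    return {i: w for i, w in enumerate(dense) if w != 0.0}


def sparse_dot(query: Dict[int, float], doc: Dict[int, float]) -> float:
    """Dot product that only visits dimensions present in both vectors."""
    # Iterate over the smaller vector so the number of lookups stays low.
    if len(query) > len(doc):
        query, doc = doc, query
    return sum(w * doc[i] for i, w in query.items() if i in doc)


# Toy 8-dimensional vectors (made-up values, not model output).
query_dense = [0.0, 2.0, 0.0, 0.0, 1.5, 0.0, 0.0, 0.0]
doc_dense = [0.0, 1.0, 0.0, 3.0, 2.0, 0.0, 0.0, 0.0]

query_sparse = to_sparse(query_dense)
doc_sparse = to_sparse(doc_dense)

score = sparse_dot(query_sparse, doc_sparse)
print(query_sparse)  # {1: 2.0, 4: 1.5}
print(score)  # 5.0
```

The score equals the ordinary dense dot product, but the work scales with the number of non-zero entries rather than with the full vocabulary-sized dimensionality.
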
## Usage

### Transformers

#### Infer embeddings

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Queries: "What does an all-in-one computer consist of?", "What is a handheld computer?"
queries = ['电脑一体机由什么构成?', '什么是掌上电脑?']
# Documents: one passage describing an all-in-one computer, one describing a handheld computer.
documents = [
    '电脑一体机,是由一台显示器、一个电脑键盘和一个鼠标组成的电脑。',
    '掌上电脑是一种运行在嵌入式操作系统和内嵌式应用软件之上的、小巧、轻便、易带、实用、价廉的手持式计算设备。',
]
input_texts = queries + documents

# The fast tokenizer is not supported yet, hence use_fast=False.
tokenizer = AutoTokenizer.from_pretrained("infly/inf-wse-v2-base-zh", trust_remote_code=True, use_fast=False)
model = AutoModelForMaskedLM.from_pretrained("infly/inf-wse-v2-base-zh", trust_remote_code=True)
model.eval()

max_length = 512

input_batch = tokenizer(input_texts, padding=True, max_length=max_length, truncation=True, return_tensors="pt")
with torch.no_grad():
    # return_sparse=True returns a sparse tensor; return_sparse=False returns a dense tensor.
    embeddings = model(input_batch['input_ids'], input_batch['attention_mask'], return_sparse=False)

# Rows 0-1 are the queries, rows 2-3 are the documents.
scores = embeddings[:2] @ embeddings[2:].T
print(scores.tolist())
# [[25.137710571289062, 9.891149520874023], [11.703001976013184, 30.97362518310547]]
```

#### Convert embeddings to lexical weights

```python
from collections import OrderedDict

def convert_embeddings_to_weights(embeddings, tokenizer):
    # Sort each embedding so the highest-weighted vocabulary entries come first.
    values, indices = torch.sort(embeddings, dim=-1, descending=True)
    token2weight = []
    for i in range(embeddings.size(0)):
        token2weight.append(OrderedDict())
        # Keep only the non-zero dimensions and map their ids back to tokens.
        non_zero_mask = values[i] != 0
        tokens = tokenizer.convert_ids_to_tokens(indices[i][non_zero_mask])
        weights = values[i][non_zero_mask].tolist()
        for token, weight in zip(tokens, weights):
            token2weight[i][token] = weight
    return token2weight

token2weight = convert_embeddings_to_weights(embeddings, tokenizer)
print(token2weight[1])
# OrderedDict([('掌上', 1.9666814804077148), ('电脑', 1.4205719232559204), ('掌中', 1.2688857316970825), ('全称', 1.2548470497131348), ('to', 1.041936993598938), ('台式机', 0.9435897469520569), ('编程语言', 0.8740423917770386), ('pad', 0.8506593108177185), ('手持', 0.835372269153595), ('point', 0.8245767951011658), ('计算机', 0.8100651502609253), ('叫法', 0.8098558187484741), ('手部', 0.7246338725090027), ('手机', 0.6195603013038635), ('micro', 0.5971686244010925), ('电子产品', 0.5647062063217163), ('软件', 0.561561107635498), ('手指', 0.494046688079834), ('technology', 0.47637590765953064), ('pen', 0.4651668071746826), ('virtual', 0.4590775668621063), ('掌心', 0.4538556635379791), ('智能', 0.40049654245376587), ('智慧', 0.3949573338031769), ('touch', 0.38361087441444397), ('指向', 0.3723030686378479), ('移动', 0.3585004508495331), ('事物', 0.34118232131004333), ('电子元件', 0.3282782733440399), ('笔记本', 0.3156297206878662), ('原名', 0.3028894364833832), ('鼠标', 0.28492796421051025), ('android', 0.25649091601371765), ('指', 0.1655425727367401), ('掌握', 0.16021089255809784), ('chi', 0.15045176446437836), ('前臂', 0.11981695145368576), ('book', 0.09273456782102585), ('手掌', 0.07757095992565155), ('按键', 0.06321503221988678), ('小型', 0.05425526574254036), ('一体机', 0.04848058149218559), ('my', 0.03250341862440109), ('psp', 0.01875465363264084), ('跨平台', 0.01767222210764885), ('电脑游戏', 0.005152992904186249)])
```

## Evaluation

### C-MTEB Retrieval task ([Chinese Massive Text Embedding Benchmark](https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB))

Metric: nDCG@10

| Model Name | Max Length | Average | Cmedqa | Covid | Du | Ecom | Medical | MMarco | T2 | Video |
|:----------:|:----------:|:-------:|:------:|:-----:|:--:|:----:|:-------:|:------:|:--:|:-----:|
| [BM25-zh](https://github.com/castorini/pyserini) | - | 50.37 | 13.70 | **86.58** | 57.13 | 44.04 | 32.08 | 48.31 | 60.48 | 60.64 |
| [bge-m3-sparse](https://huggingface.co./BAAI/bge-m3) | 512 | 57.00 | 24.50 | 76.09 | 71.51 | 50.49 | 43.93 | 59.28 | 71.76 | 58.43 |
| [inf-wse-v1-base-zh](https://huggingface.co./infly/inf-wse-v1-base-zh) | 512 | 61.16 | 20.51 | 76.41 | 79.84 | 56.78 | 46.24 | 66.40 | 76.50 | 68.57 |
| **inf-wse-v2-base-zh** | 512 | **69.15** | **30.64** | 79.38 | **87.12** | **64.95** | **56.54** | **78.80** | **83.05** | **72.69** |

All results, except for BM25, are measured by building the sparse index via [Qdrant](https://github.com/qdrant/qdrant).
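
For readers unfamiliar with the metric reported above, the following is a minimal sketch of nDCG@10 using the common linear-gain formulation. It is not the benchmark's evaluation code, which may differ in details such as the gain function or tie handling; the function names and the toy relevance judgments are hypothetical. The table values appear to be this quantity scaled by 100.

```python
import math
from typing import List


def dcg_at_k(relevances: List[float], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    # rank is 0-based, so the discount for the first result is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: List[float], all_relevances: List[float], k: int = 10) -> float:
    """DCG of the system ranking divided by the DCG of the ideal ranking."""
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical query with two relevant documents; the system ranks them 1st and 3rd.
ranked = [1, 0, 1, 0, 0]
judged = [1, 1]
print(round(ndcg_at_k(ranked, judged), 4))  # 0.9197
```

A perfect ranking scores 1.0, and placing relevant documents lower in the top 10 reduces the score logarithmically.
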