infly/inf-wse-v2-base-zh

INF Word-level Sparse Embedding v2 (INF-WSE-v2)

INF-WSE-v2 is the latest version of the word-level sparse embedding model developed by INF TECH.

Compared to INF-WSE-v1, INF-WSE-v2 continues pretraining on the Wudao corpus, starting from roformer_chinese_base, and adds enhanced token rewriting capabilities. These changes help the model produce more accurate, adaptable, and contextually relevant text embeddings, with a particular focus on Chinese.

Key Features:

Optimized for Retrieval: INF-WSE-v2 is specifically designed for information retrieval tasks. By leveraging sparse embeddings, the model ensures efficient matching between queries and documents, making it ideal for semantic search, ranking, and other retrieval scenarios where both speed and accuracy are essential.
Token Rewriting Capability: A new token rewriting feature allows INF-WSE-v2 to dynamically modify tokens during the embedding process. This improves the model’s ability to produce more accurate and contextually relevant representations, especially when dealing with complex linguistic structures and nuances in Chinese text.
Sparse Representation for Efficiency: Unlike traditional dense embeddings, in which nearly every dimension carries a non-zero value, INF-WSE-v2 produces sparse embeddings in which most dimensions are zero. Only the most significant dimensions are non-zero, which reduces computational load while maintaining high accuracy for retrieval tasks.
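
To illustrate why sparse vectors allow efficient query-document matching, here is a minimal sketch of an inverted index in plain Python. The document names, terms, and weights are hypothetical and are not model output; a real deployment would typically rely on a dedicated sparse index rather than this toy structure.

```python
from collections import defaultdict

# Hypothetical sparse vectors: only the non-zero terms are stored.
docs = {
    "doc_a": {"电脑": 1.5, "一体机": 2.0, "显示器": 0.5},
    "doc_b": {"电脑": 1.0, "掌上": 2.0, "手持": 0.75},
}
query = {"掌上": 1.5, "电脑": 1.0}


def build_inverted_index(docs):
    """Map each term to the list of (doc_id, weight) pairs that contain it."""
    index = defaultdict(list)
    for doc_id, vec in docs.items():
        for term, weight in vec.items():
            index[term].append((doc_id, weight))
    return index


def search(index, query):
    """Accumulate dot-product scores, touching only terms present in the query."""
    scores = defaultdict(float)
    for term, q_weight in query.items():
        for doc_id, d_weight in index.get(term, []):
            scores[doc_id] += q_weight * d_weight
    # Highest score first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


index = build_inverted_index(docs)
results = search(index, query)
print(results)
```

Only the posting lists of the query's own terms are visited, so the cost depends on the number of non-zero entries rather than on the full vocabulary size.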

Usage

Transformers

Infer embeddings

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

queries = ['电脑一体机由什么构成？', '什么是掌上电脑？']
documents = [
    '电脑一体机，是由一台显示器、一个电脑键盘和一个鼠标组成的电脑。',
    '掌上电脑是一种运行在嵌入式操作系统和内嵌式应用软件之上的、小巧、轻便、易带、实用、价廉的手持式计算设备。',
]
input_texts = queries + documents

tokenizer = AutoTokenizer.from_pretrained("infly/inf-wse-v2-base-zh", trust_remote_code=True, use_fast=False)  # The fast tokenizer is not supported yet
model = AutoModelForMaskedLM.from_pretrained("infly/inf-wse-v2-base-zh", trust_remote_code=True)
model.eval()

max_length = 512

input_batch = tokenizer(input_texts, padding=True, max_length=max_length, truncation=True, return_tensors="pt")
with torch.no_grad():
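    # return_sparse=True returns a sparse tensor; return_sparse=False returns a dense tensor
    embeddings = model(input_batch['input_ids'], input_batch['attention_mask'], return_sparse=False)

# Rows are ordered queries first, then documents, so split at len(queries)
n_queries = len(queries)
scores = embeddings[:n_queries] @ embeddings[n_queries:].T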
print(scores.tolist())
# [[25.137710571289062, 9.891149520874023], [11.703001976013184, 30.97362518310547]]
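
Sparse vector stores commonly take each vector as parallel lists of indices and values instead of a full vocabulary-sized row. The sketch below shows that conversion in plain Python. The helper name `to_index_value_pairs` and the 8-dimensional toy row are assumptions made for illustration; in practice the row could come from something like `embeddings[0].tolist()` in the snippet above.

```python
def to_index_value_pairs(row, eps=0.0):
    """Return (indices, values) for entries whose magnitude exceeds eps."""
    indices = [i for i, w in enumerate(row) if abs(w) > eps]
    values = [row[i] for i in indices]
    return indices, values


# Hypothetical 8-dimensional row standing in for one vocabulary-sized embedding.
row = [0.0, 0.0, 1.25, 0.0, 0.5, 0.0, 0.0, 2.0]
indices, values = to_index_value_pairs(row)
print(indices, values)
```

Raising `eps` above zero drops very small weights, which trades a little recall for a smaller index; whether that is worthwhile for this model is not stated in the source and would need to be measured.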

Convert embeddings to lexical weights

from collections import OrderedDict

def convert_embeddings_to_weights(embeddings, tokenizer):
    # Expects a dense tensor of shape (batch_size, vocab_size), i.e. return_sparse=False
    # Sort each row so the highest-weighted vocabulary entries come first
    values, indices = torch.sort(embeddings, dim=-1, descending=True)

    token2weight = []
    for i in range(embeddings.size(0)):
        # Keep only the non-zero dimensions of this row
        non_zero_mask = values[i] != 0
        tokens = tokenizer.convert_ids_to_tokens(indices[i][non_zero_mask].tolist())
        weights = values[i][non_zero_mask].tolist()

        # OrderedDict preserves the descending-weight order
        token2weight.append(OrderedDict(zip(tokens, weights)))

    return token2weight

token2weight = convert_embeddings_to_weights(embeddings, tokenizer)
print(token2weight[1])
# OrderedDict([('掌上', 1.9666814804077148), ('电脑', 1.4205719232559204), ('掌中', 1.2688857316970825), ('全称', 1.2548470497131348), ('to', 1.041936993598938), ('台式机', 0.9435897469520569), ('编程语言', 0.8740423917770386), ('pad', 0.8506593108177185), ('手持', 0.835372269153595), ('point', 0.8245767951011658), ('计算机', 0.8100651502609253), ('叫法', 0.8098558187484741), ('手部', 0.7246338725090027), ('手机', 0.6195603013038635), ('micro', 0.5971686244010925), ('电子产品', 0.5647062063217163), ('软件', 0.561561107635498), ('手指', 0.494046688079834), ('technology', 0.47637590765953064), ('pen', 0.4651668071746826), ('virtual', 0.4590775668621063), ('掌心', 0.4538556635379791), ('智能', 0.40049654245376587), ('智慧', 0.3949573338031769), ('touch', 0.38361087441444397), ('指向', 0.3723030686378479), ('移动', 0.3585004508495331), ('事物', 0.34118232131004333), ('电子元件', 0.3282782733440399), ('笔记本', 0.3156297206878662), ('原名', 0.3028894364833832), ('鼠标', 0.28492796421051025), ('android', 0.25649091601371765), ('指', 0.1655425727367401), ('掌握', 0.16021089255809784), ('chi', 0.15045176446437836), ('前臂', 0.11981695145368576), ('book', 0.09273456782102585), ('手掌', 0.07757095992565155), ('按键', 0.06321503221988678), ('小型', 0.05425526574254036), ('一体机', 0.04848058149218559), ('my', 0.03250341862440109), ('psp', 0.01875465363264084), ('跨平台', 0.01767222210764885), ('电脑游戏', 0.005152992904186249)])
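
Once embeddings are expressed as token-to-weight mappings, a query-document score can be computed directly from the tokens the two mappings share. The following sketch assumes each token corresponds to exactly one vocabulary entry, in which case the result should match the dot product used earlier. The function names `lexical_score` and `top_k`, as well as the weights, are hypothetical.

```python
from collections import OrderedDict


def lexical_score(query_weights, doc_weights):
    """Sum of weight products over tokens present in both mappings."""
    return sum(
        weight * doc_weights[token]
        for token, weight in query_weights.items()
        if token in doc_weights
    )


def top_k(weights, k):
    """Keep the k highest-weighted tokens, preserving descending order."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return OrderedDict(ranked[:k])


# Hypothetical weights, not actual model output.
query_weights = {"掌上": 2.0, "电脑": 1.5, "手持": 0.5}
doc_weights = {"掌上": 1.0, "电脑": 1.0, "设备": 0.75}

full_score = lexical_score(query_weights, doc_weights)
pruned_score = lexical_score(top_k(query_weights, 2), doc_weights)
print(full_score, pruned_score)
```

Inspecting which shared tokens contribute most to the score is one way to see how expansion terms (tokens that do not appear literally in the text) affect matching.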

Evaluation

C-MTEB Retrieval task

(Chinese Massive Text Embedding Benchmark)

Metric: nDCG@10

Model Name	Max Length	Average	Cmedqa	Covid	Du	Ecom	Medical	MMarco	T2	Video
BM25-zh	-	50.37	13.70	86.58	57.13	44.04	32.08	48.31	60.48	60.64
bge-m3-sparse	512	57.00	24.50	76.09	71.51	50.49	43.93	59.28	71.76	58.43
inf-wse-v1-base-zh	512	61.16	20.51	76.41	79.84	56.78	46.24	66.40	76.50	68.57
inf-wse-v2-base-zh	512	69.15	30.64	79.38	87.12	64.95	56.54	78.80	83.05	72.69

All results, except for BM25, are measured by building the sparse index via Qdrant.
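
For readers unfamiliar with the metric, the sketch below shows one common formulation of nDCG@k with linear gain and a log2 rank discount. It is an illustration only: the benchmark's own tooling may differ in details, and the relevance labels here are hypothetical. The table above appears to report scores scaled by 100.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k results (rank starts at 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances, k=10):
    """DCG of the returned ranking divided by the DCG of an ideal ranking."""
    ideal = sorted(all_relevances, reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / idcg if idcg > 0 else 0.0


# Hypothetical query with two relevant documents, retrieved at ranks 2 and 4.
ranked = [0, 1, 0, 1]
judged = [1, 1]
score = ndcg_at_k(ranked, judged, k=10)
print(round(score, 4))
```

A ranking that places every relevant document first scores 1.0, and relevant documents found at lower ranks contribute progressively less.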
