Quants?
#1
by
ctranslate2-4you
- opened
Is this something that could be quantized perhaps and do you know if it's compatible with llama.CPP. And would it work with Transformers.JS and ONNX runtime in general?
Hello!
It's not compatible with Llama.cpp I'm afraid; that's likely not going to happen either, this model is just too different.
Support for Transformers.js is more plausible (cc
@Xenova
) because we already have support for Model2Vec models, and these are quite similar. However, I can't be sure that it'll be added.
- Tom Aarsen
Done: https://huggingface.co./sentence-transformers/static-retrieval-mrl-en-v1/discussions/2
Conversion code:
import torch
from sentence_transformers import SentenceTransformer
class WrappedModel(torch.nn.Module):
def __init__(self, m):
super().__init__()
self.embedding = m[0].embedding
def forward(self, input_ids, attention_mask):
indices = input_ids[attention_mask == 1]
offsets = torch.cat([torch.tensor([0]), attention_mask.sum(dim=1)[:-1].cumsum(dim=0)])
return self.embedding(indices, offsets)
shape = (3, 4)
input_ids = torch.tensor([1, 2, 3, 4, 5, 6, -1, -1, 1, 1, 1, 0]).view(shape)
attention_mask = torch.tensor([1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0]).view(shape)
model = SentenceTransformer("tomaarsen/static-retrieval-mrl-en-v1")
wrapped = WrappedModel(model) # test forward pass
# Export the model
torch.onnx.export(wrapped,
(input_ids, attention_mask),
"model.onnx",
export_params=True,
opset_version=14,
do_constant_folding=True,
input_names = ['input_ids', 'attention_mask'],
output_names = ['sentence_embedding'],
dynamic_axes={
'input_ids' : {0 : 'batch_size', 1: 'sequence_length'},
'attention_mask' : {0 : 'batch_size', 1: 'sequence_length'},
'sentence_embedding' : {0 : 'batch_size'},
})