NaverHustQA/viLegal_bi

This is an encoder model for the Vietnamese legal domain. It maps legal queries and contexts to a 768-dimensional dense vector space and can be used for information retrieval.

We use vinai/phobert-base-v2 as the pre-trained backbone.

Usage (HuggingFace Transformers)

You can use the model as shown below (remember to word-segment the inputs first):

from transformers import AutoTokenizer, AutoModel
import torch

# CLS pooling: use the hidden state of the first token as the sentence embedding
def cls_pooling(model_output):
    return model_output['last_hidden_state'][:, 0, :]

# Sentences to embed. They must already be word-segmented,
# e.g. with pyvi, underthesea, or RDRSegmenter.
sentences = ['Uống rượu lái_xe bị phạt bao_nhiêu tiền ?', 'Bao_nhiêu tuổi phải làm CCCD ?', 'Uống rượu lái_xe bị phạt 500,000 đồng .']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('NaverHustQA/viLegal_bi')
model = AutoModel.from_pretrained('NaverHustQA/viLegal_bi')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling.
sentence_embeddings = cls_pooling(model_output)

print("Sentence embeddings:")
print(sentence_embeddings)
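
For retrieval, the embeddings of a query and of candidate contexts are typically compared with a similarity score, and the contexts are ranked by it. The following is a minimal sketch of that step, not the authors' evaluation code. It assumes cosine similarity, which this card does not confirm as the scoring function used in training. The helper names `cosine_similarity` and `rank_contexts` are hypothetical, and toy 3-dimensional vectors stand in for the 768-dimensional embeddings so the snippet runs without downloading the model.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def rank_contexts(
    query_vec: Sequence[float],
    context_vecs: Sequence[Sequence[float]],
) -> List[Tuple[int, float]]:
    """Return (context index, score) pairs, best match first."""
    scores = [
        (i, cosine_similarity(query_vec, vec))
        for i, vec in enumerate(context_vecs)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors (assumption: stand-ins for real 768-dimensional embeddings).
query = [1.0, 0.0, 0.0]
contexts = [[0.0, 1.0, 0.0], [2.0, 0.1, 0.0]]

ranking = rank_contexts(query, contexts)
print(ranking)
```

With the embeddings computed above, one could convert them with `vectors = sentence_embeddings.tolist()` and call `rank_contexts(vectors[0], vectors[1:])` to score the two remaining sentences against the first query. For large collections, a vectorized matrix product over normalized embeddings would be far faster than this pure-Python loop.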

Training

Full details of our training methods and datasets are available in our report.

Authors

Le Thanh Huong, Nguyen Nhat Quang.

Model size: 135M params (Safetensors, tensor type F32)