Note

This is an updated version of KennethTM/MiniLM-L6-danish-encoder. It is trained on more data (the GooAQ dataset translated to Danish) and is otherwise the same.

MiniLM-L6-danish-encoder

This is a lightweight (~22 M parameters) sentence-transformers model for Danish NLP. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks such as clustering or semantic search.

The maximum sequence length is 512 tokens.

The model was not pre-trained from scratch. It was adapted from the English model sentence-transformers/all-MiniLM-L6-v2 and given a Danish tokenizer.

The model was trained on ELI5 and SQuAD data machine-translated from English to Danish.
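
Semantic search in a dense vector space like this one usually compares embeddings by cosine similarity, which is also what the usage examples in this card compute. As a minimal sketch of that measure in plain Python, the function name `cosine_similarity` and the short toy vectors are illustrative assumptions, and real embeddings from this model have 384 dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 2-dimensional vectors standing in for real embeddings (illustrative only).
same_direction = cosine_similarity([3.0, 4.0], [6.0, 8.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Vectors pointing in the same direction score 1 regardless of their length, and orthogonal vectors score 0, which is why the measure suits comparing embeddings of texts with different lengths.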

Usage (Sentence-Transformers)

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

Then you can use the model like this:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Given a query
query = ['Kører der cykler på vejen?']

# And two passages
passage = ['I Danmark er cykler et almindeligt transportmiddel, og de har lige så stor ret til at bruge vejene som bilister. Cyklister skal dog følge færdselsreglerne og vise hensyn til andre trafikanter.', 
           'Solen skinner, og himlen er blå. Der er ingen vind, og temperaturen er perfekt. Det er den perfekte dag til at tage en tur på landet og nyde den friske luft.']

# Compute embeddings
model = SentenceTransformer("KennethTM/MiniLM-L6-danish-encoder-v2")
query_embeddings = model.encode(query)
passage_embeddings = model.encode(passage)

# Find the most relevant passage for the query (closer to 1 means more similar)
cosine_scores = cos_sim(query_embeddings, passage_embeddings)
print(cosine_scores)
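
The score matrix has one row per query and one column per passage. A minimal sketch of turning one such row into a ranking is shown here. The helper `rank_passages`, the short passage labels, and the score values are made up for illustration and are not real model output. With the real tensor, `cosine_scores.tolist()` would give the nested list.

```python
# Illustrative values only; real scores come from cos_sim(query_embeddings, passage_embeddings).
passage_labels = ["cykler", "vejret"]
scores = [[0.71, 0.12]]  # shape (1 query, 2 passages), hypothetical numbers

def rank_passages(score_row, labels):
    """Return (label, score) pairs sorted from most to least similar."""
    order = sorted(range(len(labels)), key=lambda i: score_row[i], reverse=True)
    return [(labels[i], score_row[i]) for i in order]

ranked = rank_passages(scores[0], passage_labels)
best_label, best_score = ranked[0]
print(best_label, best_score)
```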

Usage (HuggingFace Transformers)

Without sentence-transformers, you can use the model as follows: first pass the input through the transformer model, then apply mean pooling on top of the contextualized token embeddings.

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    # First element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("KennethTM/MiniLM-L6-danish-encoder-v2")
model = AutoModel.from_pretrained("KennethTM/MiniLM-L6-danish-encoder-v2")

# Given a query
query = ['Kører der cykler på vejen?']

# And two passages
passage = ['I Danmark er cykler et almindeligt transportmiddel, og de har lige så stor ret til at bruge vejene som bilister. Cyklister skal dog følge færdselsreglerne og vise hensyn til andre trafikanter.', 
           'Solen skinner, og himlen er blå. Der er ingen vind, og temperaturen er perfekt. Det er den perfekte dag til at tage en tur på landet og nyde den friske luft.']

# Tokenize sentences
query_encoded = tokenizer(query, padding=True, truncation=True, return_tensors='pt')
passage_encoded = tokenizer(passage, padding=True, truncation=True, return_tensors='pt')

# Compute embeddings
with torch.no_grad():
    query_features = model(**query_encoded)
    passage_features = model(**passage_encoded)

# Perform pooling
query_embeddings = mean_pooling(query_features, query_encoded['attention_mask'])
passage_embeddings = mean_pooling(passage_features, passage_encoded['attention_mask'])

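# Find the most relevant passage for the query (closer to 1 means more similar).
# The single query embedding broadcasts against both passage embeddings,
# giving one score per passage.
cosine_scores = F.cosine_similarity(query_embeddings, passage_embeddings)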
print(cosine_scores)
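
To see why the pooling step uses the attention mask, the following is a minimal plain-Python sketch of the same masked average on one toy sequence. The function name `masked_mean` and all numbers are illustrative assumptions; the last token stands in for a padding position, whose vector should not influence the result.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j in range(dim):
                totals[j] += vector[j]
    count = max(count, 1e-9)  # guard against an all-zero mask, as in the clamp above
    return [t / count for t in totals]

# Toy sequence: two real tokens followed by one padding token (illustrative values).
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)          # padding ignored
unmasked = masked_mean(tokens, [1, 1, 1])   # padding wrongly included
print(pooled, unmasked)
```

The masked result depends only on the two real tokens, while the unmasked average is pulled toward the padding vector. This matters whenever a batch contains texts of different lengths, since shorter texts are padded.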