ember-v1
This model was trained on a large corpus of text pairs spanning a broad range of domains, including finance, science, medicine, and law. Training incorporated techniques from the RetroMAE and SetFit research papers.
Plans
- The research paper will be published soon.
- Version 2 of the model is in development and will extend the maximum sequence length to 4,000 tokens.
Usage
Use with transformers:
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

input_texts = [
    "This is an example sentence",
    "Each sentence is converted"
]

tokenizer = AutoTokenizer.from_pretrained("llmrails/ember-v1")
model = AutoModel.from_pretrained("llmrails/ember-v1")

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
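To make the pooling and scoring steps above easier to follow, here is a minimal, dependency-free sketch of the same arithmetic on toy vectors: a masked mean over token vectors, L2 normalization, and a dot product scaled by 100. The helper names (`masked_mean`, `l2_normalize`, `dot`) and the two-dimensional toy vectors are assumptions made for illustration only; they are not part of the model or of any library.

```python
import math

def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1 (padding is ignored)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Toy "hidden states": three token vectors per text, last position is padding.
tokens_a = [[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]]
mask_a = [1, 1, 0]
tokens_b = [[2.0, 4.0], [4.0, 4.0], [7.0, 7.0]]
mask_b = [1, 1, 0]

pooled_a = masked_mean(tokens_a, mask_a)  # padded vector [9.0, 9.0] does not contribute
pooled_b = masked_mean(tokens_b, mask_b)

# After normalization the dot product equals the cosine similarity.
score = dot(l2_normalize(pooled_a), l2_normalize(pooled_b)) * 100
print(pooled_a, pooled_b, score)
```

Because the padded positions are masked out before averaging, changing the padding vectors leaves the pooled result unchanged, which is the purpose of the `attention_mask` argument in `average_pool`.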
Use with sentence-transformers:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

sentences = [
    "This is an example sentence",
    "Each sentence is converted"
]

model = SentenceTransformer('llmrails/ember-v1')
embeddings = model.encode(sentences)
print(cos_sim(embeddings[0], embeddings[1]))
Massive Text Embedding Benchmark (MTEB) Evaluation
Our model achieves state-of-the-art performance on the MTEB leaderboard.
Model Name | Dimension | Sequence Length | Average (56) |
---|---|---|---|
ember-v1 | 1024 | 512 | 63.54 |
bge-large-en-v1.5 | 1024 | 512 | 63.23 |
bge-base-en-v1.5 | 768 | 512 | 63.05 |
text-embedding-ada-002 | 1536 | 8191 | 60.99 |
Limitation
This model supports English text only, and inputs longer than 512 tokens are truncated to the first 512 tokens.
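As a rough illustration of what the 512-token limit means in practice, the sketch below splits a token list into the part that is kept and the part that is dropped. It is a simplification under stated assumptions: the `split_at_limit` helper and the placeholder tokens are hypothetical, and the `num_special=2` default assumes a BERT-style tokenizer that adds two special tokens which also count toward the limit.

```python
MAX_TOKENS = 512  # limit stated in this section

def split_at_limit(tokens, max_tokens=MAX_TOKENS, num_special=2):
    """Return (kept, dropped) for a list of content tokens.

    num_special is an assumption: BERT-style tokenizers typically add
    two special tokens that count toward the limit.
    """
    budget = max_tokens - num_special
    return tokens[:budget], tokens[budget:]

# Hypothetical placeholder tokens stand in for real subword tokens.
tokens = [f"tok{i}" for i in range(600)]
kept, dropped = split_at_limit(tokens)

# Everything in `dropped` would have no influence on the embedding.
print(len(kept), len(dropped))
```

Text that falls past the limit does not affect the resulting embedding, so long documents may need to be shortened or split before encoding if their later content matters.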
License
MIT
Citation
@misc{nur2024emberv1,
title={ember-v1: SOTA embedding model},
author={Enrike Nur and Anar Aliyev},
year={2023},
}
Evaluation results
- accuracy on MTEB AmazonCounterfactualClassification (en), test set (self-reported): 76.060
- ap on MTEB AmazonCounterfactualClassification (en), test set (self-reported): 38.760
- f1 on MTEB AmazonCounterfactualClassification (en), test set (self-reported): 69.882
- accuracy on MTEB AmazonPolarityClassification, test set (self-reported): 91.977
- ap on MTEB AmazonPolarityClassification, test set (self-reported): 88.635
- f1 on MTEB AmazonPolarityClassification, test set (self-reported): 91.952
- accuracy on MTEB AmazonReviewsClassification (en), test set (self-reported): 47.938
- f1 on MTEB AmazonReviewsClassification (en), test set (self-reported): 47.583
- map_at_1 on MTEB ArguAna, test set (self-reported): 41.252
- map_at_10 on MTEB ArguAna, test set (self-reported): 56.567