mLLM multilingual

AI & ML interests

Multilingual Datasets for everyone

Recent Activity

multilingual's activity

dvilasuero 
posted an update 19 days ago
view post
Post
2257
🌐 Announcing Global-MMLU: an improved MMLU Open dataset with evaluation coverage across 42 languages, built with Argilla and the Hugging Face community.

Global-MMLU is the result of months of work with the goal of advancing Multilingual LLM evaluation. It's been an amazing open science effort with collaborators from Cohere For AI, Mila - Quebec Artificial Intelligence Institute, EPFL, Massachusetts Institute of Technology, AI Singapore, National University of Singapore, KAIST, Instituto Superior Técnico, Carnegie Mellon University, CONICET, and University of Buenos Aires.

🏷️ +200 contributors used Argilla MMLU questions where regional, dialect, or cultural knowledge was required to answer correctly. 85% of the questions required Western-centric knowledge!

Thanks to this annotation process, the open dataset contains two subsets:

1. 🗽 Culturally Agnostic: no specific regional, cultural knowledge is required.
2. ⚖️ Culturally Sensitive: requires dialect, cultural knowledge or geographic knowledge to answer correctly.

Moreover, we provide high quality translations of 25 out of 42 languages, thanks again to the community and professional annotators leveraging Argilla on the Hub.

I hope this will ensure a better understanding of the limitations and challenges for making open AI useful for many languages.

Dataset: CohereForAI/Global-MMLU
dvilasuero 
posted an update about 1 month ago
tomaarsen 
posted an update about 1 month ago
view post
Post
5288
I just released Sentence Transformers v3.3.0 & it's huge! 4.5x speedup for CPU with OpenVINO int8 static quantization, training with prompts for a free perf. boost, PEFT integration, evaluation on NanoBEIR, and more! Details:

1. We integrate Post-Training Static Quantization using OpenVINO, a very efficient solution for CPUs that processes 4.78x as many texts per second on average, while only hurting performance by 0.36% on average. There's a new export_static_quantized_openvino_model method to quantize a model.

2. We add the option to train with prompts, e.g. strings like "query: ", "search_document: " or "Represent this sentence for searching relevant passages: ". It's as simple as using the prompts argument in SentenceTransformerTrainingArguments. Our experiments show that you can easily reach 0.66% to 0.90% relative performance improvement on NDCG@10 at no extra cost by adding "query: " before each training query and "document: " before each training answer.

3. Sentence Transformers now supports training PEFT adapters via 7 new methods for adding new adapters or loading pre-trained ones. You can also directly load a trained adapter with SentenceTransformer as if it's a normal model. Very useful for e.g. 1) training multiple adapters on 1 base model, 2) training bigger models than otherwise possible, or 3) cheaply hosting multiple models by switching multiple adapters on 1 base model.

4. We added easy evaluation on NanoBEIR, a subset of BEIR a.k.a. the MTEB Retrieval benchmark. It contains 13 datasets with 50 queries and up to 10k documents each. Evaluation is fast, and can easily be done during training to track your model's performance on general-purpose information retrieval tasks.

Additionally, we also deprecate Python 3.8, add better compatibility with Transformers v4.46.0, and more. Read the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.3.0
dvilasuero 
posted an update about 2 months ago
view post
Post
683
Build datasets for AI on the Hugging Face Hub—10x easier than ever!

Today, I'm excited to share our biggest feature since we joined Hugging Face.

Here’s how it works:

1. Pick a dataset—upload your own or choose from 240K open datasets.
2. Paste the Hub dataset ID into Argilla and set up your labeling interface.
3. Share the URL with your team or the whole community!

And the best part? It’s:
- No code – no Python needed
- Integrated – all within the Hub
- Scalable – from solo labeling to 100s of contributors

I am incredibly proud of the team for shipping this after weeks of work and many quick iterations.

Let's make this sentence obsolete: "Everyone wants to do the model work, not the data work."


Read, share, and like the HF blog post:
https://huggingface.co./blog/argilla-ui-hub
dvilasuero 
posted an update 2 months ago
view post
Post
989
Big news! You can now build strong ML models without days of human labelling

You simply:
- Define your dataset, including annotation guidelines, labels and fields
- Optionally label some records manually.
- Use an LLM to auto label your data with a human (you? your team?) in the loop!

Get started with this blog post:
https://huggingface.co./blog/sdiazlor/custom-text-classifier-ai-human-feedback
tomaarsen 
posted an update 3 months ago
view post
Post
6829
📣 Sentence Transformers v3.2.0 is out, marking the biggest release for inference in 2 years! 2 new backends for embedding models: ONNX (+ optimization & quantization) and OpenVINO, allowing for speedups up to 2x-3x AND Static Embeddings for 500x speedups at 10-20% accuracy cost.

1️⃣ ONNX Backend: This backend uses the ONNX Runtime to accelerate model inference on both CPU and GPU, reaching up to 1.4x-3x speedup depending on the precision. We also introduce 2 helper methods for optimizing and quantizing models for (much) faster inference.
2️⃣ OpenVINO Backend: This backend uses Intel their OpenVINO instead, outperforming ONNX in some situations on CPU.

Usage is as simple as SentenceTransformer("all-MiniLM-L6-v2", backend="onnx"). Does your model not have an ONNX or OpenVINO file yet? No worries - it'll be autoexported for you. Thank me later 😉

🔒 Another major new feature is Static Embeddings: think word embeddings like GLoVe and word2vec, but modernized. Static Embeddings are bags of token embeddings that are summed together to create text embeddings, allowing for lightning-fast embeddings that don't require any neural networks. They're initialized in one of 2 ways:

1️⃣ via Model2Vec, a new technique for distilling any Sentence Transformer models into static embeddings. Either via a pre-distilled model with from_model2vec or with from_distillation where you do the distillation yourself. It'll only take 5 seconds on GPU & 2 minutes on CPU, no dataset needed.
2️⃣ Random initialization. This requires finetuning, but finetuning is extremely quick (e.g. I trained with 3 million pairs in 7 minutes). My final model was 6.6% worse than bge-base-en-v1.5, but 500x faster on CPU.

Full release notes: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.2.0
Documentation on Speeding up Inference: https://sbert.net/docs/sentence_transformer/usage/efficiency.html
  • 1 reply
·
tomaarsen 
posted an update 3 months ago
view post
Post
1995
I've just shipped the Sentence Transformers v3.1.1 patch release, fixing the hard negatives mining utility for some models. This utility is extremely useful to get more performance out of your embedding training data.

⛏ Hard negatives are texts that are rather similar to some anchor text (e.g. a query), but are not the correct match. They're difficult for a model to distinguish from the correct answer, often resulting in a stronger model after training.
mine_hard_negatives docs: https://sbert.net/docs/package_reference/util.html#sentence_transformers.util.mine_hard_negatives

🔓 Beyond that, this release removes the numpy<2 restriction from v3.1.0. This was previously required for Windows as not all third-party libraries were updated to support numpy v2. With Sentence Transformers, you can now choose v1 or v2 of numpy.

Check out the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.1.1

I'm looking forward to releasing v3.2, I have some exciting things planned 🚀
dvilasuero 
posted an update 3 months ago
view post
Post
402
Explore FinePersonas, visually with Argilla and black-forest-labs/FLUX.1-schnell


Excited to share this space where the community can explore a tiny subset of FinePersonas

argilla/finepersonas


Dataset built with distilabel and Free Serveless endpoints

This is just a first step towards more interesting experiments with FinePersonas, for example can we use it to assess biases in text2image models?

If you have ideas I'd love to hear them in the comments!

tomaarsen 
posted an update 3 months ago
view post
Post
2037
🎉SetFit v1.1.0 is out! Training efficient classifiers on CPU or GPU now uses the Sentence Transformers Trainer, and we resolved a lot of issues caused by updates of third-party libraries (like Transformers). Details:

Training a SetFit classifier model consists of 2 phases:
1. Finetuning a Sentence Transformer embedding model
2. Training a Classifier to map embeddings -> classes

🔌The first phase now uses the SentenceTransformerTrainer that was introduced in the Sentence Transformers v3 update. This brings some immediate upsides like MultiGPU support, without any (intended) breaking changes.

➡️ Beyond that, we softly deprecated the "evaluation_strategy" argument in favor of "eval_strategy" (following a Transformers deprecation), and deprecated Python 3.7. In return, we add official support for Python 3.11 and 3.12.

✨ There's some more minor changes too, like max_steps and eval_max_steps now being a hard limit instead of an approximate one, training/validation losses now logging nicely in Notebooks, and the "device" parameter no longer being ignored in some situations.

Check out the full release notes here: https://github.com/huggingface/setfit/releases/tag/v1.1.0
Or read the documentation: https://huggingface.co./docs/setfit
Or check out the public SetFit models for inspiration: https://huggingface.co./models?library=setfit&sort=created

P.s. the model in the code snippet trained in 1 minute and it can classify ~6000 sentences per second on my GPU.
tomaarsen 
posted an update 3 months ago
view post
Post
3743
🚀 Sentence Transformers v3.1 is out! Featuring a hard negatives mining utility to get better models out of your data, a new strong loss function, training with streaming datasets, custom modules, bug fixes, small additions and docs changes. Here's the details:

⛏ Hard Negatives Mining Utility: Hard negatives are texts that are rather similar to some anchor text (e.g. a question), but are not the correct match. They're difficult for a model to distinguish from the correct answer, often resulting in a stronger model after training.
📉 New loss function: This loss function works very well for symmetric tasks (e.g. clustering, classification, finding similar texts/paraphrases) and a bit less so for asymmetric tasks (e.g. question-answer retrieval).
💾 Streaming datasets: You can now train with the datasets.IterableDataset, which doesn't require downloading the full dataset to disk before training. As simple as "streaming=True" in your "datasets.load_dataset".
🧩 Custom Modules: Model authors can now customize a lot more of the components that make up Sentence Transformer models, allowing for a lot more flexibility (e.g. multi-modal, model-specific quirks, etc.)
✨ New arguments to several methods: encode_multi_process gets a progress bar, push_to_hub can now be done to different branches, and CrossEncoders can be downloaded to specific cache directories.
🐛 Bug fixes: Too many to name here, check out the release notes!
📝 Documentation: A particular focus on clarifying the batch samplers in the Package Reference this release.

Check out the full release notes here ⭐: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.1.0

I'm very excited to hear your feedback, and I'm looking forward to the future changes that I have planned, such as ONNX inference! I'm also open to suggestions for new features: feel free to send me your ideas.
·
tomaarsen 
posted an update 6 months ago
view post
Post
3922
@Omartificial-Intelligence-Space has trained and released 6 Arabic embedding models for semantic similarity. 4 of them outperform all previous models on the STS17 Arabic-Arabic task!

📚 Trained on a large dataset of 558k Arabic triplets translated from the AllNLI triplet dataset: Omartificial-Intelligence-Space/Arabic-NLi-Triplet
6️⃣ 6 different base models: AraBERT, MarBERT, LaBSE, MiniLM, paraphrase-multilingual-mpnet-base, mpnet-base, ranging from 109M to 471M parameters.
🪆 Trained with a Matryoshka loss, allowing you to truncate embeddings with minimal performance loss: smaller embeddings are faster to compare.
📈 Outperforms all commonly used multilingual models like intfloat/multilingual-e5-large, sentence-transformers/paraphrase-multilingual-mpnet-base-v2, and sentence-transformers/LaBSE.

Check them out here:
- Omartificial-Intelligence-Space/Arabic-mpnet-base-all-nli-triplet
- Omartificial-Intelligence-Space/Arabic-all-nli-triplet-Matryoshka
- Omartificial-Intelligence-Space/Arabert-all-nli-triplet-Matryoshka
- Omartificial-Intelligence-Space/Arabic-labse-Matryoshka
- Omartificial-Intelligence-Space/Marbert-all-nli-triplet-Matryoshka
- Omartificial-Intelligence-Space/Arabic-MiniLM-L12-v2-all-nli-triplet
Or the collection with all: Omartificial-Intelligence-Space/arabic-matryoshka-embedding-models-666f764d3b570f44d7f77d4e

My personal favourite is likely Omartificial-Intelligence-Space/Arabert-all-nli-triplet-Matryoshka: a very efficient 135M parameters & scores #1 on mteb/leaderboard.
  • 1 reply
·
dvilasuero 
posted an update 6 months ago
view post
Post
8073
Today is a huge day in Argilla’s history. We couldn’t be more excited to share this with the community: we’re joining Hugging Face!

We’re embracing a larger mission, becoming part of a brilliant and kind team and a shared vision about the future of AI.

Over the past year, we’ve been collaborating with Hugging Face on countless projects: launching partner of Docker Spaces, empowering the community to clean Alpaca translations into Spanish and other languages, launching argilla/notus-7b-v1 building on Zephyr’s learnings, the Data is Better Together initiative with hundreds of community contributors, or releasing argilla/OpenHermesPreferences, one of the largest open preference tuning datasets

After more than 2,000 Slack messages and over 60 people collaborating for over a year, it already felt like we were part of the same team, pushing in the same direction. After a week of the smoothest transition you can imagine, we’re now the same team.

To those of you who’ve been following us, this won’t be a huge surprise, but it will be a big deal in the coming months. This acquisition means we’ll double down on empowering the community to build and collaborate on high quality datasets, we’ll bring full support for multimodal datasets, and we’ll be in a better place to collaborate with the Open Source AI community. For enterprises, this means that the Enterprise Hub will unlock highly requested features like single sign-on and integration with Inference Endpoints.

As a founder, I am proud of the Argilla team. We're now part of something bigger and a larger team but with the same values, culture, and goals. Grateful to have shared this journey with my beloved co-founders Paco and Amélie.

Finally, huge thanks to the Chief Llama Officer @osanseviero for sparking this and being such a great partner during the acquisition process.

Would love to answer any questions you have so feel free to add them below!
·
tomaarsen 
posted an update 7 months ago
view post
Post
3306
I just published Sentence Transformers v3.0.1: the first patch release since v3 from last week. It introduces gradient checkpointing, pushing model checkpoints to Hugging Face while training, model card improvements and fixes. Details:

1️⃣ Gradient checkpointing allows for much less memory usage at a cost of ~20% training speed. Seems to allow for higher batch sizes, which is quite important for loss functions with in-batch negatives.
2️⃣ You can specify args.push_to_hub=True and args.hub_model_id to upload your model checkpoints to Hugging Face while training. It also uploads your emissions (if codecarbon is installed) and your Tensorboard logs (if tensorboard is installed)
3️⃣ Model card improvements: improved automatic widget examples, better tags, and the default of "sentence_transformers_model_id" now gets replaced when possible.
4️⃣ Several evaluator fixes, see release notes for details.
5️⃣ Fixed a bug with MatryoshkaLoss throwing an error if the supplied Matryoshka dimensions are ascending instead of descending.
6️⃣ Full Safetensors support; even the uncommon modules can now save and load "model.safetensors" files: no more pickle risks.

Check out the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.0.1

And let me know what kind of features you'd like to see next! I have some plans already (ONNX, Sparse models, ColBERT, PEFT), but I don't yet know how I should prioritize everything.
·
tomaarsen 
posted an update 7 months ago
view post
Post
1982
‼️Sentence Transformers v3.0 is out! You can now train and finetune embedding models with multi-GPU training, bf16 support, loss logging, callbacks & much more. I also release 50+ datasets to train on.

1️⃣ Training Refactor
Embedding models can now be trained using an extensive trainer with a lot of powerful features:
- MultiGPU Training (Data Parallelism (DP) and Distributed Data Parallelism (DDP))
- bf16 training support; loss logging
- Evaluation datasets + evaluation loss
- Improved callback support + an excellent Weights & Biases integration
- Gradient checkpointing, gradient accumulation
- Improved model card generation
- Resuming from a training checkpoint without performance loss
- Hyperparameter Optimization
and much more!
Read my detailed blogpost to learn about the components that make up this new training approach: https://huggingface.co./blog/train-sentence-transformers

2️⃣ Similarity Score
Not sure how to compare embeddings? Don't worry, you can now use model.similarity(embeddings1, embeddings2) and you'll get your similarity scores immediately. Model authors can specify their desired similarity score, so you don't have to worry about it anymore!

3️⃣ Additional Kwargs
Sentence Transformers relies on various Transformers instances (AutoModel, AutoTokenizer, AutoConfig), but it was hard to provide valuable keyword arguments to these (like 'torch_dtype=torch.bfloat16' to load a model a lower precision for 2x inference speedup). This is now easy!

4️⃣ Hyperparameter Optimization
Sentence Transformers now ships with HPO, allowing you to effectively choose your hyperparameters for your data and task.

5️⃣ Dataset Release
To help you out with finetuning models, I've released 50+ ready-to-go datasets that can be used with training or finetuning embedding models: sentence-transformers/embedding-model-datasets-6644d7a3673a511914aa7552

Full release notes: https://github.com/UKPLab/sentence-transformers/releases/tag/v3.0.0
tomaarsen 
posted an update 8 months ago
view post
Post
2387
NuMind has just released 3 new state-of-the-art GLiNER models for Named Entity Recognition/Information Extraction. These GLiNER models allow you to specify any label that you want, and it'll find spans in the text corresponding to your label. It's been shown to work quite well on unusual domains, e.g. celestial entities in my picture.

There are 3 models released:
- numind/NuNER_Zero:
The primary model, SOTA & can detect really long entities.
- numind/NuNER_Zero-span:
Slightly better performance than NuNER Zero, but can't detect entities longer than 12 tokens.
- numind/NuNER_Zero-4k:
Slightly worse than NuNER Zero, but has a context length of 4k tokens.

Some more details about these models in general:
- They are *really* small, orders of magnitude smaller than LLMs, which don't reach this level of performance.
- Because they're small - they're fast: <1s per sentence on free GPUs.
- They have an MIT license: free commercial usage.

Try out the demo here: https://huggingface.co./spaces/numind/NuZero
Or check out all of the models here: numind/nunerzero-zero-shot-ner-662b59803b9b438ff56e49e2

If there's ever a need for me to extract some information from any text: I'll be using these. Great work @Serega6678 !
  • 3 replies
·
tomaarsen 
posted an update 8 months ago
view post
Post
2737
I've just stumbled upon some excellent work on (🇫🇷 French) retrieval models by @antoinelouis . Kudos to him!

- French Embedding Models: https://huggingface.co./collections/antoinelouis/dense-single-vector-bi-encoders-651523c0c75a3d4c44fc864d
- French Reranker Models: antoinelouis/cross-encoder-rerankers-651523f16efa656d1788a239
- French Multi-vector Models: https://huggingface.co./collections/antoinelouis/dense-multi-vector-bi-encoders-6589a8ee6b17c06872e9f075
- Multilingual Models: https://huggingface.co./collections/antoinelouis/modular-retrievers-65d53d0db64b1d644aea620c

A lot of these models use the MS MARCO Hard Negatives dataset, which I'm currently reformatting to be more easily usable. Notably, they should work out of the box without any pre-processing for training embedding models in the upcoming Sentence Transformers v3.
tomaarsen 
posted an update 8 months ago
view post
Post
3172
🚀 Sentence Transformers v2.7.0 is out! Featuring a new loss function, easier Matryoshka model inference & evaluation, CrossEncoder improvements & Intel Gaudi2 Accelerator support. Details:

1️⃣ A new loss function: CachedGISTEmbedLoss
This loss function is a combination of CachedMultipleNegativesRankingLoss and the GISTEmbedLoss, both of which are already excellent. The caching mechanism allows for much higher batch sizes with constant memory usage, which boosts training performance. The GIST part introduces a guide model to guide the in-batch negative sample selection. This prevents false negatives, resulting in a stronger training signal.

2️⃣ Automatic Matryoshka model truncation
Matryoshka models produce embeddings that are still useful after truncation. However, this truncation always had to be done manually, until now! We've added a truncate_dim option to the Sentence Transformer constructor. This also allows truncation when using HuggingFaceEmbeddings from LlamaIndex or LangChain.

3️⃣ Additionally, you can now specify truncate_dim in evaluators to get the performance after truncation. (Hint: it's surprisingly good, even for models not trained with MatryoshkaLoss, and it can speed up e.g. clustering, retrieval, etc.)

4️⃣ CrossEncoder improvements
The CrossEncoder now supports 'push_to_hub' to upload trained reranker models to Hugging Face. Additionally, CrossEncoders now support trust_remote_code to load models with custom modelling code.

5️⃣ Inference on Intel Gaudi2
If you have an Intel Gaudi2 Accelerator, Sentence Transformers now uses it automatically for even faster inference. No changes are necessary to your code, the device is automatically detected!

Check out the release notes for all of the details: https://github.com/UKPLab/sentence-transformers/releases/tag/v2.7.0

I'm very excited for the upcoming releases: I'm making great progress with a notable v3 refactor that should heavily improve the training process for embedding models!
  • 2 replies
·
aari1995 
posted an update 9 months ago
view post
Post
3281
ARABIC CHINESE FRENCH GERMAN RUSSIAN SPANISH TURKISH

mLLM - first release:
orca_dpo_pairs by Intel (translated into 7 languages)

ARABIC CHINESE FRENCH GERMAN RUSSIAN SPANISH TURKISH

Upcoming:
- more datasets
- cleaning steps
- a blogpost
- stay updated at https://hf.co/multilingual

multilingual/orca_dpo_pairs
·