Google just released PaliGemma 2 Mix: new versatile instruction vision language models 🔥
> Three new models: 3B, 10B, 28B with res 224, 448
> Can do vision language tasks with open-ended prompts, understand documents, and segment or detect anything 🤯
Multimodal
> OpenGVLab released InternVideo 2.5 Chat models, new video LMs with long context
> AIDC released the Ovis2 model family along with the Ovis dataset, new vision LMs in different sizes (1B, 2B, 4B, 8B, 16B, 34B), with video and OCR support
> ColQwenStella-2b is a multilingual visual retrieval model that is sota for its size
> Hoags-2B-Exp is a new multilingual vision LM with contextual reasoning and long-context video understanding
💬 LLMs
A lot of math models!
> Open-R1 team released OpenR1-Math-220k, a large-scale math reasoning dataset, along with OpenR1-Qwen-7B, a Qwen2.5 math model fine-tuned on the dataset
> Nomic AI released a new Nomic Embed multilingual retrieval model, a MoE with 500M params and 305M active params, outperforming other models
> DeepScaleR-1.5B-Preview is a new DeepSeek-R1-Distill fine-tune using distributed RL on math
> LIMO is a new fine-tune of Qwen2.5-32B-Instruct on math
🗣️ Audio
> Zonos-v0.1 is a new family of text-to-speech models, which contains the model itself and embeddings
🖼️ Vision and Image Generation
> We have ported Apple's DepthPro to transformers for your convenience!
> illustrious-xl-v1.0 is a new illustration generation model
Robotics
> Pi0, the first open-source foundation vision-language-action model, was released in LeRobot (Apache 2.0)
💬 LLMs
> Groundbreaking: s1 is a simpler approach to test-time scaling. The release comes with the small s1K dataset of 1k question-reasoning trace pairs (from Gemini-Thinking Exp), on which they fine-tune Qwen2.5-32B-Instruct to get s1-32B, outperforming o1-preview on math 🤯 s1-32B and s1K are out!
> Adyen released DABstep, a new benchmark along with its leaderboard demo for agents doing data analysis
> Krutrim released Krutrim-2 instruct, a new 12B model based on NeMo12B trained and aligned on Indic languages, a new multilingual sentence embedding model (based on STSB-XLM-R), and a translation model for Indic languages
Multimodal
> PKU released Align-DS-V, a model aligned using their new technique called LLF for all modalities (image-text-audio), along with the dataset Align Anything
> OLA-7B is a new any-to-any model by Tencent that can take text, image, video, and audio data with a context window of 32k tokens and output text and speech in English and Chinese
> Krutrim released Chitrarth, a new vision language model for Indic languages and English
🖼️ Vision
> BiRefNet_HR is a new higher resolution BiRefNet for background removal
🗣️ Audio
> kyutai released Hibiki, a real-time speech-to-speech translation model 🤯 It's available for French-English translation
> Krutrim released Dhwani, a new STT model for Indic languages
> They also released a new dataset for STT-TTS
🖼️ Image Generation
> Lumina released Lumina-Image-2.0, a 2B-parameter flow-based DiT for text-to-image generation
> Tencent released Hunyuan3D-2, a 3D asset generation model based on DiT and Hunyuan3D-Paint
> boreal-hl-v1 is a new boring photorealistic image generation LoRA based on Hunyuan
This week in open AI was 🔥 Let's recap!
merve/january-31-releases-679a10669bd4030090c5de4d
LLMs 💬
> Huge: AllenAI released new Tülu models, based on Llama 3.1 405B and trained with Reinforcement Learning with Verifiable Rewards (RLVR), that outperform DeepSeek V3 🔥
> Mistral AI is back to open-source with their "small" 24B models (base & SFT), with Apache 2.0 license 😱
> Alibaba Qwen released their 1M context length models Qwen2.5-Instruct-1M, great for agentic use, with Apache 2.0 license 🔥
> Arcee AI released Virtuoso-medium, 32.8B LLMs distilled from DeepSeek V3 with a dataset of 5B+ tokens
> Velvet-14B is a new family of 14B Italian LLMs trained on 10T tokens in six languages
> OpenThinker-7B is a fine-tuned version of Qwen2.5-7B-Instruct on the OpenThoughts dataset
VLMs & vision
> Alibaba Qwen is back with Qwen2.5VL, amazing new capabilities ranging from agentic computer use to zero-shot localization 🔥
> NVIDIA released a new series of Eagle2 models with 1B and 9B sizes
> DeepSeek released Janus-Pro, a new any-to-any model (image-text generation from image-text input) with MIT license
> BEN2 is a new background removal model with MIT license!
Audio 🗣️
> YuE is a new open-source music generation foundation model for lyrics-to-song generation
Finally, an open-source AI that turns your lyrics into full songs is here: meet YuE! Unlike other tools that only create short clips, YuE can make entire songs (up to 5 minutes) with vocals, melody, and instruments all working together. Letsss go!
Multimodal
- We have released SmolVLM, the tiniest VLMs, which come in 256M and 500M, with their retrieval models ColSmol for multimodal RAG
- UI-TARS are new models by ByteDance to unlock agentic GUI control 🤯 in 2B, 7B and 72B
- Alibaba DAMO lab released VideoLlama3, new video LMs that come in 2B and 7B
- MiniMaxAI released MiniMax-VL-01, where the decoder is based on the MiniMax-Text-01 456B MoE model with long context
- Dataset: Yale released a new benchmark called MMVU
- Dataset: CAIS released Humanity's Last Exam (HLE), a new challenging MM benchmark
LLMs
- DeepSeek-R1 & DeepSeek-R1-Zero: gigantic 671B reasoning models by DeepSeek, and six distilled dense models, on par with o1, with MIT license! 🤯
- Qwen2.5-Math-PRM: new math models by Qwen in 7B and 72B
- NVIDIA released AceMath and AceInstruct, a new family of models and their datasets (SFT and reward ones too!)
Audio 🗣️
- Llasa is a new speech synthesis model based on Llama that comes in 1B, 3B, and 8B
- TangoFlux is a new audio generation model trained from scratch and aligned with CRPO
Image/Video/3D Generation ⏯️
- Flex.1-alpha is a new 8B pre-trained diffusion model by ostris similar to Flux
- Tencent released Hunyuan3D-2, new 3D asset generation from images
smolagents can see 🔥 we just shipped vision support to smolagents, agentic computers FTW
you can now:
💻 let the agent get images dynamically (e.g. agentic web browser)
pass images at the init of the agent (e.g. chatting with documents, filling forms automatically etc.)
with only a few LoC changed! 🤯
you can use transformers models locally (like Qwen2VL) OR plug in your favorite multimodal inference provider (gpt-4o, Anthropic & co)
I am happy to release two new language models for the Italian Language!
💪 Gemma 2 9B Neogenesis ITA
anakin87/gemma-2-9b-neogenesis-ita
Building on the impressive work by VAGO Solutions, I applied Direct Preference Optimization with a mix of Italian and English data. Using Spectrum, I trained 20% of model layers.
Evaluated on the Open ITA LLM leaderboard (mii-llm/open_ita_llm_leaderboard), this model achieves strong performance. To beat it on this benchmark, you'd need a 27B model.
Gemma 2 2B Neogenesis ITA
anakin87/gemma-2-2b-neogenesis-ita
This smaller variant is fine-tuned from the original Gemma 2 2B it by Google. Through a combination of Supervised Fine-Tuning and Direct Preference Optimization, I trained 25% of the layers using Spectrum.
Compared to the original model, it shows improved Italian proficiency, good for its small size.
Multimodal
- MiniCPM-o 2.6 is a new sota any-to-any model by OpenBMB (vision, speech and text!)
- VideoChat-Flash-Qwen2.5-2B is part of a new family of video multimodal models by OpenGVLab that come in sizes 2B & 7B in resolutions 224 & 448
- ByteDance released a larger Sa2VA that comes in 26B parameters
- Dataset: VRC-Bench is a new diverse benchmark for multimodal LLM reasoning performance
💬 LLMs
- MiniMax-Text-01 is a new huge language model (456B total, 45.9B active params) by MiniMaxAI with a context length of 4M tokens 🤯
- Dataset: Sky-T1-data-17k is a diverse dataset used to train Sky-T1-32B
- kyutai released Helium-1-Preview-2B, a new small multilingual LM
- Wayfarer-12B is a new LLM able to write D&D 🧙🏻‍♂️
- ReaderLM-v2 is a new HTML parsing model by Jina AI
- Dria released Dria-Agent-a-3B, a new agentic coding model (Pythonic function calling) based on Qwen2.5 Coder
- Unsloth released Phi-4, faster and memory-efficient Llama 3.3
🖼️ Vision
- MatchAnything is a new foundation model for matching
- FitDit is a high-fidelity VTON model based on DiT architecture
🗣️ Audio
- OuteTTS-0.3-1B is a new multilingual text-to-speech model with voice cloning and emotion control capabilities
Retrieval
- lightblue released LB-reranker-0.5B-v1.0, a new reranker based on Qwen2.5 that can handle 95+ languages
- cde-small-v2 is a new sota small retrieval model by @jxm