Multimodal 🖼️
> ByteDance released SA2VA: a family of vision LMs that can take image, video, text and visual prompts
> moondream2 is out with new capabilities like outputting structured data and gaze detection!
> Dataset: Alibaba DAMO lab released Multimodal Textbook, 22k hours' worth of samples from instructional videos 🤯
> Dataset: SciCap, a benchmark dataset for captioning figures in scientific documents, is released along with the challenge!
LLMs 💬
> Microsoft released Phi-4, a SOTA open-source 14B language model 🔥
> Dolphin is back with Dolphin 3.0 Llama 3.1 8B 🐬🐬
> Prime-RL released Eurus-2-7B-PRIME, a new language model trained using the PRIME alignment method
> SmallThinker-3B is a new small reasoning LM based on Qwen2.5-3B-Instruct 💭
> Dataset: QWQ-LONGCOT-500K is the dataset used to train SmallThinker, generated using QwQ-32B-preview 📕
> Dataset: @cfahlgren1 released React Code Instructions: a dataset of instruction-code pairs 📕
> Dataset: the Qwen team is on a roll, they just released CodeElo, a benchmark dataset for competition-level code generation 👩🏻‍💻
Embeddings 🔖
> @MoritzLaurer released a zero-shot version of ModernBERT large 👏
> KaLM is a new family of performant multilingual embedding models built on Qwen2-0.5B, released under an MIT license
Image/Video Generation ⏯️
> NVIDIA released Cosmos, a new family of diffusion/autoregressive World Foundation Models that generate worlds from images, videos and text 🔥
> Adobe released TransPixar: a new text-to-video model that can generate assets with transparent backgrounds (a first!)
> Dataset: fal released cosmos-openvid-1m, samples from OpenVid-1M tokenized with the Cosmos tokenizer
Others
> Prior Labs released TabPFNv2: the best tabular transformer is out, for both classification and regression
> Metagene-1 is a new RNA language model that can be used for pathogen detection, zero-shot embedding and genome understanding
> The SA2VA models can handle vision-language understanding and visual referral (referring segmentation) tasks for both images and videos ⏯️
> The models come in 1B, 4B and 8B sizes, built on InternVL2.5 as the base architecture with Qwen2, Qwen2.5 or InternLM2 as the language model part, depending on the checkpoint (see the loading sketch after this list)
> The model is very interesting: it has a separate encoder for each modality (visual prompt, text prompt, image and video), then concatenates their outputs to feed into the LLM 💬
> The output segmentation tokens are then passed to SAM2 to match text (captions or semantic classes) to masks (see the flow sketch below) ⤵️
> Their annotation pipeline is also interesting: they seem to use two open large vision LMs to refine the annotations, and produce descriptions at different levels of detail to provide consistency (a toy sketch of such a refinement step follows below).
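
For the checkpoint sizes mentioned above, loading should look roughly like the usual trust_remote_code flow in 🤗 transformers. This is a minimal sketch, assuming the repo ID and the predict_forward helper exposed by the remote code; check the actual model card before relying on it.

```python
# Minimal sketch, not the official snippet: the repo ID and predict_forward()
# helper are assumptions based on typical trust_remote_code model cards.
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

model_id = "ByteDance/Sa2VA-4B"  # assumed repo ID; 1B and 8B variants exist too
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # the checkpoint ships custom multimodal code
).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("street.jpg").convert("RGB")
# Assumed helper from the remote code: returns generated text plus masks
# for referring-segmentation style prompts.
result = model.predict_forward(
    image=image,
    text="<image>Please segment the person crossing the street.",
    tokenizer=tokenizer,
)
print(result["prediction"])
```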
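
To make the encode-concatenate-LLM-SAM2 path above concrete, here is a hypothetical pseudocode sketch of the data flow; every name in it is illustrative, not taken from the actual implementation.

```python
# Hypothetical sketch of the Sa2VA-style flow: per-modality encoders, one
# concatenated sequence into the LLM, and [SEG] hidden states handed to SAM2.
# All names and shapes are illustrative.
import torch

def sa2va_style_forward(inputs, encoders, llm, sam2, seg_token_id):
    # 1. Each modality (image, video, text prompt, visual prompt) has its own
    #    encoder producing a sequence of token embeddings.
    token_seqs = [encoders[name](x) for name, x in inputs.items() if x is not None]

    # 2. Concatenate all modality tokens into a single sequence for the LLM.
    llm_input = torch.cat(token_seqs, dim=1)  # (batch, total_tokens, hidden)

    # 3. The LLM generates a response that can contain special [SEG] tokens.
    output_ids, hidden_states = llm.generate(inputs_embeds=llm_input)

    # 4. Hidden states at the [SEG] positions act as prompt embeddings for SAM2,
    #    which decodes them into masks tied to the text (captions or classes).
    seg_embeds = hidden_states[output_ids == seg_token_id]
    media = inputs.get("image") if inputs.get("image") is not None else inputs.get("video")
    masks = sam2.decode(media=media, prompt_embeddings=seg_embeds)
    return output_ids, masks
```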
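
And for the annotation pipeline, a toy sketch of what a two-VLM refinement step could look like; both model calls are placeholders, not their actual pipeline.

```python
# Toy sketch of a two-VLM annotation refinement step; caption_vlm() and
# judge_vlm() are placeholders for two open vision LMs, not the authors' code.
def refine_annotation(frame, draft_caption, caption_vlm, judge_vlm):
    # One VLM re-describes the frame independently of the draft annotation.
    fresh_caption = caption_vlm(frame, prompt="Describe this frame in detail.")

    # A second VLM reconciles the two descriptions into consistent annotations
    # at different levels of detail (short label vs. detailed caption).
    merged = judge_vlm(
        frame,
        prompt=(
            "Merge these descriptions into one consistent annotation.\n"
            f"Draft: {draft_caption}\nFresh: {fresh_caption}\n"
            "Return a short label and a detailed caption."
        ),
    )
    return merged
```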