MiniCPM-o 2.6 🔥 an end-side multimodal LLM released by OpenBMB from the Chinese community
Model: openbmb/MiniCPM-o-2_6
✨ Real-time English/Chinese conversation, emotion control and ASR/STT
✨ Real-time video/audio understanding
✨ Processes up to 1.8M pixels, leads OCRBench & supports 30+ languages
Multimodal 🖼️
> ByteDance released SA2VA: a family of vision LMs that can take image, video, text and visual prompts
> moondream2 is out with new capabilities like outputting structured data and gaze detection!
> Dataset: Alibaba DAMO lab released a multimodal textbook with 22k hours' worth of samples from instruction videos 🤯
> Dataset: SciCap, a benchmark dataset for captioning scientific documents, is released along with its challenge!
Embeddings
> @MoritzLaurer released a zero-shot version of ModernBERT Large
> KaLM is a new family of performant multilingual embedding models with an MIT license, built on Qwen2-0.5B
Image/Video Generation ⏯️
> NVIDIA released Cosmos, a new family of diffusion/autoregressive World Foundation Models that generate worlds from images, videos and text 🔥
> Adobe released TransPixar: a new text-to-video model that can generate assets with transparent backgrounds (a first!)
> Dataset: fal released cosmos-openvid-1m, a Cosmos-tokenized version of OpenVid-1M
Others
> Prior Labs released TabPFNv2, the best tabular transformer for classification and regression
> Metagene-1 is a new RNA language model that can be used for pathogen detection, zero-shot embedding and genome understanding