-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 25 -
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 12 -
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 38 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 19
Collections
Discover the best community collections!
Collections including paper arxiv:2408.16532
-
Foundation Models for Music: A Survey
Paper • 2408.14340 • Published • 40 -
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
Paper • 2408.16532 • Published • 46 -
FLUX that Plays Music
Paper • 2409.00587 • Published • 31 -
FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation
Paper • 2409.02245 • Published • 9
-
FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
Paper • 2407.04051 • Published • 35 -
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling
Paper • 2408.16532 • Published • 46 -
PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing
Paper • 2409.10831 • Published • 4
-
SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation
Paper • 2405.18503 • Published • 9 -
DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation
Paper • 2405.20289 • Published • 10 -
LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes
Paper • 2406.02897 • Published • 13 -
Audio Mamba: Bidirectional State Space Model for Audio Representation Learning
Paper • 2406.03344 • Published • 18
-
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Paper • 2311.17049 • Published -
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Paper • 2405.04434 • Published • 13 -
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
Paper • 2303.17376 • Published -
Sigmoid Loss for Language Image Pre-Training
Paper • 2303.15343 • Published • 4
-
parler-tts/parler_tts_mini_v0.1
Text-to-Speech • Updated • 25.3k • 346 -
SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models
Paper • 2405.08317 • Published • 9 -
Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities
Paper • 2405.18669 • Published • 11 -
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
Paper • 2406.02430 • Published • 29
-
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Paper • 2403.09611 • Published • 124 -
Evolutionary Optimization of Model Merging Recipes
Paper • 2403.13187 • Published • 50 -
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
Paper • 2402.03766 • Published • 12 -
LLM Agent Operating System
Paper • 2403.16971 • Published • 65
-
MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs
Paper • 2402.15627 • Published • 34 -
Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts
Paper • 2402.16822 • Published • 15 -
FuseChat: Knowledge Fusion of Chat Models
Paper • 2402.16107 • Published • 36 -
Multi-LoRA Composition for Image Generation
Paper • 2402.16843 • Published • 28