Models
Datasets
Spaces
Posts
Docs
Pricing
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2408.16532

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6 • 25
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6 • 12
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7 • 38
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7 • 19

Foundation Models for Music: A Survey

Paper • 2408.14340 • Published Aug 26 • 40
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

Paper • 2408.16532 • Published Aug 29 • 46
FLUX that Plays Music

Paper • 2409.00587 • Published Sep 1 • 31
FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation

Paper • 2409.02245 • Published Sep 3 • 9

WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

Paper • 2408.16532 • Published Aug 29 • 46

FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs

Paper • 2407.04051 • Published Jul 4 • 35
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

Paper • 2408.16532 • Published Aug 29 • 46
PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing

Paper • 2409.10831 • Published Sep 17 • 4

SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound Generation

Paper • 2405.18503 • Published May 28 • 9
DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation

Paper • 2405.20289 • Published May 30 • 10
LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive Modeling of Audio Discrete Codes

Paper • 2406.02897 • Published Jun 5 • 13
Audio Mamba: Bidirectional State Space Model for Audio Representation Learning

Paper • 2406.03344 • Published Jun 5 • 18

MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training

Paper • 2311.17049 • Published Nov 28, 2023
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Paper • 2405.04434 • Published May 7 • 13
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision

Paper • 2303.17376 • Published Mar 30, 2023
Sigmoid Loss for Language Image Pre-Training

Paper • 2303.15343 • Published Mar 27, 2023 • 4

parler-tts/parler_tts_mini_v0.1

Text-to-Speech • Updated Apr 30 • 25.3k • 346
SpeechGuard: Exploring the Adversarial Robustness of Multimodal Large Language Models

Paper • 2405.08317 • Published May 14 • 9
Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities

Paper • 2405.18669 • Published May 29 • 11
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

Paper • 2406.02430 • Published Jun 4 • 29

To read... eventually

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training

Paper • 2403.09611 • Published Mar 14 • 124
Evolutionary Optimization of Model Merging Recipes

Paper • 2403.13187 • Published Mar 19 • 50
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model

Paper • 2402.03766 • Published Feb 6 • 12
LLM Agent Operating System

Paper • 2403.16971 • Published Mar 25 • 65

MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs

Paper • 2402.15627 • Published Feb 23 • 34
Rainbow Teaming: Open-Ended Generation of Diverse Adversarial Prompts

Paper • 2402.16822 • Published Feb 26 • 15
FuseChat: Knowledge Fusion of Chat Models

Paper • 2402.16107 • Published Feb 25 • 36
Multi-LoRA Composition for Image Generation

Paper • 2402.16843 • Published Feb 26 • 28

Company

© Hugging Face

TOS Privacy About Jobs

Website

Models Datasets Spaces Pricing Docs