Models
Datasets
Spaces
Posts
Docs
Enterprise
Pricing
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2306.07691

Paper - Multimodal

Paper related to Multimodal Model - Research for a : Modular, Multimodal, Multi-Stream, Mixture of Expert, Universal Transformer, Matryoshka embedding

Flowing from Words to Pixels: A Framework for Cross-Modality Evolution

Paper • 2412.15213 • Published Dec 19, 2024 • 26
No More Adam: Learning Rate Scaling at Initialization is All You Need

Paper • 2412.11768 • Published Dec 16, 2024 • 41
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

Paper • 2412.13663 • Published Dec 18, 2024 • 125
Autoregressive Video Generation without Vector Quantization

Paper • 2412.14169 • Published Dec 18, 2024 • 14

Zero-Shot Voice Cloning

TTS models that support zero-shot voice cloning

MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer

Paper • 2409.00750 • Published Sep 1, 2024 • 3
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

Paper • 2410.06885 • Published Oct 9, 2024 • 43
StyleTTS-ZS: Efficient High-Quality Zero-Shot Text-to-Speech Synthesis with Distilled Time-Varying Style Diffusion

Paper • 2409.10058 • Published Sep 16, 2024 • 2
E2 TTS: Embarrassingly Easy Fully Non-Autoregressive Zero-Shot TTS

Paper • 2406.18009 • Published Jun 26, 2024 • 21

How to Synthesize Text Data without Model Collapse?

Paper • 2412.14689 • Published Dec 19, 2024 • 48
SepLLM: Accelerate Large Language Models by Compressing One Segment into One Separator

Paper • 2412.12094 • Published Dec 16, 2024 • 10
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Paper • 2306.07691 • Published Jun 13, 2023 • 5
iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform

Paper • 2203.02395 • Published Mar 4, 2022

PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding

Paper • 2312.04461 • Published Dec 7, 2023 • 62
InstantID: Zero-shot Identity-Preserving Generation in Seconds

Paper • 2401.07519 • Published Jan 15, 2024 • 55
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

Paper • 2306.07691 • Published Jun 13, 2023 • 5

there's many more on arxiv if you search for CLAP

Large-scale Contrastive Language-Audio Pretraining with Feature Fusion and Keyword-to-Caption Augmentation

Paper • 2211.06687 • Published Nov 12, 2022 • 3
EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning

Paper • 2401.17690 • Published Jan 31, 2024 • 5
Amphion: An Open-Source Audio, Music and Speech Generation Toolkit

Paper • 2312.09911 • Published Dec 15, 2023 • 54
Audiobox: Unified Audio Generation with Natural Language Prompts

Paper • 2312.15821 • Published Dec 25, 2023 • 14

Company

TOS Privacy About Jobs

Website

Models Datasets Spaces Pricing Docs