Towards Diverse and Efficient Audio Captioning via Diffusion Models Paper • 2409.09401 • Published 6 days ago • 6
EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer Paper • 2409.10819 • Published 3 days ago • 11
Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think Paper • 2409.11355 • Published 3 days ago • 24
InstantDrag: Improving Interactivity in Drag-based Image Editing Paper • 2409.08857 • Published 7 days ago • 24
Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources Paper • 2409.08239 • Published 8 days ago • 15
IFAdapter: Instance Feature Control for Grounded Text-to-Image Generation Paper • 2409.08240 • Published 8 days ago • 14
VMAS: Video-to-Music Generation via Semantic Alignment in Web Music Videos Paper • 2409.07450 • Published 8 days ago • 10
Gated Slot Attention for Efficient Linear-Time Sequence Modeling Paper • 2409.07146 • Published 9 days ago • 18
SaRA: High-Efficient Diffusion Model Fine-tuning with Progressive Sparse Low-Rank Adaptation Paper • 2409.06633 • Published 10 days ago • 14
Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis Paper • 2409.06135 • Published 10 days ago • 14
Evaluating Multiview Object Consistency in Humans and Image Models Paper • 2409.05862 • Published 10 days ago • 8
Towards a Unified View of Preference Learning for Large Language Models: A Survey Paper • 2409.02795 • Published 16 days ago • 70
Qihoo-T2X: An Efficiency-Focused Diffusion Transformer via Proxy Tokens for Text-to-Any-Task Paper • 2409.04005 • Published 14 days ago • 16
Guide-and-Rescale: Self-Guidance Mechanism for Effective Tuning-Free Real Image Editing Paper • 2409.01322 • Published 18 days ago • 94
FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation Paper • 2409.02245 • Published 16 days ago • 9
Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency Paper • 2409.02634 • Published 16 days ago • 84
VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges Paper • 2409.01071 • Published 18 days ago • 26
DepthCrafter: Generating Consistent Long Depth Sequences for Open-world Videos Paper • 2409.02095 • Published 16 days ago • 32
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling Paper • 2408.16532 • Published 22 days ago • 44
CogVLM2: Visual Language Models for Image and Video Understanding Paper • 2408.16500 • Published 22 days ago • 55
TEDRA: Text-based Editing of Dynamic and Photoreal Actors Paper • 2408.15995 • Published 22 days ago • 4
Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation Paper • 2408.15239 • Published 23 days ago • 27
The Mamba in the Llama: Distilling and Accelerating Hybrid Models Paper • 2408.15237 • Published 23 days ago • 36
TVG: A Training-free Transition Video Generation Method with Diffusion Models Paper • 2408.13413 • Published 27 days ago • 13
Training-free Long Video Generation with Chain of Diffusion Model Experts Paper • 2408.13423 • Published 27 days ago • 19
SwiftBrush v2: Make Your One-step Diffusion Model Better Than Its Teacher Paper • 2408.14176 • Published 25 days ago • 58
CustomCrafter: Customized Video Generation with Preserving Motion and Concept Composition Abilities Paper • 2408.13239 • Published 28 days ago • 10
Building and better understanding vision-language models: insights and future directions Paper • 2408.12637 • Published 28 days ago • 109
Video-Foley: Two-Stage Video-To-Sound Generation via Temporal Event Condition For Foley Sound Paper • 2408.11915 • Published 29 days ago • 6
Real-Time Video Generation with Pyramid Attention Broadcast Paper • 2408.12588 • Published 28 days ago • 13
Show-o: One Single Transformer to Unify Multimodal Understanding and Generation Paper • 2408.12528 • Published 29 days ago • 50
xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations Paper • 2408.12590 • Published 28 days ago • 33
Controllable Text Generation for Large Language Models: A Survey Paper • 2408.12599 • Published 28 days ago • 61
FRAP: Faithful and Realistic Text-to-Image Generation with Adaptive Prompt Weighting Paper • 2408.11706 • Published 30 days ago • 5
TrackGo: A Flexible and Efficient Method for Controllable Video Generation Paper • 2408.11475 • Published 30 days ago • 16
TWLV-I: Analysis and Insights from Holistic Evaluation on Video Foundation Models Paper • 2408.11318 • Published about 1 month ago • 54
Audio Match Cutting: Finding and Creating Matching Audio Transitions in Movies and Videos Paper • 2408.10998 • Published about 1 month ago • 7
NeCo: Improving DINOv2's spatial representations in 19 GPU hours with Patch Neighbor Consistency Paper • 2408.11054 • Published about 1 month ago • 10
MegaFusion: Extend Diffusion Models towards Higher-resolution Image Generation without Further Tuning Paper • 2408.11001 • Published about 1 month ago • 11
LongVILA: Scaling Long-Context Visual Language Models for Long Videos Paper • 2408.10188 • Published Aug 19 • 51
Photorealistic Object Insertion with Diffusion-Guided Inverse Rendering Paper • 2408.09702 • Published Aug 19 • 9
TraDiffusion: Trajectory-Based Training-Free Image Generation Paper • 2408.09739 • Published Aug 19 • 7
Factorized-Dreamer: Training A High-Quality Video Generator with Limited and Low-Quality Data Paper • 2408.10119 • Published Aug 19 • 15
NeuFlow v2: High-Efficiency Optical Flow Estimation on Edge Devices Paper • 2408.10161 • Published Aug 19 • 11
xGen-MM (BLIP-3): A Family of Open Large Multimodal Models Paper • 2408.08872 • Published Aug 16 • 96
JPEG-LM: LLMs as Image Generators with Canonical Codec Representations Paper • 2408.08459 • Published Aug 15 • 44