Learning Flow Fields in Attention for Controllable Person Image Generation • Paper • 2412.08486 • Published 14 days ago • 32
STIV: Scalable Text and Image Conditioned Video Generation • Paper • 2412.07730 • Published 15 days ago • 69
3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark • Paper • 2412.07825 • Published 14 days ago • 12
Mogo: RQ Hierarchical Causal Transformer for High-Quality 3D Human Motion Generation • Paper • 2412.07797 • Published 20 days ago • 11
LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations • Paper • 2412.08580 • Published 14 days ago • 44
SynCamMaster: Synchronizing Multi-Camera Video Generation from Diverse Viewpoints • Paper • 2412.07760 • Published 14 days ago • 49
VisionZip: Longer is Better but Not Necessary in Vision Language Models • Paper • 2412.04467 • Published 19 days ago • 104
Teach Multimodal LLMs to Comprehend Electrocardiographic Images • Paper • 2410.19008 • Published Oct 21 • 23
Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models • Paper • 2411.04996 • Published Nov 7 • 49
DimensionX: Create Any 3D and 4D Scenes from a Single Image with Controllable Video Diffusion • Paper • 2411.04928 • Published Nov 7 • 48
ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning • Paper • 2411.05003 • Published Nov 7 • 70
OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models • Paper • 2411.04905 • Published Nov 7 • 111
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents • Paper • 2410.23218 • Published Oct 30 • 46
SALSA: Soup-based Alignment Learning for Stronger Adaptation in RLHF • Paper • 2411.01798 • Published Nov 4 • 8
Decoding Dark Matter: Specialized Sparse Autoencoders for Interpreting Rare Concepts in Foundation Models • Paper • 2411.00743 • Published Nov 1 • 6
AutoVFX: Physically Realistic Video Editing from Natural Language Instructions • Paper • 2411.02394 • Published Nov 4 • 17
LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models • Paper • 2411.00918 • Published Nov 1 • 8
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance • Paper • 2411.02327 • Published Nov 4 • 11