-
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Paper • 2402.04252 • Published • 25 -
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
Paper • 2402.03749 • Published • 12 -
ScreenAI: A Vision-Language Model for UI and Infographics Understanding
Paper • 2402.04615 • Published • 38 -
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
Paper • 2402.05008 • Published • 19
Collections
Discover the best community collections!
Collections including paper arxiv:2403.18978
-
TextCraftor: Your Text Encoder Can be Image Quality Controller
Paper • 2403.18978 • Published • 13 -
InstantStyle: Free Lunch towards Style-Preserving in Text-to-Image Generation
Paper • 2404.02733 • Published • 20 -
OmniFusion Technical Report
Paper • 2404.06212 • Published • 74 -
Transferable and Principled Efficiency for Open-Vocabulary Segmentation
Paper • 2404.07448 • Published • 11
-
Improving Text-to-Image Consistency via Automatic Prompt Optimization
Paper • 2403.17804 • Published • 16 -
Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation
Paper • 2403.16990 • Published • 25 -
Getting it Right: Improving Spatial Consistency in Text-to-Image Models
Paper • 2404.01197 • Published • 30 -
Condition-Aware Neural Network for Controlled Image Generation
Paper • 2404.01143 • Published • 11
-
Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation
Paper • 2403.12015 • Published • 64 -
Be Yourself: Bounded Attention for Multi-Subject Text-to-Image Generation
Paper • 2403.16990 • Published • 25 -
SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions
Paper • 2403.16627 • Published • 20 -
TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models
Paper • 2403.17005 • Published • 13
-
CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows
Paper • 2107.00652 • Published • 2 -
Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering
Paper • 2403.09622 • Published • 16 -
Veagle: Advancements in Multimodal Representation Learning
Paper • 2403.08773 • Published • 7 -
mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality
Paper • 2304.14178 • Published • 2
-
PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models
Paper • 2312.13964 • Published • 18 -
LLM in a flash: Efficient Large Language Model Inference with Limited Memory
Paper • 2312.11514 • Published • 258 -
StreamDiffusion: A Pipeline-level Solution for Real-time Interactive Generation
Paper • 2312.12491 • Published • 69 -
LLaVA-φ: Efficient Multi-Modal Assistant with Small Language Model
Paper • 2401.02330 • Published • 14