Daniel Bram
keerekeerweere
0 followers · 1 following
AI & ML interests
None yet
Recent Activity
reacted to singhsidhukuldeep's post with 👀 about 2 months ago
Exciting breakthrough in multimodal search technology! @nvidia researchers have developed MM-Embed, a groundbreaking universal multimodal retrieval system that's changing how we think about search.

Key innovations:
• First-ever universal multimodal retriever that excels at both text and image searches across diverse tasks
• Leverages advanced multimodal LLMs to understand complex queries combining text and images
• Implements novel modality-aware hard negative mining to overcome modality bias issues
• Achieves state-of-the-art performance on M-BEIR benchmark while maintaining superior text retrieval capabilities

Under the hood: The system uses a sophisticated bi-encoder architecture with LLaVa-Next (based on Mistral 7B) as its backbone. It employs a unique two-stage training approach: first with random negatives, then with carefully mined hard negatives to improve cross-modal understanding.

The real magic happens in the modality-aware negative mining, where the system learns to distinguish between incorrect modality matches and unsatisfactory information matches, ensuring retrieved results match both content and format requirements.

What sets it apart is its ability to handle diverse search scenarios - from simple text queries to complex combinations of images and text, all while maintaining high accuracy across different domains
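The split between "incorrect modality" and "unsatisfactory information" negatives described in the post can be illustrated with a small sketch. This is not MM-Embed's actual mining procedure: the `Candidate` record, the `mine_hard_negatives` helper, and the rule of taking the top `k` non-positive candidates per group are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    """Hypothetical record for one retrieved item (not from the paper)."""
    doc_id: str
    modality: str       # e.g. "text" or "image"
    score: float        # similarity assigned by a first-stage retriever
    is_positive: bool   # True if this item is a labeled correct answer


def mine_hard_negatives(
    candidates: List[Candidate], target_modality: str, k: int = 2
) -> Dict[str, List[Candidate]]:
    """Group the highest-scoring non-positive candidates into two kinds.

    Assumed rule: a negative in a different modality than the query asks
    for is a "wrong_modality" negative; a negative in the requested
    modality is a "wrong_information" negative.
    """
    ranked = sorted(candidates, key=lambda c: c.score, reverse=True)
    wrong_modality: List[Candidate] = []
    wrong_information: List[Candidate] = []
    for cand in ranked:
        if cand.is_positive:
            continue  # positives are never used as negatives
        if cand.modality != target_modality:
            wrong_modality.append(cand)
        else:
            wrong_information.append(cand)
    return {
        "wrong_modality": wrong_modality[:k],
        "wrong_information": wrong_information[:k],
    }
```

In a two-stage setup like the one the post describes, negatives mined this way would replace the random negatives used in the first stage.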
reacted to singhsidhukuldeep's post with 🤗 about 2 months ago
Exciting breakthrough in Document AI! Researchers from UNC Chapel Hill and Bloomberg have developed M3DocRAG, a revolutionary framework for multi-modal document understanding.

The innovation lies in its ability to handle complex document scenarios that traditional systems struggle with:
- Process 40,000+ pages across 3,000+ documents
- Answer questions requiring information from multiple pages
- Understand visual elements like charts, tables, and figures
- Support both closed-domain (single document) and open-domain (multiple documents) queries

Under the hood, M3DocRAG operates through three sophisticated stages:

>> Document Embedding:
- Converts PDF pages to RGB images
- Uses ColPali to project both text queries and page images into a shared embedding space
- Creates dense visual embeddings for each page while maintaining visual information integrity

>> Page Retrieval:
- Employs MaxSim scoring to compute relevance between queries and pages
- Implements inverted file indexing (IVFFlat) for efficient search
- Reduces retrieval latency from 20s to under 2s when searching 40K+ pages
- Supports approximate nearest neighbor search via Faiss

>> Question Answering:
- Leverages Qwen2-VL 7B as the multi-modal language model
- Processes retrieved pages through a visual encoder
- Generates answers considering both textual and visual context

The results are impressive:
- State-of-the-art performance on MP-DocVQA benchmark
- Superior handling of non-text evidence compared to text-only systems
- Significantly better performance on multi-hop reasoning tasks

This is a game-changer for industries dealing with large document volumes: finance, healthcare, and legal sectors can now process documents more efficiently while preserving crucial visual context.
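The MaxSim scoring mentioned in the Page Retrieval stage can be sketched in a few lines. This is a minimal illustration, assuming the usual late-interaction definition (for each query token embedding, take the maximum dot product over a page's embeddings, then sum over query tokens). The function names `maxsim_score` and `rank_pages` are hypothetical, and the sketch scores every page exhaustively instead of using the IVFFlat index or Faiss.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens: Sequence[Vector], page_patches: Sequence[Vector]) -> float:
    """Sum, over query token embeddings, of the best match on the page."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


def rank_pages(
    query_tokens: Sequence[Vector],
    pages: Sequence[Sequence[Vector]],
    top_k: int = 2,
) -> List[int]:
    """Return indices of the top_k pages by MaxSim score (exhaustive scan)."""
    scored = [
        (maxsim_score(query_tokens, patches), idx)
        for idx, patches in enumerate(pages)
    ]
    # Highest score first; ties are broken by the lower page index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:top_k]]


if __name__ == "__main__":
    # Toy 2-dimensional embeddings, invented purely for illustration.
    query = [[1.0, 0.0], [0.0, 1.0]]
    pages = [
        [[1.0, 0.0], [0.0, 1.0]],
        [[0.5, 0.5]],
        [[1.0, 0.0], [0.0, 0.2]],
    ]
    print(rank_pages(query, pages, top_k=2))
```

An exhaustive scan like this grows linearly with the number of pages, which is presumably why the post highlights approximate nearest neighbor search for the 40K+ page setting.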
upvoted an article 9 months ago
Welcome Llama 3 - Meta's new open LLM
Organizations
None yet
keerekeerweere's activity
upvoted an article 9 months ago
Welcome Llama 3 - Meta's new open LLM
Apr 18, 2024 • 282