jebadiah greenwood

Jebadiah

AI & ML interests

None yet

Recent Activity

updated a model 21 days ago
Jebadiah/Luna-dream-02

Organizations

Void

Jebadiah's activity

New activity in featherless-ai/try-this-model 5 months ago

jeiku/Aura-NeMo-12B

2
#2 opened 5 months ago by
Jebadiah
reacted to merve's post with 🚀 7 months ago
Post
3252
Forget any document retrievers, use ColPali 💥💥

Document retrieval is usually done through OCR + layout detection, but you lose a lot of information in between. Stop doing that! 🤓

ColPali uses a vision language model, which is better at document understanding 📑
ColPali: vidore/colpali (mit license!)
Blog post: https://huggingface.co./blog/manu/colpali
The authors also released a new benchmark for document retrieval:
ViDoRe Benchmark: vidore/vidore-benchmark-667173f98e70a1c0fa4db00d
ViDoRe Leaderboard: vidore/vidore-leaderboard

ColPali marries the idea of modern vision language models with retrieval 🤝

The authors apply contrastive fine-tuning to SigLIP on documents, and pool the outputs (they call it BiSigLip). Then they feed the patch embedding outputs to PaliGemma and create BiPali 🖇️
BiPali natively passes image patch embeddings to an LLM, which makes it possible to use ColBERT-like late interaction between text tokens and image patches (hence the name ColPali!) 🤩
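
To make the late interaction idea concrete, here is a minimal sketch of ColBERT-style MaxSim scoring in plain Python. It is not the authors' implementation: the function names (`late_interaction_score`, `rank_pages`) and the tiny 2-dimensional embeddings are made up for illustration, and a real system would use model-produced embeddings and batched tensor operations.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def late_interaction_score(query_tokens: List[Vector], page_patches: List[Vector]) -> float:
    """MaxSim: for each query token, keep its best-matching patch, then sum."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


def rank_pages(query_tokens: List[Vector], pages: List[List[Vector]]) -> List[int]:
    """Return page indices sorted from highest to lowest score."""
    scores = [late_interaction_score(query_tokens, patches) for patches in pages]
    return sorted(range(len(pages)), key=lambda i: scores[i], reverse=True)


# Hypothetical toy embeddings: 2 query tokens, pages with 3 and 2 patches.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
page_b = [[0.1, 0.1], [0.3, 0.2]]

print(rank_pages(query, [page_b, page_a]))
```

Each query token only needs one good patch somewhere on the page, which is why the score is computed per token rather than from a single pooled vector for the whole page.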

The authors created the ViDoRe benchmark by collecting PDF documents and generating queries with Claude-3 Sonnet.
ColPali seems to be the most performant model on ViDoRe. Not only that, but it is way faster than traditional PDF parsers too!
reacted to merve's post with 😎👀👍🧠🤯 7 months ago
Post
3595
EPFL and Apple (at @EPFL-VILAB) just released 4M-21: a single any-to-any model that can do anything from text-to-image generation to generating depth masks! 🙀
4M is a multimodal training framework introduced by Apple and EPFL.
The resulting model takes image and text as input and outputs image and text 🤩

Models: EPFL-VILAB/4m-models-660193abe3faf4b4d98a2742
Demo: EPFL-VILAB/4M
Paper: 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities (2406.09406)

This model consists of a transformer encoder and decoder, where the key to multimodality lies in the input and output data:

Input and output tokens are decoded to generate bounding boxes, the generated image's pixels, captions and more!

This model also learnt to generate canny maps, SAM edges and other things for steerable text-to-image generation 🖼️

The authors only added image-to-all capabilities for the demo, but you can try to use this model for text-to-image generation as well ☺️