Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks Paper • 2311.06242 • Published Nov 10, 2023 • 89
GIT Collection GIT (Generative Image-to-text Transformer) is a model useful for vision-language tasks such as image/video captioning and question answering. • 18 items • Updated 19 days ago • 11
UDOP Collection UDOP is a general multimodal model for document AI • 4 items • Updated 19 days ago • 24
Table Transformer Collection The Table Transformer (TATR) is a series of object detection models useful for table extraction from PDF images. • 5 items • Updated 19 days ago • 21
TAPEX Collection TAPEX is the state-of-the-art table pre-training models which can be used for table-based question answering and table-based fact verification. • 10 items • Updated 19 days ago • 9
SpeechT5 Collection The SpeechT5 framework consists of a shared seq2seq and six modal-specific (speech/text) pre/post-nets that can address a few audio-related tasks. • 8 items • Updated 19 days ago • 24
LayoutLM Collection The LayoutLM series are Transformer encoders useful for document AI tasks such as invoice parsing, document image classification and DocVQA. • 5 items • Updated 19 days ago • 15
Phi-3 Collection Phi-3 family of small language and multi-modal models. Language models are available in short- and long-context lengths. • 26 items • Updated 19 days ago • 548
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone Paper • 2404.14219 • Published Apr 22, 2024 • 255
DEsignBench: Exploring and Benchmarking DALL-E 3 for Imagining Visual Design Paper • 2310.15144 • Published Oct 23, 2023 • 14