---
license: apache-2.0
datasets:
- HuggingFaceM4/DocumentVQA
language:
- en
library_name: transformers
---
# Model Card for Florence-2-FT-DocVQA
This model card provides details about the Florence-2-FT-DocVQA model, which is fine-tuned for Document Visual Question Answering (VQA) tasks.
## Model Details

### Model Description
- **Developed by:** Mayank Chaudhary
- **Model type:** Vision-language model (loaded via `AutoModelForCausalLM`)
- **Language(s) (NLP):** English
- **License:** apache-2.0
- **Finetuned from model:** [microsoft/Florence-2-base-ft](https://huggingface.co/microsoft/Florence-2-base-ft)
The Florence-2-FT-DocVQA model is designed to handle Document VQA tasks, enabling automated question answering based on document images.
### Model Sources
- **Repository:** GitHub - FineTuning-VLMs
- **Paper:** [arXiv:2311.06242](https://arxiv.org/abs/2311.06242)
## Uses
The model can be further fine-tuned for specific Document VQA tasks or integrated into applications requiring automated document question answering.
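For further fine-tuning, a minimal training-loop sketch is shown below. It follows a common Florence-2 fine-tuning recipe; the DocumentVQA field names (`question`, `answers`, `image`), the batch size, and the learning rate are illustrative assumptions rather than the exact setup used to train this model.

```python
# Minimal fine-tuning sketch (illustrative, not the exact recipe used for this model).
import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoProcessor

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModelForCausalLM.from_pretrained(
    "mynkchaudhry/Florence-2-FT-DocVQA", trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(
    "mynkchaudhry/Florence-2-FT-DocVQA", trust_remote_code=True
)

train_split = load_dataset("HuggingFaceM4/DocumentVQA", split="train")

def collate_fn(batch):
    # Prefix each question with the same task prompt used at inference time.
    prompts = ["DocVQA" + example["question"] for example in batch]
    images = [example["image"].convert("RGB") for example in batch]
    # DocumentVQA stores a list of reference answers; take the first one.
    answers = [example["answers"][0] for example in batch]
    inputs = processor(text=prompts, images=images, return_tensors="pt", padding=True)
    labels = processor.tokenizer(answers, return_tensors="pt", padding=True).input_ids
    return inputs, labels

loader = DataLoader(train_split, batch_size=2, collate_fn=collate_fn, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

model.train()
for step, (inputs, labels) in enumerate(loader):
    outputs = model(
        input_ids=inputs["input_ids"].to(device),
        pixel_values=inputs["pixel_values"].to(device),
        labels=labels.to(device),
    )
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    if step >= 100:  # short demonstration run; train longer in practice
        break
```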
## Requirements
```bash
pip install datasets transformers torch Pillow
```
## How to Get Started with the Model
To get started with the Florence-2-FT-DocVQA model, you can use the following code:
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoProcessor
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Florence-2 ships custom modeling code, so trust_remote_code is required.
model = AutoModelForCausalLM.from_pretrained(
    "mynkchaudhry/Florence-2-FT-DocVQA", trust_remote_code=True
).to(device)
processor = AutoProcessor.from_pretrained(
    "mynkchaudhry/Florence-2-FT-DocVQA", trust_remote_code=True
)

data = load_dataset("HuggingFaceM4/DocumentVQA")

def run_example(task_prompt, text_input, image):
    prompt = task_prompt + text_input
    # The processor expects RGB images.
    if image.mode != "RGB":
        image = image.convert("RGB")
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    # Keep special tokens: post_process_generation needs them to parse the output.
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    parsed_answer = processor.post_process_generation(
        generated_text, task=task_prompt, image_size=(image.width, image.height)
    )
    return parsed_answer

for idx in range(3):
    print(run_example("DocVQA", "What do you see in this image?", data["train"][idx]["image"]))
```
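Each call should return a dictionary keyed by the task prompt (for example, `{'DocVQA': '...'}`). For more meaningful outputs, you can also pass the dataset's own `question` field as `text_input` instead of a fixed question.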