Structured Generation from Images or Documents Using Vision Language Models
We will be using the SmolVLM-Instruct model from HuggingFaceTB to extract structured information from documents. We will run the VLM using the Hugging Face Transformers library together with the Outlines library, which facilitates structured generation by constraining token sampling probabilities.
This approach is based on an Outlines tutorial.
Dependencies and imports
First, let’s install the necessary libraries.
%pip install accelerate outlines transformers torch flash-attn datasets sentencepiece
Let’s continue by importing the necessary libraries.
import outlines
import torch
from datasets import load_dataset
from outlines.models.transformers_vision import transformers_vision
from transformers import AutoModelForImageTextToText, AutoProcessor
from pydantic import BaseModel
Initialising our model
We will start by initialising our model from HuggingFaceTB/SmolVLM-Instruct. Outlines expects us to pass in a model class and processor class, so we will make this example a bit more generic by creating a function that returns those. Alternatively, you could look at the model and tokenizer config within the Hub repo files, and import those classes directly.
model_name = "HuggingFaceTB/SmolVLM-Instruct"
def get_model_and_processor_class(model_name: str):
    model = AutoModelForImageTextToText.from_pretrained(model_name)
    processor = AutoProcessor.from_pretrained(model_name)
    classes = model.__class__, processor.__class__
    del model, processor
    return classes
model_class, processor_class = get_model_and_processor_class(model_name)
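As mentioned above, you could alternatively look up the concrete classes in the Hub repo files and import them directly. Below is a minimal sketch of that alternative, assuming the architectures listed in the config (the Idefics3 classes at the time of writing); verify against the repo files before relying on it.

# Alternative sketch: import the concrete classes directly after checking the
# model and processor configs in the Hub repo files. For SmolVLM-Instruct these
# resolve to the Idefics3 classes, but confirm against the repo before use.
# from transformers import Idefics3ForConditionalGeneration, Idefics3Processor
# model_class, processor_class = Idefics3ForConditionalGeneration, Idefics3Processor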
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
model = transformers_vision(
    model_name,
    model_class=model_class,
    device=device,
    model_kwargs={"torch_dtype": torch.bfloat16, "device_map": "auto"},
    processor_kwargs={"device": device},
    processor_class=processor_class,
)
Structured Generation
Now we are going to define how the output of our model should be structured, using a Pydantic model. We will be using the openbmb/RLAIF-V-Dataset, which contains a set of images along with questions and their chosen and rejected responses. This is a decent dataset, but we want to create additional text-image-to-text data on top of the images to get our own structured dataset, and potentially fine-tune our model on it. We will use the model to generate a caption, a question, and a simple quality tag for each image.
class ImageData(BaseModel):
    quality: str
    description: str
    question: str
structured_generator = outlines.generate.json(model, ImageData)
Now, let’s come up with an extraction prompt.
prompt = """
You are an image analysis assistant.
Provide a quality tag, a description and a question.
The quality can either be "good", "okay" or "bad".
The question should be concise and objective.
Return your response as a valid JSON object.
""".strip()
Let’s load our image dataset.
dataset = load_dataset("openbmb/RLAIF-V-Dataset", split="train[:10]")
dataset
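To get a feel for the columns we will build on, here is a small, purely illustrative inspection of the first row; the image, question, chosen and rejected fields come straight from the dataset.

# Peek at the first example: an image plus a question and its chosen/rejected responses
example = dataset[0]
print(example["question"])
example["image"]  # a PIL image, rendered inline in a notebook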
Now, let’s define a function that extracts the structured information from an image. We format the prompt using the apply_chat_template
method and then pass it to the model along with the image.
def extract(row):
    messages = [
        {
            "role": "user",
            "content": [{"type": "image"}, {"type": "text", "text": prompt}],
        },
    ]
    formatted_prompt = model.processor.apply_chat_template(messages, add_generation_prompt=True)
    result = structured_generator(formatted_prompt, [row["image"]])
    row["synthetic_question"] = result.question
    row["synthetic_description"] = result.description
    row["synthetic_quality"] = result.quality
    return row
dataset = dataset.map(extract)
dataset
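As a quick, purely illustrative sanity check, you can print the synthetic fields that were just added for the first example.

# Inspect the synthetic annotations generated for the first example
print(dataset[0]["synthetic_quality"])
print(dataset[0]["synthetic_question"])
print(dataset[0]["synthetic_description"])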
Let’s now push our new dataset to the Hub.
dataset.push_to_hub(
    "davidberenstein1957/structured-generation-information-extraction-vlms-openbmb-RLAIF-V-Dataset", split="train"
)
The results are not perfect, but they are a good starting point to continue exploring with different models and prompts!
Conclusion
We’ve seen how to extract structured information from documents using a vision language model. We can apply the same extractive approach to PDFs, using something like pdf2image
to convert the document to images and running the information extraction on each page image.
from pdf2image import convert_from_path

pdf_path = "path/to/your/pdf/file.pdf"
pages = convert_from_path(pdf_path)
for page in pages:
    # extract_objects is a hypothetical helper wrapping the extraction logic shown above
    extracted_objects = extract_objects(page, prompt)
Next Steps
- Take a look at the Outlines library for more information on how to use it. Explore the different methods and parameters.
- Explore extraction on your own use case with your own model.
- Use a different method of extracting structured information from documents.