Structured Generation from Images or Documents Using Vision Language Models

We will be using the SmolVLM-Instruct model from HuggingFaceTB to extract structured information from documents. We will run the VLM using the Hugging Face Transformers library and the Outlines library, which facilitates structured generation by constraining token sampling probabilities.

This approach is based on an Outlines tutorial.

Dependencies and imports

First, let’s install the necessary libraries.

%pip install accelerate outlines transformers torch flash-attn datasets sentencepiece

Let’s continue with importing the necessary libraries.

import outlines
import torch

from datasets import load_dataset
from outlines.models.transformers_vision import transformers_vision
from transformers import AutoModelForImageTextToText, AutoProcessor
from pydantic import BaseModel

Initialising our model

We will start by initialising our model from HuggingFaceTB/SmolVLM-Instruct. Outlines expects us to pass in a model class and processor class, so we will make this example a bit more generic by creating a function that returns those. Alternatively, you could look at the model and tokenizer config within the Hub repo files, and import those classes directly.

model_name = "HuggingFaceTB/SmolVLM-Instruct"


def get_model_and_processor_class(model_name: str):
    model = AutoModelForImageTextToText.from_pretrained(model_name)
    processor = AutoProcessor.from_pretrained(model_name)
    classes = model.__class__, processor.__class__
    # We only needed these instances to discover their classes, so free them.
    del model, processor
    return classes


model_class, processor_class = get_model_and_processor_class(model_name)
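
As mentioned above, you could instead look up the architecture in the repo's config.json and import the concrete classes directly. A minimal sketch, assuming SmolVLM-Instruct maps to the Idefics3 classes:

# Alternative to the helper above: import the concrete classes directly.
# SmolVLM-Instruct is Idefics3-based, so this is assumed to be equivalent.
from transformers import Idefics3ForConditionalGeneration, Idefics3Processor

model_class, processor_class = Idefics3ForConditionalGeneration, Idefics3Processor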

if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"

model = transformers_vision(
    model_name,
    model_class=model_class,
    device=device,
    model_kwargs={"torch_dtype": torch.bfloat16, "device_map": "auto"},
    processor_kwargs={"device": device},
    processor_class=processor_class,
)

Structured Generation

Now, we are going to define a Pydantic model that describes how the output of our model should be structured. We will be using the openbmb/RLAIF-V-Dataset, which contains a set of images along with questions and their chosen and rejected responses. This is an okay dataset, but we want to create additional text-image-to-text data on top of the images to get our own structured dataset, and potentially fine-tune our model on it. We will use the model to generate a caption, a question and a simple quality tag for each image.

class ImageData(BaseModel):
    quality: str
    description: str
    question: str


structured_generator = outlines.generate.json(model, ImageData)
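
Since the prompt below restricts the quality tag to "good", "okay" or "bad", you could also enforce this at the schema level so the sampler can only produce those values. A small optional variant, not used in the rest of this notebook:

from typing import Literal

from pydantic import BaseModel


class StrictImageData(BaseModel):
    # Generation is constrained to exactly these three tags.
    quality: Literal["good", "okay", "bad"]
    description: str
    question: str


strict_generator = outlines.generate.json(model, StrictImageData)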

Now, let’s come up with an extraction prompt.

prompt = """
You are an image analysis assistant.

Provide a quality tag, a description and a question.

The quality can either be "good", "okay" or "bad".
The question should be concise and objective.

Return your response as a valid JSON object.
""".strip()

Let’s load our image dataset.

dataset = load_dataset("openbmb/RLAIF-V-Dataset", split="train[:10]")
dataset
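
To get a feel for the data, we can peek at a single example. The exact columns may vary with the dataset version, but the extraction below only relies on the image column.

# Quick look at one example; we only use the "image" column later on.
example = dataset[0]
print(example.keys())
example["image"]  # displays the PIL image in a notebook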

Now, let’s define a function that will extract the structured information from the image. We will format the prompt using the apply_chat_template method and then pass it to the model along with the image.

def extract(row):
    messages = [
        {
            "role": "user",
            "content": [{"type": "image"}, {"type": "text", "text": prompt}],
        },
    ]

    formatted_prompt = model.processor.apply_chat_template(messages, add_generation_prompt=True)

    result = structured_generator(formatted_prompt, [row["image"]])
    row["synthetic_question"] = result.question
    row["synthetic_description"] = result.description
    row["synthetic_quality"] = result.quality
    return row


dataset = dataset.map(extract)
dataset
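
Before pushing, it is worth spot-checking a few of the generated fields, for example:

# Print the synthetic fields for the first few rows as a quick sanity check.
for row in dataset.select(range(3)):
    print(row["synthetic_quality"], "|", row["synthetic_question"])
    print(row["synthetic_description"])
    print("-" * 40)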

Let’s now push our new dataset to the Hub.

dataset.push_to_hub(
    "davidberenstein1957/structured-generation-information-extraction-vlms-openbmb-RLAIF-V-Dataset", split="train"
)

The results are not perfect, but they are a good starting point to continue exploring with different models and prompts!

Conclusion

We’ve seen how to extract structured information from documents using a vision language model. The same approach extends to PDF documents: use something like pdf2image to convert the document into page images and run the information extraction on each page image.

from pdf2image import convert_from_path

pdf_path = "path/to/your/pdf/file.pdf"
pages = convert_from_path(pdf_path)

# `extract_objects` stands in for an extraction function like `extract` above,
# adapted to take a PIL image and a prompt.
results = [extract_objects(page, prompt) for page in pages]

Next Steps

  • Take a look at the Outlines library for more information on how to use it. Explore the different methods and parameters.
  • Explore extraction on your own use case with your own model.
  • Use a different method of extracting structured information from documents.