---
base_model: unsloth/Qwen2-VL-7B-Instruct
tags:
- document-parsing
- information-extraction
- transformers
- unsloth
- qwen2_vl
license: apache-2.0
language:
- en
---

![image](./image.webp)

# VisionParser-VL-Expert

**Developed by:** Daemontatox

**Model Type:** Fine-tuned Vision-Language Model (VLM)

**Base Model:** [unsloth/Qwen2-VL-7B-Instruct](https://huggingface.co./unsloth/Qwen2-VL-7B-Instruct)

**Finetuned from model:** unsloth/Qwen2-VL-7B-Instruct

**License:** apache-2.0

**Languages:** en

**Tags:**
- document-parsing
- information-extraction
- vision-language
- unsloth
- qwen2_vl

## Model Description

VisionParser-VL-Expert is a fine-tuned version of [unsloth/Qwen2-VL-7B-Instruct](https://huggingface.co./unsloth/Qwen2-VL-7B-Instruct), designed specifically for document parsing and extraction tasks. It excels in interpreting and extracting structured data from images of documents, such as invoices, forms, and reports.

The finetuning process utilized [QLoRA](https://huggingface.co./docs/transformers/main_classes/peft#qlora) with [Unsloth](https://github.com/unslothai/unsloth) and the Hugging Face TRL library, enabling efficient training with minimal resource overhead. This model demonstrates significant improvements in:

- Extracting textual information from visually complex layouts.
- Recognizing tabular and hierarchical data structures.
- Generating accurate and contextually rich text outputs for document understanding.

Datasets used include a combination of publicly available document datasets (e.g., FUNSD, DocVQA) and proprietary annotated data for domain-specific applications.

## Intended Uses

VisionParser-VL-Expert is intended for:

- Extracting data from scanned documents, invoices, and forms.
- Parsing and analyzing structured layouts such as tables and charts.
- Generating textual summaries of visual content in documents.
- Supporting OCR systems by providing contextually enriched outputs.

## Limitations

While VisionParser-VL-Expert is powerful, it has certain limitations:

- May struggle with low-quality or heavily distorted images.
- Biases from training data might influence performance.
- Limited support for languages other than English.
- Performance can vary with highly complex or novel document layouts.

## How to Use

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "daemontatox/visionparser-vl-expert"  # Replace with the actual model name

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Example usage with text and image
prompt = "Extract key details from the document: "
image_path = "path/to/your/document_image.jpg"  # Replace with your image path

inputs = tokenizer(prompt, images=image_path, return_tensors="pt")
outputs = model.generate(**inputs)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
```

## Acknowledgements

Special thanks to the Unsloth team for their robust tools enabling efficient fine-tuning. This model was developed with the help of open-source libraries and community datasets.