---
base_model: unsloth/Qwen2-VL-7B-Instruct
tags:
- document-parsing
- information-extraction
- transformers
- unsloth
- qwen2_vl
license: apache-2.0
language:
- en
---

# VisionParser-VL-Expert
**Developed by:** Daemontatox
**Model type:** Fine-tuned vision-language model (VLM)
**Base model:** [unsloth/Qwen2-VL-7B-Instruct](https://huggingface.co./unsloth/Qwen2-VL-7B-Instruct)
**License:** apache-2.0
**Language:** English
**Tags:**
- document-parsing
- information-extraction
- vision-language
- unsloth
- qwen2_vl
## Model Description
VisionParser-VL-Expert is a fine-tuned version of [unsloth/Qwen2-VL-7B-Instruct](https://huggingface.co./unsloth/Qwen2-VL-7B-Instruct), designed specifically for document parsing and extraction tasks. It interprets and extracts structured data from images of documents such as invoices, forms, and reports.
Fine-tuning used [QLoRA](https://huggingface.co./docs/transformers/main_classes/peft#qlora) with [Unsloth](https://github.com/unslothai/unsloth) and the Hugging Face TRL library, enabling efficient training with minimal resource overhead. The model shows notable improvements in:
- Extracting textual information from visually complex layouts.
- Recognizing tabular and hierarchical data structures.
- Generating accurate and contextually rich text outputs for document understanding.
Datasets used include a combination of publicly available document datasets (e.g., FUNSD, DocVQA) and proprietary annotated data for domain-specific applications.
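In a QLoRA setup like the one described above, the base model's weights are quantized to 4-bit and frozen, and only small low-rank adapter matrices are trained. A minimal configuration sketch is below; the specific hyperparameters (rank, alpha, target modules) are illustrative assumptions, not the exact recipe used for this model:

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Low-rank adapters trained on top of the quantized weights
lora_config = LoraConfig(
    r=16,                 # adapter rank (illustrative value)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

These two configs would be passed to `from_pretrained` and `get_peft_model` respectively before training with TRL's trainer.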
## Intended Uses
VisionParser-VL-Expert is intended for:
- Extracting data from scanned documents, invoices, and forms.
- Parsing and analyzing structured layouts such as tables and charts.
- Generating textual summaries of visual content in documents.
- Supporting OCR systems by providing contextually enriched outputs.
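For extraction use cases like those above, it is common to prompt the model to answer in JSON and then parse the result. A small sketch of such a post-processing step (the model output string here is a made-up example, not real model output):

```python
import json
import re

def extract_json(generated_text: str) -> dict:
    """Pull the first JSON object out of a model's free-form output."""
    match = re.search(r"\{.*\}", generated_text, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))

# Hypothetical model output for an invoice image
raw = 'Here are the key details:\n{"invoice_no": "INV-001", "total": "42.00"}'
fields = extract_json(raw)
print(fields["invoice_no"])  # INV-001
```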
## Limitations
While VisionParser-VL-Expert is powerful, it has certain limitations:
- May struggle with low-quality or heavily distorted images.
- Biases from training data might influence performance.
- Limited support for languages other than English.
- Performance can vary with highly complex or novel document layouts.
## How to Use
```python
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_name = "daemontatox/visionparser-vl-expert"  # Replace with the actual model name
processor = AutoProcessor.from_pretrained(model_name)
model = Qwen2VLForConditionalGeneration.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# Example usage with text and image
image = Image.open("path/to/your/document_image.jpg")  # Replace with your image path
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Extract key details from the document."},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt
generated_text = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(generated_text)
```
## Acknowledgements
Special thanks to the Unsloth team for their robust tools enabling efficient fine-tuning. This model was developed with the help of open-source libraries and community datasets.