---
base_model: unsloth/Qwen2-VL-7B-Instruct
tags:
- document-parsing
- information-extraction
- transformers
- unsloth
- qwen2_vl
license: apache-2.0
language:
- en
---

# VisionParser-VL-Expert
**Developed by:** Daemontatox
**Model type:** Fine-tuned vision-language model (VLM)
**Base model:** [unsloth/Qwen2-VL-7B-Instruct](https://huggingface.co./unsloth/Qwen2-VL-7B-Instruct)
**License:** apache-2.0
**Language:** English
**Tags:**
- document-parsing
- information-extraction
- vision-language
- unsloth
- qwen2_vl
## Model Description
VisionParser-VL-Expert is a fine-tuned version of [unsloth/Qwen2-VL-7B-Instruct](https://huggingface.co./unsloth/Qwen2-VL-7B-Instruct), designed specifically for document parsing and extraction tasks. It interprets and extracts structured data from images of documents such as invoices, forms, and reports.
Fine-tuning used [QLoRA](https://huggingface.co./docs/transformers/main_classes/peft#qlora) with [Unsloth](https://github.com/unslothai/unsloth) and the Hugging Face TRL library, enabling efficient training with minimal resource overhead. The model shows notable improvements in:
- Extracting textual information from visually complex layouts.
- Recognizing tabular and hierarchical data structures.
- Generating accurate and contextually rich text outputs for document understanding.
Datasets used include a combination of publicly available document datasets (e.g., FUNSD, DocVQA) and proprietary annotated data for domain-specific applications.
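In a QLoRA setup like the one described above, the base model's weights are quantized to 4-bit and frozen, and only small low-rank adapter matrices are trained. A minimal configuration sketch is below; the specific hyperparameters (rank, alpha, target modules) are illustrative assumptions, not the exact recipe used for this model:

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Low-rank adapters trained on top of the quantized weights
lora_config = LoraConfig(
    r=16,                 # adapter rank (illustrative value)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

These two configs would be passed to `from_pretrained` and `get_peft_model` respectively before training with TRL's trainer.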
## Intended Uses
VisionParser-VL-Expert is intended for:
- Extracting data from scanned documents, invoices, and forms.
- Parsing and analyzing structured layouts such as tables and charts.
- Generating textual summaries of visual content in documents.
- Supporting OCR systems by providing contextually enriched outputs.
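For extraction use cases like those above, it is common to prompt the model to answer in JSON and then parse the result. A small sketch of such a post-processing step (the model output string here is a made-up example, not real model output):

```python
import json
import re

def extract_json(generated_text: str) -> dict:
    """Pull the first JSON object out of a model's free-form output."""
    match = re.search(r"\{.*\}", generated_text, re.DOTALL)
    if not match:
        raise ValueError("no JSON object found in model output")
    return json.loads(match.group(0))

# Hypothetical model output for an invoice image
raw = 'Here are the key details:\n{"invoice_no": "INV-001", "total": "42.00"}'
fields = extract_json(raw)
print(fields["invoice_no"])  # INV-001
```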
## Limitations
While VisionParser-VL-Expert is powerful, it has certain limitations:
- May struggle with low-quality or heavily distorted images.
- Biases from training data might influence performance.
- Limited support for languages other than English.
- Performance can vary with highly complex or novel document layouts.
## How to Use
```python
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_name = "daemontatox/visionparser-vl-expert"  # Replace with the actual model name
processor = AutoProcessor.from_pretrained(model_name)
model = Qwen2VLForConditionalGeneration.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# Example usage with text and image
image = Image.open("path/to/your/document_image.jpg")  # Replace with your image path
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Extract key details from the document."},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt
generated_text = processor.batch_decode(outputs[:, inputs.input_ids.shape[1]:], skip_special_tokens=True)[0]
print(generated_text)
```
## Acknowledgements
Special thanks to the Unsloth team for their robust tools enabling efficient fine-tuning. This model was developed with the help of open-source libraries and community datasets.