--- base_model: unsloth/Qwen2-VL-7B-Instruct tags: - document-parsing - information-extraction - transformers - unsloth - qwen2_vl license: apache-2.0 language: - en --- ![image](./image.webp) # VisionParser-VL-Expert **Developed by:** Daemontatox **Model Type:** Fine-tuned Vision-Language Model (VLM) **Base Model:** [unsloth/Qwen2-VL-7B-Instruct](https://huggingface.co./unsloth/Qwen2-VL-7B-Instruct) **Finetuned from model:** unsloth/Qwen2-VL-7B-Instruct **License:** apache-2.0 **Languages:** en **Tags:** - document-parsing - information-extraction - vision-language - unsloth - qwen2_vl ## Model Description VisionParser-VL-Expert is a fine-tuned version of [unsloth/Qwen2-VL-7B-Instruct](https://huggingface.co./unsloth/Qwen2-VL-7B-Instruct), designed specifically for document parsing and extraction tasks. It excels in interpreting and extracting structured data from images of documents, such as invoices, forms, and reports. The finetuning process utilized [QLoRA](https://huggingface.co./docs/transformers/main_classes/peft#qlora) with [Unsloth](https://github.com/unslothai/unsloth) and the Hugging Face TRL library, enabling efficient training with minimal resource overhead. This model demonstrates significant improvements in: - Extracting textual information from visually complex layouts. - Recognizing tabular and hierarchical data structures. - Generating accurate and contextually rich text outputs for document understanding. Datasets used include a combination of publicly available document datasets (e.g., FUNSD, DocVQA) and proprietary annotated data for domain-specific applications. ## Intended Uses VisionParser-VL-Expert is intended for: - Extracting data from scanned documents, invoices, and forms. - Parsing and analyzing structured layouts such as tables and charts. - Generating textual summaries of visual content in documents. - Supporting OCR systems by providing contextually enriched outputs. ## Limitations While VisionParser-VL-Expert is powerful, it has certain limitations: - May struggle with low-quality or heavily distorted images. - Biases from training data might influence performance. - Limited support for languages other than English. - Performance can vary with highly complex or novel document layouts. ## How to Use ```python from transformers import AutoModelForCausalLM, AutoTokenizer model_name = "daemontatox/visionparser-vl-expert" # Replace with the actual model name tokenizer = AutoTokenizer.from_pretrained(model_name) model = AutoModelForCausalLM.from_pretrained(model_name) # Example usage with text and image prompt = "Extract key details from the document: " image_path = "path/to/your/document_image.jpg" # Replace with your image path inputs = tokenizer(prompt, images=image_path, return_tensors="pt") outputs = model.generate(**inputs) generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True) print(generated_text) ``` ## Acknowledgements Special thanks to the Unsloth team for their robust tools enabling efficient fine-tuning. This model was developed with the help of open-source libraries and community datasets.