Model Card for Fine-tuned Phi-3.5-Vision-Instruct
This model is a fine-tuned version of Microsoft's Phi-3.5-Vision-Instruct, optimized for visual question answering, in particular extracting item measurements from images. It was developed for the Amazon ML Challenge 2024 by Team Fambruh. The model was trained on datasets of real-world images, with a focus on recognizing measurements precisely, avoiding assumptions not grounded in the image, and converting shorthand units to their standard full forms.
Model Details
- Type: Vision-based question answering
- Language: English
- License: MIT
- Base Model: microsoft/Phi-3.5-vision-instruct
Uses
Direct Use
The model is designed for tasks involving visual analysis and measurement extraction, such as:
- Extracting product dimensions from images.
- Answering detailed questions based on image content.
Example task:
{
  "id": "1",
  "image": "image/0.jpg",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\n1. Carefully examine all visual elements of the image, including any text or numerical values that may be present. 2. Strictly avoid making assumptions or providing any fabricated information. Your answer must be fully grounded in the image's content. 3. Respond with 'NA' in output field if No Relevant Information is Present: If the image does not contain clear information regarding the detail of the item, output 'NA' in the output field. 4. If the image contains commonly recognized shorthand units like 'lb', 'mg', etc., convert them to their corresponding valid forms ('pound', 'milligram', etc.). Do not send 10cm, 10lb, etc. 5. what's the width of the item in image?"
    },
    {
      "from": "gpt",
      "value": "analysis: The image shows a black velvet hanger with measurements labeled. The height of the hanger is 9.06 inches. , output : 9.06 inch "
    }
  ]
}
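The response in the example task appears to follow an "analysis: ... , output : ..." layout. A minimal sketch of splitting such a response into its two parts is shown here, assuming that layout holds for every response (it is inferred from this single example). The name `parse_response` is hypothetical and not part of any released tooling for this model.

```python
import re
from typing import Optional, Tuple

# Assumed response layout, inferred from the one example in this card:
#   "analysis: <free text> , output : <answer>"
_RESPONSE_RE = re.compile(
    r"analysis\s*:\s*(?P<analysis>.*?)\s*,\s*output\s*:\s*(?P<output>.*)",
    re.IGNORECASE | re.DOTALL,
)


def parse_response(text: str) -> Optional[Tuple[str, str]]:
    """Split a response into (analysis, output); return None if it does not match."""
    match = _RESPONSE_RE.search(text)
    if match is None:
        return None
    return match.group("analysis").strip(), match.group("output").strip()
```

Returning `None` for text that does not match, rather than raising, lets a caller treat malformed responses the same way as missing answers.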
Out-of-Scope Use
The model is not intended for tasks that require deep semantic understanding or creative interpretation beyond what is visible in the image.
Bias, Risks, and Limitations
The model may struggle with:
- Unclear or text-heavy images.
- Generalization to images outside its training data.
Recommendations
Where precision is critical, users should verify the model's output against the source image, especially when exact measurements are required.
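Part of this verification can be automated with a format check on the output field. The sketch below accepts either 'NA' or a number followed by a unit, and expands shorthand units as the prompt instructs. The unit table is an illustrative assumption (this card does not list the valid units used in training), and `normalize_output` is a hypothetical helper name.

```python
import re
from typing import Optional

# Illustrative shorthand-to-full-name map. The actual set of valid units is not
# given in this card, so these entries (and their spellings) are assumptions.
UNIT_ALIASES = {
    "cm": "centimetre",
    "in": "inch",
    "lb": "pound",
    "mg": "milligram",
}
VALID_UNITS = set(UNIT_ALIASES.values())

# A non-negative decimal number, optional whitespace, then an alphabetic unit.
_MEASUREMENT_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*$")


def normalize_output(output: str) -> Optional[str]:
    """Return 'NA', or '<number> <full unit name>', or None if the text is malformed."""
    if output.strip() == "NA":
        return "NA"
    match = _MEASUREMENT_RE.match(output)
    if match is None:
        return None
    value, unit = match.group(1), match.group(2).lower()
    unit = UNIT_ALIASES.get(unit, unit)  # expand shorthand such as 'lb'
    if unit not in VALID_UNITS:
        return None
    return f"{value} {unit}"
```

A check like this only catches formatting problems. It cannot tell whether the number itself matches the image, so manual review is still needed for exact measurements.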
Training Data and Procedure
The model was fine-tuned on image datasets that emphasize product measurements, using fp16 mixed-precision training for efficient computation.
Model tree for vaibhavmeena/Phi-3.5-vision-instruct-amz-lora
Base model
microsoft/Phi-3.5-vision-instruct