Model Card for Fine-tuned Phi-3.5-Vision-Instruct

This model is a fine-tuned version of Microsoft's Phi-3.5-Vision-Instruct for visual question answering, in particular for extracting item measurements from images. It was developed for the Amazon ML Challenge 2024 by Team Fambruh. Training used real-world images and emphasized reading measurements precisely, answering only from what the image shows, and converting shorthand units to their full names.

Model Details

Type: Vision-based question answering
Language: English
License: MIT
Base Model: microsoft/Phi-3.5-vision-instruct

Uses

Direct Use

The model is designed for tasks involving visual analysis and measurement extraction, such as:

Extracting product dimensions from images.
Answering detailed questions based on image content.

Example task:

{
  "id": "1",
  "image": "image/0.jpg",
  "conversations": [
    {
      "from": "human",
      "value": "<image>\n1. Carefully examine all visual elements of the image, including any text or numerical values that may be present.  2. Strictly avoid making assumptions or providing any fabricated information. Your answer must be fully grounded in the image's content.  3. Respond with 'NA' in output field if No Relevant Information is Present: If the image does not contain clear information regarding the detail of the item, output 'NA' in the output field. 4. If the image contains commonly recognized shorthand units like 'lb', 'mg', etc., convert them to their corresponding valid forms ('pound', 'milligram', etc.). Do not send 10cm, 10lb, etc. 5. what's the width of the item in image?"
    },
    {
      "from": "gpt",
      "value": "analysis: The image shows a black velvet hanger with measurements labeled. The height of the hanger is 9.06 inches. , output : 9.06 inch "
    }
  ]
}
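Downstream code usually needs only the final measurement, not the free-text analysis. The following is a minimal sketch for splitting a response into its two parts, assuming responses follow the "analysis: ... , output : ..." layout of the example above. The names `RESPONSE_RE` and `parse_response` are hypothetical and not part of the model or any library.

```python
import re

# Assumed layout, taken from the example record:
#   "analysis: <free text> , output : <value> <unit>"
RESPONSE_RE = re.compile(
    r"analysis\s*:\s*(?P<analysis>.*?)\s*,\s*output\s*:\s*(?P<output>.*)",
    re.IGNORECASE | re.DOTALL,
)


def parse_response(text):
    """Split a model response into (analysis, output).

    If the text does not match the assumed layout, the whole text is
    returned as the analysis and the output is None.
    """
    match = RESPONSE_RE.search(text)
    if match is None:
        return text.strip(), None
    return match.group("analysis").strip(), match.group("output").strip()


if __name__ == "__main__":
    reply = (
        "analysis: The image shows a black velvet hanger with measurements "
        "labeled. The height of the hanger is 9.06 inches. , output : 9.06 inch "
    )
    analysis, output = parse_response(reply)
    print(output)
```

Returning None for the output, rather than raising, lets a batch job count malformed responses instead of stopping at the first one.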

Out-of-Scope Use

The model is not intended for tasks that require deep semantic understanding or creative interpretation beyond what is visible in the image.

Bias, Risks, and Limitations

The model may struggle with:

Unclear or text-heavy images.
Generalization to images outside its training data.

Recommendations

For tasks where precision is critical, users should verify the output, especially in cases requiring exact measurements.
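Part of this verification can be automated. The sketch below checks that an output is either 'NA' or a number followed by a unit, and expands shorthand units such as 'lb' and 'mg' as the prompt requests. It is an illustration under assumptions: `UNIT_ALIASES` and `normalize_output` are hypothetical names, and every alias entry other than 'lb' and 'mg' is assumed rather than taken from the challenge's unit list. It checks only the shape of the output, not whether the value matches the image.

```python
import re

# Shorthand-to-full-name map. Only 'lb' and 'mg' appear in the prompt;
# the other entries (and their spellings) are illustrative assumptions.
UNIT_ALIASES = {
    "lb": "pound",
    "lbs": "pound",
    "mg": "milligram",
    "kg": "kilogram",
    "cm": "centimetre",
    "in": "inch",
}

# A non-negative decimal number, optional whitespace, then a unit of one or more words.
VALUE_UNIT_RE = re.compile(
    r"^\s*(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[A-Za-z]+(?: [A-Za-z]+)*)\s*$"
)


def normalize_output(output):
    """Return 'NA', a normalized '<value> <unit>' string, or None if malformed."""
    if output.strip().upper() == "NA":
        return "NA"
    match = VALUE_UNIT_RE.match(output)
    if match is None:
        return None
    unit = match.group("unit").lower()
    unit = UNIT_ALIASES.get(unit, unit)  # unknown units pass through unchanged
    return f"{match.group('value')} {unit}"


if __name__ == "__main__":
    for raw in ["10cm", "10lb", "9.06 inch", "NA", "about ten"]:
        print(repr(raw), "->", normalize_output(raw))
```

Outputs that come back as None are good candidates for manual review.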

Training Data and Procedure

The model was fine-tuned on image datasets that emphasize product measurements. Training used fp16 mixed precision to reduce memory use and computation time.

vaibhavmeena/Phi-3.5-vision-instruct-amz-lora