Why are the weights separated?
Thanks for the great work!
I have some questions about the model.
Why are the model weights separated?
I saw that the LoRA weights are not merged, and that the model-loading code sets the LoRA adapter instead of merging it. What is the purpose of this? Does merging the weights harm performance? Are the LoRA weights the instruction-tuned weights?
Thank you for your question!
Currently we separate the base weights, the vision LoRA weights, and the speech LoRA weights, and use `set_lora_adapter` (https://huggingface.co./microsoft/Phi-4-multimodal-instruct/blob/main/modeling_phi4mm.py#L1980) to switch between them. The purpose is simply to make switching LoRA weights a bit easier. If your scenario only involves certain modalities (e.g., vision-language), I think it is better to merge the weights to get better speed. I don't think merging the weights will harm performance. Yes, the LoRA weights are instruction-tuned for the corresponding modalities.
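For anyone landing here later, a minimal sketch of what the adapter switching might look like, assuming the usual `transformers` loading path with `trust_remote_code=True` and that the `set_lora_adapter` method linked above takes the adapter name as a string; the `"vision"` / `"speech"` names and the placement of the generation calls are illustrative assumptions, so please check `modeling_phi4mm.py` for the exact interface:

```python
# Sketch only: adapter names below are assumptions; see modeling_phi4mm.py
# in the checkpoint for the actual adapter names and method signature.
from transformers import AutoModelForCausalLM

model_id = "microsoft/Phi-4-multimodal-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,  # pulls in modeling_phi4mm.py, which defines set_lora_adapter
    torch_dtype="auto",
)

# Activate the vision LoRA before a vision-language request ...
model.set_lora_adapter("vision")
# ... run image + text inference here ...

# ... then switch to the speech LoRA for a speech-language request.
model.set_lora_adapter("speech")
# ... run audio + text inference here ...
```

If you only ever use one modality, merging that adapter into the base weights (instead of switching at runtime) avoids the extra LoRA computation at inference time, which is where the speed benefit mentioned above comes from.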