Reason for using PerceiverResampler/Cross-Attention/IDEFICS related modality layers?
I was wondering why you are using a few ideas/layers from the IDEFICS model, like the PerceiverResampler, rather than just a linear projection/modality_projection before the input_merging. Have you found that it improves results/training, or is it mostly sticking to what IDEFICS had?
Also, it seems like the vision model being used (SiglipVisionModel) is not pretrained; is there a reason for that?
that's a great question!
the vision model (extracted from siglip) is pretrained; we are not starting from scratch.
as to the resampler, we found no to very minimal performance loss from pooling the vision hidden states into a short sequence that is then fed to the language model. the modality projection is mainly there to transform the image hidden size into the text hidden size for each of the positions.
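To make that concrete, here is a minimal PyTorch sketch of the two pieces described above: a linear modality projection that maps each vision position to the text hidden size, and a perceiver-resampler-style pooling in which learned latent queries cross-attend to the image tokens to produce a short sequence. Class names, latent count, and dimensions are illustrative assumptions, not the actual implementation (the real PerceiverResampler stacks several such layers with feed-forward blocks):

```python
import torch
import torch.nn as nn


class ModalityProjection(nn.Module):
    """Maps each vision position from the vision hidden size to the
    text hidden size (a simple linear projection)."""

    def __init__(self, vision_dim: int, text_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, text_dim)

    def forward(self, image_hidden_states: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, text_dim)
        return self.proj(image_hidden_states)


class PerceiverResamplerSketch(nn.Module):
    """Pools a long sequence of image hidden states into a short,
    fixed-length sequence: learned latent queries cross-attend to the
    image tokens. Hypothetical single-layer version for illustration."""

    def __init__(self, dim: int, num_latents: int = 64, num_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, image_hidden_states: torch.Tensor) -> torch.Tensor:
        # image_hidden_states: (batch, num_patches, dim)
        batch = image_hidden_states.shape[0]
        queries = self.latents.unsqueeze(0).expand(batch, -1, -1)
        pooled, _ = self.cross_attn(queries, image_hidden_states, image_hidden_states)
        # (batch, num_latents, dim): the short sequence fed to the language model
        return self.norm(pooled + queries)


# Usage: project assumed SigLIP-sized hidden states to the text width, then pool.
vision_dim, text_dim = 1152, 4096          # assumed hidden sizes
hidden = torch.randn(2, 729, vision_dim)   # e.g. 27x27 patches per image
short_seq = PerceiverResamplerSketch(text_dim)(
    ModalityProjection(vision_dim, text_dim)(hidden)
)
print(short_seq.shape)  # torch.Size([2, 64, 4096])
```

The point of the pooling step is simply sequence-length reduction: 729 image positions become 64, so the language model attends over far fewer vision tokens at little measured cost.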
It would be great if it could also take into account images in the website screenshot and all the small details in them.
yes, that's part of WebSight v0.2, which we are working on!