Building Your Own Multimodal Large Model from Scratch

For the Chinese version of the README, please refer to 中文文档.

Model Architecture 🤖

In the VLM (visual language model), the visual encoder is a CLIP or SigLIP model, whose features already have preliminary semantic alignment with text. A two-layer MLP maps these visual features into the language model's embedding space. The forward method of QWenModel is overridden so that the embeddings at the image-token positions are replaced with the mapped visual features.
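
The mapping and replacement steps can be illustrated with a minimal NumPy sketch. It is not the repository's implementation: the names `TwoLayerMLP`, `merge_visual_features`, and `IMAGE_TOKEN_ID`, the GELU activation, and all dimensions are assumptions chosen for illustration, and random arrays stand in for the outputs of the vision encoder and the language model's embedding layer.

```python
import numpy as np

IMAGE_TOKEN_ID = -1  # hypothetical placeholder id marking image positions


def gelu(x):
    # tanh approximation of GELU (the activation choice is an assumption)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


class TwoLayerMLP:
    """Maps vision-encoder features (vision_dim) into the LLM embedding space (llm_dim)."""

    def __init__(self, vision_dim, llm_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.normal(0.0, 0.02, (vision_dim, llm_dim))
        self.b1 = np.zeros(llm_dim)
        self.w2 = rng.normal(0.0, 0.02, (llm_dim, llm_dim))
        self.b2 = np.zeros(llm_dim)

    def __call__(self, feats):
        hidden = gelu(feats @ self.w1 + self.b1)
        return hidden @ self.w2 + self.b2


def merge_visual_features(input_ids, token_embeds, visual_embeds):
    """Replace the embeddings at image-token positions with projected visual features."""
    mask = input_ids == IMAGE_TOKEN_ID
    if mask.sum() != visual_embeds.shape[0]:
        raise ValueError("number of image tokens must match number of visual features")
    merged = token_embeds.copy()
    merged[mask] = visual_embeds
    return merged


# Toy dimensions; a real encoder and language model are much larger.
vision_dim, llm_dim, num_patches = 8, 16, 4
rng = np.random.default_rng(1)

patch_feats = rng.normal(size=(num_patches, vision_dim))   # stand-in for CLIP/SigLIP output
input_ids = np.array([5, 7, -1, -1, -1, -1, 9])            # four image-token placeholders
token_embeds = rng.normal(size=(len(input_ids), llm_dim))  # stand-in for text embeddings

projector = TwoLayerMLP(vision_dim, llm_dim)
merged = merge_visual_features(input_ids, token_embeds, projector(patch_feats))
print(merged.shape)  # (7, 16)
```

In an overridden forward method, the same substitution would happen after the token-embedding lookup and before the transformer layers, so the language model sees visual features wherever the prompt contains image tokens.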

GitHub Repository 🏠

The code for running the model can be found at Basic-Visual-Language-Model.

References 📚

Special thanks to the following projects for their great work 🙌:

Contact ✉

If you have any questions or ideas, feel free to reach out to me 😊:

[email protected]

I will respond as soon as I see your email!
