# Building Your Own Multimodal Large Model from Scratch
For the Chinese version of the README, please refer to the Chinese documentation (中文文档).
## Model Architecture
In the VLM (Visual Language Model), the visual component uses the CLIP or SIGLIP encoder, both of which already provide preliminary image-text semantic alignment. A two-layer MLP maps the visual features into the language model's embedding space. By overriding the `forward` method of `QWenModel`, the corresponding image placeholder tokens are replaced with the projected visual features.
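The sketch below illustrates this idea in PyTorch. It is only a minimal example under stated assumptions, not the repository's actual implementation: the names `VisualProjector` and `merge_visual_tokens`, the hidden dimension, and the single `image_token_id` placeholder scheme are all illustrative.

```python
import torch
import torch.nn as nn


class VisualProjector(nn.Module):
    """Two-layer MLP that maps visual encoder features into the LLM embedding space.

    Hypothetical sketch: dimensions and activation are assumptions, not the repo's values.
    """

    def __init__(self, vision_dim: int, llm_dim: int, hidden_dim: int = 2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, visual_features: torch.Tensor) -> torch.Tensor:
        # visual_features: (batch, num_patches, vision_dim) from CLIP/SIGLIP
        return self.mlp(visual_features)


def merge_visual_tokens(
    input_embeds: torch.Tensor,   # (batch, seq_len, llm_dim) text token embeddings
    input_ids: torch.Tensor,      # (batch, seq_len) token ids
    visual_embeds: torch.Tensor,  # (batch, num_patches, llm_dim) projected visual features
    image_token_id: int,          # id of the image placeholder token (assumed)
) -> torch.Tensor:
    """Replace the embeddings at image placeholder positions with visual features."""
    merged = input_embeds.clone()
    for b in range(input_ids.size(0)):
        positions = (input_ids[b] == image_token_id).nonzero(as_tuple=True)[0]
        # Assumes the prompt reserves one placeholder token per visual patch feature.
        merged[b, positions] = visual_embeds[b, : positions.numel()].to(merged.dtype)
    return merged
```

In an actual model, a merge step like this would typically run inside the overridden `forward` before the transformer layers, so the language model attends over text and image features in a single sequence.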
## GitHub Repository
The code for running the model can be found at Basic-Visual-Language-Model.
## References
Special thanks to the following projects for their great work:
- https://github.com/WatchTower-Liu/VLM-learning/tree/main
- https://github.com/QwenLM/Qwen
- https://github.com/haotian-liu/LLaVA
## Contact
If you have any questions or ideas, feel free to reach out to me:
I will respond as soon as I see your email!