[ Gemini Open Source Implementation 🤖 ]
This is an open source implementation of Gemini, the model that will supposedly "eclipse ChatGPT". It appears to take in all modalities directly, without a separate encoder of any kind, which means the encoding is built into the model itself.
The input sequences for Gemini consist of text, audio, images, and video. These inputs are transformed into tokens, which are then processed by a single transformer. Conditional decoding then takes place to generate image outputs.
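In other words: embed every modality into one sequence, run a shared transformer over it, then decode conditionally. Here's a minimal PyTorch sketch of that flow — all names and sizes here are my own assumptions, not the repo's actual API:

```python
import torch
import torch.nn as nn

class MultimodalPipeline(nn.Module):
    # Hypothetical sketch: any modality -> tokens/embeddings -> one shared
    # transformer -> a head that predicts image tokens for conditional decoding.
    def __init__(self, dim=512, vocab_size=32000, depth=6, heads=8, image_vocab=8192):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, depth)
        # assumed: image outputs are generated as discrete image tokens
        self.to_image_tokens = nn.Linear(dim, image_vocab)

    def forward(self, text_ids, other_modal_embeds=None):
        x = self.text_embed(text_ids)                      # (B, T, dim)
        if other_modal_embeds is not None:                 # audio/image/video
            x = torch.cat([other_modal_embeds, x], dim=1)  # one fused sequence
        h = self.transformer(x)
        return self.to_image_tokens(h)                     # logits over image tokens
```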
Interestingly, Gemini's architecture resembles Fuyu's but is expanded to encompass multiple modalities. Instead of using a Vision Transformer (ViT) encoder, Gemini simply feeds image embeddings directly into the transformer.
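For reference, Fuyu's trick is to linearly project flattened image patches straight into the transformer's embedding space — no ViT in between. A rough sketch of that projection (class name and sizes are placeholders, not the repo's code):

```python
import torch
import torch.nn as nn

class FuyuStylePatchEmbed(nn.Module):
    """Split the image into patches and linearly project each flattened
    patch directly into the transformer's embedding space, Fuyu-style."""
    def __init__(self, patch_size=16, channels=3, dim=512):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_size * patch_size * channels, dim)

    def forward(self, images):                  # (B, C, H, W)
        b, c, h, w = images.shape
        p = self.patch_size
        # (B, C, H/p, W/p, p, p) -> (B, num_patches, p*p*C)
        patches = images.unfold(2, p, p).unfold(3, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, p * p * c)
        return self.proj(patches)               # (B, num_patches, dim)

# e.g. FuyuStylePatchEmbed()(torch.randn(1, 3, 224, 224)) -> (1, 196, 512)
```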
For Gemini, the token inputs will likely be delimited by special modality tokens such as [IMG] or [AUDIO]. CoDi, a component of Gemini, also employs conditional generation and makes use of the tokenized outputs.
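Concretely, those markers would just be extra vocabulary ids whose embeddings delimit each modality's span in the fused sequence. A hedged sketch — the token ids and span ordering here are assumptions, not necessarily what the repo does:

```python
import torch

# Hypothetical ids for the modality markers, assumed to sit past the text vocab.
IMG_TOKEN, AUDIO_TOKEN = 32001, 32002

def build_multimodal_sequence(text_embeds, image_embeds, audio_embeds, token_embed):
    """Interleave modality spans, each prefixed by its marker token's
    embedding, into one (T, dim) sequence for the transformer."""
    img_marker = token_embed(torch.tensor([IMG_TOKEN]))      # (1, dim)
    audio_marker = token_embed(torch.tensor([AUDIO_TOKEN]))  # (1, dim)
    return torch.cat([
        img_marker, image_embeds,    # [IMG] followed by image patch embeddings
        audio_marker, audio_embeds,  # [AUDIO] followed by audio frame embeddings
        text_embeds,                 # then the text prompt embeddings
    ], dim=0)
```

Here `token_embed` is assumed to be the model's nn.Embedding, extended so the marker ids are in range.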
To implement this model effectively, I intend to focus first on the image embeddings to ensure their smooth integration, then incorporate audio embeddings, and finally video embeddings.
GITHUB:
https://github.com/kyegomez/Gemini
My thread for more implementation details:
https://twitter.com/KyeGomezB/status/1732487867107340622
Clem, would you be open to collaborating? We're almost done implementing this model, and we should democratize it to millions of humans around the world!
Great!
If the other modalities are encoded PaLI-style, then I don't think special modality tokens would be needed. LLaVA also does not use modality tokens.