AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
Abstract
We present Any-Modality Augmented Language Model (AnyMAL), a unified model that reasons over diverse input modality signals (i.e. text, image, video, audio, IMU motion sensor), and generates textual responses. AnyMAL inherits the powerful text-based reasoning abilities of the state-of-the-art LLMs including LLaMA-2 (70B), and converts modality-specific signals to the joint textual space through a pre-trained aligner module. To further strengthen the multimodal LLM's capabilities, we fine-tune the model with a multimodal instruction set manually collected to cover diverse topics and tasks beyond simple QAs. We conduct comprehensive empirical analysis comprising both human and automatic evaluations, and demonstrate state-of-the-art performance on various multimodal tasks.
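For readers wondering how such an aligner module might look in practice, below is a minimal PyTorch sketch of the general idea described in the abstract: a trainable projection that maps features from a frozen modality encoder into a short sequence of "soft tokens" in the LLM's embedding space, which are then prepended to the text embeddings. The class name, dimensions, and token count here are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn


class ModalityAligner(nn.Module):
    """Illustrative aligner: projects pooled features from a frozen modality
    encoder (e.g. an image encoder) into the LLM's token-embedding space.
    All sizes below are placeholder assumptions, not the paper's configuration."""

    def __init__(self, enc_dim: int = 1024, llm_dim: int = 4096, num_tokens: int = 32):
        super().__init__()
        self.num_tokens = num_tokens
        self.llm_dim = llm_dim
        # Simple linear projection producing `num_tokens` soft tokens per input.
        self.proj = nn.Linear(enc_dim, num_tokens * llm_dim)

    def forward(self, enc_features: torch.Tensor) -> torch.Tensor:
        # enc_features: (batch, enc_dim) pooled output of a frozen modality encoder.
        batch = enc_features.shape[0]
        soft_tokens = self.proj(enc_features).view(batch, self.num_tokens, self.llm_dim)
        # These soft tokens would be concatenated with text token embeddings
        # before being fed to the (frozen) LLM.
        return soft_tokens


if __name__ == "__main__":
    aligner = ModalityAligner()
    fake_image_features = torch.randn(2, 1024)  # stand-in for CLIP-style encoder output
    print(aligner(fake_image_features).shape)   # torch.Size([2, 32, 4096])
```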
Community
Is there a GitHub repo?
I found this on GitHub, but I'm not sure if it's legit, and it's just a bare template at the moment.
Are you still putting this GitHub repo together?
https://github.com/kyegomez/AnyMAL
The account has many repositories for known papers, but they do not work and may have been written with AI. Please don't use/update/star them.
Are model weights or training instructions (for training from LLaMA-2 on audio, image, and video) available anywhere?