|
---
license: apache-2.0
tags:
- merge
base_model:
- CohereForAI/aya-23-8B
- google/siglip-base-patch16-256-multilingual
datasets:
- maya-multimodal/pretrain
- MBZUAI/palo_multilingual_dataset
language:
- en
- hi
- fr
- ru
- zh
- ar
- ja
- es
pipeline_tag: image-text-to-text
library_name: transformers
---
|
|
|
# Maya: A Multilingual Vision Language Model |
|
|
|
Maya is an instruction-finetuned multilingual multimodal model that expands multimodal capabilities to eight languages with an emphasis on data quality and cultural sensitivity. Built on the LLaVA framework, Maya includes a newly created pre-training dataset designed to support multilingual and culturally aware VLM development. |
|
|
|
## Model Description |
|
|
|
- **Developed by:** Cohere For AI Community |
|
- **Model type:** Multimodal Vision-Language Model |
|
- **Language(s):** English, Chinese, French, Spanish, Russian, Japanese, Arabic, Hindi |
|
- **License:** Apache 2.0 |
|
- **Related Paper:** [Maya: An Instruction Finetuned Multilingual Multimodal Model](https://arxiv.org/abs/2412.07112) |
|
|
|
## Model Details |
|
|
|
Maya uses a lightweight architecture to provide a compact yet capable multimodal model, with several key features:
|
|
|
- Built on the LLaVA framework, using the Aya-23 8B language model

- Uses SigLIP for vision encoding, chosen for its multilingual adaptability

- Supports 8 languages with strong cultural understanding

- Trained on a toxicity-filtered dataset for safer deployment
|
|
|
### Model Architecture |
|
|
|
- **Base Model:** Aya-23 8B |
|
- **Vision Encoder:** SigLIP (multilingual) |
|
- **Training Data:** 558,000 images with multilingual annotations |
|
- **Context Length:** 8K tokens |
|
- **Parameters:** 8 billion |
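
For orientation, the sketch below shows how the two published components listed above can be loaded from the Hugging Face Hub. This is not the official inference path (that goes through the GitHub repository; see Usage below), and it omits the trained projector that connects the vision encoder to the language model. Note that the Aya-23 weights are gated on the Hub and may require accepting the model license first.

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    SiglipImageProcessor,
    SiglipVisionModel,
)

# Multilingual SigLIP vision tower (16x16 patches, 256px input)
vision_tower = SiglipVisionModel.from_pretrained(
    "google/siglip-base-patch16-256-multilingual"
)
image_processor = SiglipImageProcessor.from_pretrained(
    "google/siglip-base-patch16-256-multilingual"
)

# Aya-23 8B multilingual language model (8K-token context)
tokenizer = AutoTokenizer.from_pretrained("CohereForAI/aya-23-8B")
language_model = AutoModelForCausalLM.from_pretrained("CohereForAI/aya-23-8B")
```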
|
|
|
## Intended Uses |
|
|
|
Maya is designed for: |
|
|
|
- Multilingual visual question answering |
|
- Cross-cultural image understanding |
|
- Image captioning in multiple languages |
|
- Visual reasoning tasks |
|
- Document understanding |
|
|
|
## Usage |
|
|
|
```bash
# Clone the GitHub repository
git clone https://github.com/nahidalam/maya

# Change the working directory
cd maya
```
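
Depending on your environment, you will likely need to install the repository's dependencies before running the example; check the repository's README for the exact steps. An editable install is typical for LLaVA-based codebases (illustrative, not verified against this repository's setup):

```bash
# Illustrative only — follow the repository README for the exact setup
pip install -e .
```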
|
|
|
```python
# Import the VQA helper from the Maya repository
from llava.eval.talk2maya import run_vqa_model

# Define inputs
question = "Try to identify what plane this is, based on the design."
image_path = "./llava/eval/claude_plane_test_2.jpeg"

# Run the model and print its answer
answer = run_vqa_model(
    question=question,
    image_file=image_path
)
print(answer)
```
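
Because Maya supports prompts in all eight languages, the same helper can be queried in any of them. A brief illustrative sketch (the image path and prompts below are placeholders, not files shipped with the repository):

```python
from llava.eval.talk2maya import run_vqa_model

# Placeholder image path; substitute your own file
image_path = "./example.jpg"

# The same question in three of the eight supported languages
questions = [
    "What is shown in this image?",       # English
    "¿Qué se muestra en esta imagen?",    # Spanish
    "इस चित्र में क्या दिखाया गया है?",          # Hindi
]

for question in questions:
    print(run_vqa_model(question=question, image_file=image_path))
```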
|
|
|
## Limitations |
|
|
|
- Limited to 8 languages currently |
|
- Requires high-quality images for optimal performance |
|
- May not capture nuanced cultural contexts in all cases |
|
- Performance varies across languages and tasks |
|
|
|
## Bias, Risks, and Limitations |
|
|
|
Maya has been developed with attention to bias mitigation and safety: |
|
|
|
- Dataset filtered for toxic content |
|
- Cultural sensitivity evaluations performed |
|
- Regular bias assessments conducted |
|
- Training restricted to high-quality, vetted data
|
|
|
However, users should be aware that: |
|
- Model may still exhibit biases present in training data |
|
- Performance may vary across different cultural contexts |
|
- Not suitable for critical decision-making applications |
|
|
|
## Training Details |
|
|
|
Maya was trained using the following data and setup (summarized as a config sketch after this list):
|
- 558,000 curated images |
|
- Multilingual annotations in 8 languages |
|
- Toxicity-filtered dataset |
|
- 8x NVIDIA H100 GPUs (80GB memory each)
|
- Batch size of 32 (per device) |
|
- Learning rate of 1e-3 with cosine scheduler |
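
For reference, the reported pretraining setup collected into a single config; the field names here are illustrative, not taken from the training code:

```python
# Reported Maya pretraining setup; keys are illustrative, values as above
pretrain_config = {
    "num_images": 558_000,
    "languages": ["en", "hi", "fr", "ru", "zh", "ar", "ja", "es"],
    "hardware": "8x NVIDIA H100 (80GB)",
    "per_device_batch_size": 32,
    "learning_rate": 1e-3,
    "lr_scheduler": "cosine",
}
```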
|
|
|
## Citation |
|
|
|
```bibtex
@misc{alam2024mayainstructionfinetunedmultilingual,
      title={Maya: An Instruction Finetuned Multilingual Multimodal Model},
      author={Nahid Alam and Karthik Reddy Kanjula and Surya Guthikonda and Timothy Chung and Bala Krishna S Vegesna and Abhipsha Das and Anthony Susevski and Ryan Sze-Yin Chan and S M Iftekhar Uddin and Shayekh Bin Islam and Roshan Santhosh and Snegha A and Drishti Sharma and Chen Liu and Isha Chaturvedi and Genta Indra Winata and Ashvanth. S and Snehanshu Mukherjee and Alham Fikri Aji},
      year={2024},
      eprint={2412.07112},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.07112},
}
```
|
|
|
## Contact |
|
|
|
For questions or feedback about Maya, please: |
|
- Open an issue on our [GitHub repository](https://github.com/nahidalam/maya) |
|
- Contact the maintainers at: [email protected], [email protected] |