LLaVA-NDiNO_pt / README.md
m-elio's picture
Update README.md
f1f0da4 verified
metadata
license: llama3
datasets:
  - google/wit
  - coastalcph/multi_eurlex
language:
  - it
base_model:
  - meta-llama/Meta-Llama-3-8B
  - openai/clip-vit-large-patch14-336

Model Card for LLaVA-NDiNO_pt

Model description

LLaVA-NDiNO is a family of Large Vision Language Models (LVLMs) trained for the Italian language.

LLaVA-NDiNO_pt is a pre-trained model that has been trained over three different types of image-text data:

  • Wikipedia Image-Text Sections: Wikipedia image together with the text section in which the image appears
  • Wikipedia Image-Text Captions: Wikipedia image together with its caption
  • OCR PDF Documents: text in PDF documents extracted using Tesseract from MultiEurlex

If you are interested in more details regarding the training procedure, you can find the code we used at the following link:

  • Repository: https://github.com/swapUniba/LLaVA-NDiNO

  • Developed by: Elio Musacchio, Lucia Siciliani, Pierpaolo Basile, Giovanni Semeraro

  • Funded by: PNRR project FAIR - Future AI Research

  • Compute infrastructure: Leonardo supercomputer

  • Model type: LLaMA 3 + CLIP

  • Language(s) (NLP): Italian

  • License: Llama 3 Community License

Example usage

The model is not intended to be used without fine-tuning. It is recommended to further train it using the LLaVA-NeXT codebase.

Citation

@inproceedings{musacchioLLaVANDiNO,
  title={LLaVA-NDiNO: Empowering LLMs with Multimodality for the Italian Language},
  author={Musacchio, Elio and Siciliani, Lucia and Basile, Pierpaolo and Semeraro, Giovanni},
  booktitle={Proceedings of the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI 2024) co-located with 23th International Conference of the Italian Association for Artificial Intelligence (AI*IA 2024)},
  year={2024}
}