	---
	license: llama3
	datasets:
	- swap-uniba/the_cauldron_ita
	language:
	- it
	base_model:
	- meta-llama/Meta-Llama-3-8B
	- openai/clip-vit-large-patch14-336
	pipeline_tag: text-generation
	---

	# Model Card for LLaVA-NDiNO_short_it

	## Model description

	<!-- Provide a quick summary of what the model is/does. -->

	LLaVA-NDiNO is a family of Large Vision Language Models (LVLMs) that have been trained for the Italian language.

	The model was trained by instruction-tuning [LLaMA 3 8B Base](https://huggingface.co/meta-llama/Meta-Llama-3-8B) and [CLIP Large 336](https://huggingface.co/openai/clip-vit-large-patch14-336) on an Italian machine-translated version of [The Cauldron](https://huggingface.co/datasets/HuggingFaceM4/the_cauldron).

	For more details on the training procedure, the code we used is available at the following link:
	- Repository: https://github.com/swapUniba/LLaVA-NDiNO

	- Developed by: Elio Musacchio, Lucia Siciliani, Pierpaolo Basile, Giovanni Semeraro
	- Funded by: PNRR project FAIR - Future AI Research
	- Compute infrastructure: [Leonardo](https://www.hpc.cineca.it/systems/hardware/leonardo/) supercomputer
	- Model type: LLaMA 3 + CLIP
	- Language(s) (NLP): Italian
	- License: Llama 3 Community License
	- Finetuned from model: [swap-uniba/LLaVA-NDiNO_pt](https://huggingface.co/swap-uniba/LLaVA-NDiNO_pt)


	## Example Usage

	```python
	import torch
	import requests

	from PIL import Image
	from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration, set_seed

	model_name = "swap-uniba/LLaVA-NDiNO_short_it"

	processor = LlavaNextProcessor.from_pretrained(model_name)
	model = LlavaNextForConditionalGeneration.from_pretrained(model_name, torch_dtype=torch.bfloat16, low_cpu_mem_usage=True, device_map="auto")

	url = "https://www.barnorama.com/wp-content/uploads/2016/12/03-Confusing-Pictures.jpg"
	image = Image.open(requests.get(url, stream=True).raw)

	# Llama 3 style chat template (Jinja); the first message is prefixed with the BOS token
	chat_template = "{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>\n\n' }}{% endif %}"

	conversation = [
	    {
	        "role": "user",
	        "content": "<image>\nCosa c'è di strano in questa immagine?"
	    },
	]

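	prompt = processor.apply_chat_template(conversation, chat_template=chat_template, add_generation_prompt=True)
	# Pass text and images by keyword and move the inputs to the model's device
	inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)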

	set_seed(42)
	output = model.generate(**inputs, max_new_tokens=4096)

	print(processor.decode(output[0][inputs.input_ids.shape[1]:]))
	```
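
To make the prompt format easier to inspect, the chat template used above can be mirrored in plain Python. This is only a minimal sketch: `render_llama3_prompt` is a hypothetical helper (not part of `transformers`), and it assumes the BOS token is `<|begin_of_text|>`, as in the Llama 3 tokenizer. It is meant for checking what the template renders, not as a replacement for `processor.apply_chat_template`.

```python
# Hypothetical helper mirroring the Jinja chat template from the example above.
# Assumption: the BOS token is "<|begin_of_text|>" (the Llama 3 default).
BOS_TOKEN = "<|begin_of_text|>"


def render_llama3_prompt(messages, add_generation_prompt=True, bos_token=BOS_TOKEN):
    """Render a list of {"role", "content"} dicts into a single prompt string."""
    parts = []
    for index, message in enumerate(messages):
        # Header with the role, then the trimmed content, then the end-of-turn token
        content = (
            "<|start_header_id|>" + message["role"] + "<|end_header_id|>\n\n"
            + message["content"].strip()
            + "<|eot_id|>"
        )
        # Only the first message is prefixed with the BOS token
        if index == 0:
            content = bos_token + content
        parts.append(content)
    if add_generation_prompt:
        # Open the assistant turn so that generation continues from here
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


conversation = [
    {
        "role": "user",
        "content": "<image>\nCosa c'è di strano in questa immagine?",
    },
]

prompt = render_llama3_prompt(conversation)
print(prompt)
```

The `<image>` placeholder stays inside the user turn; the processor is what later associates it with the image features.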

	## Citation

	```
	@inproceedings{musacchioLLaVANDiNO,
	title={LLaVA-NDiNO: Empowering LLMs with Multimodality for the Italian Language},
	author={Musacchio, Elio and Siciliani, Lucia and Basile, Pierpaolo and Semeraro, Giovanni},
	booktitle={Proceedings of the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI 2024) co-located with 23rd International Conference of the Italian Association for Artificial Intelligence (AI*IA 2024)},
	year={2024}
	}
	```