FreedomIntelligence/LongLLaVA-9B


	---
	license: mit
	library_name: transformers
	pipeline_tag: image-text-to-text
	---
	![header](./assets/assets_header.png)

	<p align="center">
	📃 <a href="https://arxiv.org/abs/2409.02889" target="_blank">Paper</a> • 🌐 <a href="" target="_blank">Demo</a> • 📃 <a href="https://github.com/FreedomIntelligence/LongLLaVA" target="_blank">GitHub</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/LongLLaVA-53B-A13B" target="_blank">LongLLaVA-53B-A13B</a>
	</p>

	![efficiency](./assets/assets_singleGPU.png)


	## 🌈 Update

	* [2024.09.05] LongLLaVA repo is published! 🎉 The Code will

	## Architecture

	<details>
	<summary>Click to view the architecture image</summary>

	![Architecture Image](./assets/assets_arch.png)

	</details>


	## Results

	<details>
	<summary>Click to view the Results</summary>

	- Main Results
	![Main Results](./assets/assets_result1.png)
	- Diagnostic Results
	![Diagnostic Results](./assets/assets_diaresult.png)
	- Video-NIAH
	![Video-NIAH](./assets/assets_NIAH.png)

	</details>



	## Results reproduction


	### Evaluation

	- Preparation

	Get the model inference code from [GitHub](https://github.com/FreedomIntelligence/LongLLaVA).

	```bash
	git clone https://github.com/FreedomIntelligence/LongLLaVA.git
	cd LongLLaVA
	```

	- Environment Setup

	```bash
	pip install -r requirements.txt
	```
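After installing, it can help to confirm that the key dependencies are importable before loading the model. The sketch below is not part of the LongLLaVA repository and uses only the standard library. The package names `torch` and `transformers` are assumptions (the model card lists `transformers` as its library); check `requirements.txt` for the actual list.

```python
import importlib.util


def missing_packages(names):
    """Return the names from `names` that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed package names; replace with the entries in requirements.txt.
missing = missing_packages(["torch", "transformers"])
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All checked packages are importable.")
```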


	- Command Line Interface

	```bash
	python cli.py --model_dir path-to-longllava
	```
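`path-to-longllava` stands for a local directory that holds the model weights. One possible way to obtain it, sketched here under assumptions, is `snapshot_download` from the `huggingface_hub` package, which this README does not mention and which would need to be installed separately. The helper names and the `./LongLLaVA-9B` target directory are hypothetical; only `cli.py` and `--model_dir` come from the command above.

```python
import subprocess
import sys


def fetch_weights(local_dir="./LongLLaVA-9B"):
    """Download the model repo and return the local path (needs huggingface_hub)."""
    from huggingface_hub import snapshot_download  # lazy import; assumed installed

    return snapshot_download(
        repo_id="FreedomIntelligence/LongLLaVA-9B", local_dir=local_dir
    )


def build_cli_command(model_dir):
    """Build the argument list for `python cli.py --model_dir <model_dir>`."""
    return [sys.executable, "cli.py", "--model_dir", str(model_dir)]


def launch_cli(model_dir):
    """Start the CLI; run this from the root of the cloned LongLLaVA repository."""
    return subprocess.run(build_cli_command(model_dir), check=True)
```

Calling `launch_cli(fetch_weights())` would then download the weights and start the interactive session in one step.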


	- Model Inference

	```python
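	from cli import Chatbot  # cli.py from the cloned LongLLaVA repository

	query = 'What does the picture show?'
	image_paths = ['image_path1']  # image or video path

	# Replace 'path-to-longllava' with the local directory holding the model weights.
	bot = Chatbot('path-to-longllava')
	output = bot.chat(query, image_paths)
	print(output)  # prints the model's response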
	```
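Because `image_paths` is a list, the same call pattern can presumably carry several images at once. The following is a minimal sketch of that idea, not code from the repository: `collect_images` and `describe_folder` are hypothetical helpers, the extension set is an assumption, and `bot` can be any object exposing the `chat(query, image_paths)` method used above.

```python
from pathlib import Path

# Assumed set of image extensions; adjust to the files you actually have.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def collect_images(folder):
    """Return sorted paths (as strings) of image files directly inside `folder`."""
    return sorted(
        str(p) for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def describe_folder(bot, folder, query="What do the pictures show?"):
    """Send every image in `folder` to `bot.chat` in a single call."""
    image_paths = collect_images(folder)
    if not image_paths:
        raise FileNotFoundError(f"No images found in {folder}")
    return bot.chat(query, image_paths)
```

With the `Chatbot` class from `cli.py`, this would be called as `describe_folder(Chatbot('path-to-longllava'), 'my_images')`, where `my_images` is a placeholder directory name.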


	## Acknowledgement

	- [LLaVA](https://github.com/haotian-liu/LLaVA): Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

	## Citation

	```
	@misc{wang2024longllavascalingmultimodalllms,
	title={LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture},
	author={Xidong Wang and Dingjie Song and Shunian Chen and Chen Zhang and Benyou Wang},
	year={2024},
	eprint={2409.02889},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2409.02889},
	}
	```