---
license: mit
library_name: transformers
pipeline_tag: image-text-to-text
---

![header](./assets/assets_header.png)

<p align="center">
<a href="https://arxiv.org/abs/2409.02889" target="_blank">Paper</a> • <a href="" target="_blank">Demo</a> • <a href="https://github.com/FreedomIntelligence/LongLLaVA" target="_blank">GitHub</a> • 🤗 <a href="https://huggingface.co/FreedomIntelligence/LongLLaVA-53B-A13B" target="_blank">LongLLaVA-53B-A13B</a>
</p>

![efficiency](./assets/assets_singleGPU.png)

## Update

* **[2024.09.05]** LongLLaVA repo is published! The Code will

## Architecture

<details>
<summary>Click to view the architecture image</summary>

![Architecture Image](./assets/assets_arch.png)

</details>

## Results

<details>
<summary>Click to view the Results</summary>

- Main Results

![Main Results](./assets/assets_result1.png)

- Diagnostic Results

![Diagnostic Results](./assets/assets_diaresult.png)

- Video-NIAH

![Video-NIAH](./assets/assets_NIAH.png)

</details>

## Results reproduction

### Evaluation

- Preparation

Get the model inference code from [GitHub](https://github.com/FreedomIntelligence/LongLLaVA).

```bash
git clone https://github.com/FreedomIntelligence/LongLLaVA.git
cd LongLLaVA
```

- Environment Setup

```bash
# Run from the root of the cloned LongLLaVA repository
pip install -r requirements.txt
```
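
After installation, a quick check that the key packages can be imported may save a failed model load later. The sketch below is not part of the LongLLaVA repository: `missing_packages` is a hypothetical helper, and the package list (`torch`, `transformers`) is an assumption, since the contents of `requirements.txt` are not shown here.

```python
import importlib.util

# Assumed package list: the actual contents of requirements.txt are not shown here.
ASSUMED_PACKAGES = ["torch", "transformers"]


def missing_packages(names):
    """Return the top-level package names that the import system cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(ASSUMED_PACKAGES)
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("All assumed packages are importable.")
```

If anything is reported missing, re-running the `pip install` command in the same Python environment is the first thing to try.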

- Command Line Interface

```bash
python cli.py --model_dir path-to-longllava
```

- Model Inference

```python
from cli import Chatbot  # cli.py from the LongLLaVA repository

query = 'What does the picture show?'
image_paths = ['image_path1']  # image or video path

model_dir = 'path-to-longllava'  # placeholder: replace with the local model path
bot = Chatbot(model_dir)
output = bot.chat(query, image_paths)
print(output)  # Prints the output of the model
```
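
Since `image_paths` is a list, several images can presumably be sent in one call. The sketch below gathers every image in a folder and forwards it to `bot.chat`. Both helpers (`collect_image_paths`, `ask_about_images`) and the extension set are hypothetical and not part of the LongLLaVA repository, and it assumes `bot.chat(query, image_paths)` accepts a list with more than one path.

```python
from pathlib import Path

# Hypothetical set of extensions treated as images; adjust as needed.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}


def collect_image_paths(directory):
    """Return the sorted paths of image files located directly in `directory`."""
    root = Path(directory)
    return sorted(
        str(p)
        for p in root.iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )


def ask_about_images(bot, query, directory):
    """Send every image found in `directory` to `bot.chat` in a single call.

    `bot` is any object with a `chat(query, image_paths)` method, such as
    the `Chatbot` instance created in the snippet above.
    """
    image_paths = collect_image_paths(directory)
    if not image_paths:
        raise FileNotFoundError(f"no images found in {directory}")
    return bot.chat(query, image_paths)
```

With the `bot` from the previous snippet, a call would look like `ask_about_images(bot, 'What do these pictures show?', 'path-to-image-folder')`, where the folder path is a placeholder.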

## Acknowledgement

- [LLaVA](https://github.com/haotian-liu/LLaVA): Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

## Citation

```
@misc{wang2024longllavascalingmultimodalllms,
      title={LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture},
      author={Xidong Wang and Dingjie Song and Shunian Chen and Chen Zhang and Benyou Wang},
      year={2024},
      eprint={2409.02889},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.02889},
}
```