---
license: apache-2.0
base_model:
- InternVL/InternVL2-26B
---
## SpiritSight Agent: Advanced GUI Agent with One Look
<p align="center">
<a href="https://arxiv.org/abs/2503.03196">📄 Paper</a> •
<a href="https://huggingface.co./SenseLLM/SpiritSight-Agent-26B">🤖 Models</a> •
<a href="" style="pointer-events: none">📚 Datasets (Coming soon…)</a>
</p>
## Introduction
SpiritSight-Agent is a vision-based, end-to-end GUI agent that excels in GUI navigation tasks across various GUI platforms.
![](results.png)
![](results2.png)
## Models
We recommend fine-tuning the base model on custom data.
| Model | Checkpoint | Size | License |
|:-------|:------------|:------|:--------|
| SpiritSight-Agent-2B-base | 🤗 [HF Link](https://huggingface.co./SenseLLM/SpiritSight-Agent-2B) | 2B | [InternVL](https://github.com/OpenGVLab/InternVL/blob/main/LICENSE) |
| SpiritSight-Agent-8B-base | 🤗 [HF Link](https://huggingface.co./SenseLLM/SpiritSight-Agent-8B) | 8B | [InternVL](https://github.com/OpenGVLab/InternVL/blob/main/LICENSE) |
| SpiritSight-Agent-26B-base | 🤗 [HF Link](https://huggingface.co./SenseLLM/SpiritSight-Agent-26B) | 26B | [InternVL](https://github.com/OpenGVLab/InternVL/blob/main/LICENSE) |
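
To fetch one of the checkpoints listed above for local fine-tuning, a helper along these lines may be convenient. This is a minimal sketch, not part of the released code: the repository IDs are copied from the table, while the names `CHECKPOINTS`, `repo_id_for`, and `download_checkpoint` are hypothetical, and it assumes the `huggingface_hub` package is installed.

```python
# Repository IDs copied from the model table; the helper names are illustrative.
CHECKPOINTS = {
    "2B": "SenseLLM/SpiritSight-Agent-2B",
    "8B": "SenseLLM/SpiritSight-Agent-8B",
    "26B": "SenseLLM/SpiritSight-Agent-26B",
}


def repo_id_for(size):
    """Map a size label such as '8B' (case-insensitive) to its Hub repository ID."""
    key = size.strip().upper()
    if key not in CHECKPOINTS:
        raise ValueError(f"Unknown size {size!r}; expected one of {sorted(CHECKPOINTS)}")
    return CHECKPOINTS[key]


def download_checkpoint(size, local_dir=None):
    """Download the full repository snapshot and return its local path."""
    # Imported lazily so repo_id_for works even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id_for(size), local_dir=local_dir)
```

Calling `download_checkpoint("8B")` would download the 8B weights into the default Hugging Face cache; pass `local_dir` to place them elsewhere.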
## Datasets
Coming soon.
## Inference
```shell
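# Create and activate an isolated environment
conda create -n spiritsight-agent python=3.9
conda activate spiritsight-agent
# Install dependencies; flash-attn is installed separately, without build isolation
pip install -r requirements.txt
pip install flash-attn==2.3.6 --no-build-isolation
# Run inference with the 26B model
python infer_SSAgent-26B.py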
```
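
Before running the inference script, it can help to confirm that the environment matches the versions pinned in the commands above. The following is a rough sketch of such a check, not something shipped with this repo: `check_environment` is a hypothetical helper, and it assumes the package is registered under the distribution name `flash-attn`.

```python
import importlib.metadata
import sys

# Versions taken from the install commands in this section.
EXPECTED_PYTHON = (3, 9)
EXPECTED_FLASH_ATTN = "2.3.6"


def check_environment():
    """Return a dict describing whether the environment matches the pinned versions."""
    report = {
        "python_version": "{}.{}".format(*sys.version_info[:2]),
        "python_matches": sys.version_info[:2] == EXPECTED_PYTHON,
    }
    try:
        # Assumption: the installed distribution is named "flash-attn".
        installed = importlib.metadata.version("flash-attn")
    except importlib.metadata.PackageNotFoundError:
        installed = None
    report["flash_attn_version"] = installed
    report["flash_attn_matches"] = installed == EXPECTED_FLASH_ATTN
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

A `False` entry in the report only means the environment differs from the pinned setup; other versions may still work but are not covered here.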
## Citation
If you find this repo useful for your research, please cite our paper:
```
@misc{huang2025spiritsightagentadvancedgui,
title={SpiritSight Agent: Advanced GUI Agent with One Look},
author={Zhiyuan Huang and Ziming Cheng and Junting Pan and Zhaohui Hou and Mingjie Zhan},
year={2025},
eprint={2503.03196},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2503.03196},
}
```
## Acknowledgments
We thank the following amazing projects that truly inspired us:
- [InternVL2](https://huggingface.co./OpenGVLab/InternVL2-8B)
- [SeeClick](https://github.com/njucckevin/SeeClick)
- [Mind2Web](https://huggingface.co./datasets/osunlp/Multimodal-Mind2Web)
- [GUI-Odyssey](https://github.com/OpenGVLab/GUI-Odyssey)
- [AMEX](https://huggingface.co./datasets/Yuxiang007/AMEX)
- [AndroidControl](https://github.com/google-research/google-research/tree/master/android_control)
- [GUICourse](https://github.com/yiye3/GUICourse)