---
license: apache-2.0
base_model:
- InternVL/InternVL2-26B
---

## SpiritSight Agent: Advanced GUI Agent with One Look

<p align="center">
<a href="https://arxiv.org/abs/2503.03196">Paper</a> •
<a href="https://huggingface.co/SenseLLM/SpiritSight-Agent-26B">🤗 Models</a> •
<a href="" style="pointer-events: none">Datasets (Coming soon…)</a>
</p>

## Introduction

SpiritSight-Agent is a vision-based, end-to-end GUI agent that excels in GUI navigation tasks across various GUI platforms.





## Models

We recommend fine-tuning the base model on custom data.

| Model | Checkpoint | Size | License |
|:-------|:------------|:------|:--------|
| SpiritSight-Agent-2B-base | 🤗 [HF Link](https://huggingface.co/SenseLLM/SpiritSight-Agent-2B) | 2B | [InternVL](https://github.com/OpenGVLab/InternVL/blob/main/LICENSE) |
| SpiritSight-Agent-8B-base | 🤗 [HF Link](https://huggingface.co/SenseLLM/SpiritSight-Agent-8B) | 8B | [InternVL](https://github.com/OpenGVLab/InternVL/blob/main/LICENSE) |
| SpiritSight-Agent-26B-base | 🤗 [HF Link](https://huggingface.co/SenseLLM/SpiritSight-Agent-26B) | 26B | [InternVL](https://github.com/OpenGVLab/InternVL/blob/main/LICENSE) |

## Datasets

Coming soon.

## Inference

```shell
# Create and activate a dedicated environment
conda create -n spiritsight-agent python=3.9
conda activate spiritsight-agent

# Install dependencies
pip install -r requirements.txt
pip install flash-attn==2.3.6 --no-build-isolation

# Run inference with the 26B model
python infer_SSAgent-26B.py
```
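
Before launching the inference script, it can help to confirm that the environment is complete. The snippet below is a minimal sketch and is not part of this repository: `check_environment` and `REQUIRED_MODULES` are hypothetical names, and the presence of `torch` and `transformers` in `requirements.txt` is an assumption. Only `flash-attn` is named explicitly in the steps above.

```python
import importlib.util
import sys

# Hypothetical module list: flash_attn comes from the install step above;
# torch and transformers are ASSUMED to be provided by requirements.txt.
REQUIRED_MODULES = ("torch", "transformers", "flash_attn")


def check_environment(modules=REQUIRED_MODULES, min_python=(3, 9)):
    """Return a dict mapping each check name to True (ok) or False (missing)."""
    report = {"python": sys.version_info[:2] >= min_python}
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for check, ok in check_environment().items():
        print(f"{check}: {'ok' if ok else 'MISSING'}")
```

Run it inside the activated `spiritsight-agent` environment. Any entry reported as `MISSING` suggests that one of the install steps did not complete.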

## Citation

If you find this repo useful for your research, please cite our paper:

```
@misc{huang2025spiritsightagentadvancedgui,
      title={SpiritSight Agent: Advanced GUI Agent with One Look},
      author={Zhiyuan Huang and Ziming Cheng and Junting Pan and Zhaohui Hou and Mingjie Zhan},
      year={2025},
      eprint={2503.03196},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.03196},
}
```

## Acknowledgments

We thank the following amazing projects that truly inspired us:

- [InternVL2](https://huggingface.co/OpenGVLab/InternVL2-8B)
- [SeeClick](https://github.com/njucckevin/SeeClick)
- [Mind2Web](https://huggingface.co/datasets/osunlp/Multimodal-Mind2Web)
- [GUI-Odyssey](https://github.com/OpenGVLab/GUI-Odyssey)
- [AMEX](https://huggingface.co/datasets/Yuxiang007/AMEX)
- [AndroidControl](https://github.com/google-research/google-research/tree/master/android_control)
- [GUICourse](https://github.com/yiye3/GUICourse)