metadata
license: apache-2.0
base_model:
  - InternVL/InternVL2-26B

SpiritSight Agent: Advanced GUI Agent with One Look

📄 Paper • 🤖 Models • 📚 Datasets (Coming soon…)

Introduction

SpiritSight-Agent is a vision-based, end-to-end GUI agent that excels in GUI navigation tasks across various GUI platforms.

Models

We recommend fine-tuning the base model on custom data.

| Model | Checkpoint | Size | License |
|---|---|---|---|
| SpiritSight-Agent-2B-base | 🤗 HF Link | 2B | InternVL |
| SpiritSight-Agent-8B-base | 🤗 HF Link | 8B | InternVL |
| SpiritSight-Agent-26B-base | 🤗 HF Link | 26B | InternVL |

Datasets

Coming soon.

Inference

conda create -n spiritsight-agent python=3.9
conda activate spiritsight-agent

pip install -r requirements.txt
pip install flash-attn==2.3.6 --no-build-isolation

python infer_SSAgent-26B.py
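
Before launching the inference script, it can help to confirm that the environment matches the setup steps above. The following is a minimal sketch, not part of this repository: `check_environment` is a hypothetical helper that checks the Python version against 3.9 (the version used in the `conda create` command) and verifies that `flash_attn`, the import name of the `flash-attn` package, can be found. The contents of `requirements.txt` are not shown here, so any other package names you pass in are your own assumption.

```python
import importlib.util
import sys


def check_environment(min_python=(3, 9), required=("flash_attn",)):
    """Return a list of problems found; an empty list means all checks passed.

    Hypothetical helper, not shipped with SpiritSight-Agent.
    `required` holds top-level import names (e.g. "flash_attn" for flash-attn).
    """
    problems = []

    # Compare only (major, minor) so that patch releases are accepted.
    if sys.version_info[:2] < tuple(min_python):
        found = ".".join(str(part) for part in sys.version_info[:2])
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted}+ expected, found {found}")

    # find_spec locates a top-level module without importing it.
    for name in required:
        if importlib.util.find_spec(name) is None:
            problems.append(f"missing package: {name}")

    return problems
```

Calling `check_environment()` inside the activated `spiritsight-agent` environment and printing the returned list gives a quick indication of whether `pip install` completed as expected before running `python infer_SSAgent-26B.py`.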

Citation

If you find this repo useful for your research, please cite our paper:

@misc{huang2025spiritsightagentadvancedgui,
      title={SpiritSight Agent: Advanced GUI Agent with One Look}, 
      author={Zhiyuan Huang and Ziming Cheng and Junting Pan and Zhaohui Hou and Mingjie Zhan},
      year={2025},
      eprint={2503.03196},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2503.03196},
}

Acknowledgments

We thank the following amazing projects that truly inspired us: