license: bsd-3-clause
pipeline_tag: video-text-to-text

E.T. Chat

arXiv | Project Page | GitHub

E.T. Chat is a novel time-sensitive Video-LLM that reformulates timestamp prediction as an embedding matching problem, serving as a strong baseline on E.T. Bench. E.T. Chat consists of a visual encoder, a frame compressor, and an LLM. A special token <vid> is introduced to trigger frame embedding matching for timestamp prediction.
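The matching idea can be illustrated with a minimal sketch. This is not the released implementation: the function names, the use of cosine similarity, and the fixed `fps` sampling rate are assumptions made for illustration only. The sketch assumes that the hidden state produced at a <vid> token is compared against per-frame embeddings, and that the best-matching frame index is converted to a time in seconds.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_timestamp(vid_embedding, frame_embeddings, fps=1.0):
    """Return (frame index, time in seconds) of the frame most similar to the query.

    Hypothetical helper: `vid_embedding` stands in for the <vid> token's hidden
    state, `frame_embeddings` for the compressed per-frame features, and `fps`
    for the rate at which frames were sampled from the video.
    """
    scores = [cosine(vid_embedding, frame) for frame in frame_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, best / fps


# Toy example: three 2-D frame embeddings sampled at 2 frames per second.
frames = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
query = [0.1, 0.9]
index, seconds = match_timestamp(query, frames, fps=2.0)
print(index, seconds)
```

Framing the problem this way turns timestamp prediction into a selection over existing frames rather than the generation of numeric tokens. Please refer to the GitHub repository for how the model actually performs the matching.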

🔖 Model Details

Model Description

  • Developed by: Ye Liu
  • Model type: Multi-modal Large Language Model
  • Language(s): English
  • License: BSD-3-Clause

Training Data

The stage-1 checkpoint of E.T. Chat was trained on the WebVid and LCS-558K datasets.

More Details

Please refer to our GitHub Repository for more details about this model.

📖 Citation

Please cite our paper if you find this project helpful.

@inproceedings{liu2024etbench,
  title={E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding},
  author={Liu, Ye and Ma, Zongyang and Qi, Zhongang and Wu, Yang and Chen, Chang Wen and Shan, Ying},
  booktitle={Neural Information Processing Systems (NeurIPS)},
  year={2024}
}