---
license: bsd-3-clause
---
# E.T. Chat
[arXiv](https://arxiv.org/abs/2409.18111) | [Project Page](https://polyu-chenlab.github.io/etbench) | [GitHub](https://github.com/PolyU-ChenLab/ETBench)
E.T. Chat is a novel time-sensitive Video-LLM that reformulates timestamp prediction as an embedding matching problem, serving as a strong baseline on E.T. Bench. E.T. Chat consists of a visual encoder, a frame compressor, and an LLM. A special token \<vid\> is introduced to trigger frame embedding matching for timestamp prediction.
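
The embedding matching idea can be illustrated with a small sketch. This is not the E.T. Chat implementation: it assumes that the embedding produced for the \<vid\> token and the compressed frame embeddings live in the same vector space, uses cosine similarity as the matching score, and assumes frames are sampled uniformly at a known rate. The names `cosine`, `match_frame`, `frame_to_timestamp`, and the `fps` parameter are hypothetical and chosen only for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_frame(vid_token_embedding, frame_embeddings):
    """Return the index of the frame most similar to the <vid> token embedding."""
    scores = [cosine(vid_token_embedding, f) for f in frame_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)


def frame_to_timestamp(frame_index, fps):
    """Convert a frame index to seconds (assumption: uniform sampling at `fps`)."""
    return frame_index / fps


# Toy data: three 2-D "frame embeddings" and one "<vid> token embedding".
frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query = [0.1, 0.9]

index = match_frame(query, frames)
seconds = frame_to_timestamp(index, fps=2.0)
print(index, seconds)
```

The point of the sketch is that the timestamp is read off from the best-matching frame rather than generated as numeric text by the language model; the actual scoring function and frame sampling used by E.T. Chat may differ.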
## Model Details
### Model Description
- **Developed by:** Ye Liu
- **Model type:** Multi-modal Large Language Model
- **Language(s):** English
- **License:** BSD-3-Clause
### Training Data
The stage-2 checkpoint of E.T. Chat was trained on the [VideoChatGPT](https://mbzuaiac-my.sharepoint.com/:f:/g/personal/hanoona_bangalath_mbzuai_ac_ae/EnLRDehrr8lGqHpC5w1zZ9QBnsiVffYy5vCv8Hl14deRcg?e=Ul5DUE) and [LLaVA-1.5-Instruct](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#visual-instruction-tuning) datasets.
### More Details
Please refer to our [GitHub Repository](https://github.com/PolyU-ChenLab/ETBench) for more details about this model.
## Citation
Please cite our paper if you find this project helpful.
```
@inproceedings{liu2024etbench,
title={E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding},
author={Liu, Ye and Ma, Zongyang and Qi, Zhongang and Wu, Yang and Chen, Chang Wen and Shan, Ying},
booktitle={Neural Information Processing Systems (NeurIPS)},
year={2024}
}
```