|
# The VILD Dataset (VIdeo and Long-Description) |
|
|
|
This dataset was introduced in [VideoCLIP-XL](https://arxiv.org/abs/2410.00741).

We establish an automatic data collection system designed to aggregate a sufficient number of high-quality (video, long description) pairs from multiple data sources.

Using this system, we have collected over 2M such pairs, which we denote as our VILD dataset.
|
|
|
# Format |
|
Each record contains a list of short captions, a list of long captions, and the corresponding video ID:

```json
{
    "short_captions": [
        "..."
    ],
    "long_captions": [
        "..."
    ],
    "video_id": "..."
}
{
    ...
}
...
```
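
If the records are stored one JSON object per line (JSON Lines), a minimal loading sketch could look like the following. This is an assumption about the on-disk layout, not a confirmed part of the release, and the file name `vild.jsonl` is a placeholder:

```python
import json

def load_vild(path):
    """Yield VILD records from a JSON Lines file (one JSON object per line)."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            yield json.loads(line)

# Example usage: print each video ID with its caption counts.
for record in load_vild("vild.jsonl"):  # hypothetical file name
    print(record["video_id"],
          len(record["short_captions"]),
          len(record["long_captions"]))
```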
|
|
|
|
|
# Source |
|
~~~
@misc{wang2024videoclipxladvancinglongdescription,
      title={VideoCLIP-XL: Advancing Long Description Understanding for Video CLIP Models},
      author={Jiapeng Wang and Chengyu Wang and Kunzhe Huang and Jun Huang and Lianwen Jin},
      year={2024},
      eprint={2410.00741},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.00741},
}
~~~