Data

Instruction Training Data

For training, we leveraged the video instruction tuning data from VideoChat2.

1. Download the JSON annotation files from Hugging Face.

Dataset meta

2. Download the raw videos from the following links.

The video directories can be found in tasks/train/instruction_data.py. You can also change them to your own saved paths.

3. We also provide our processed JSON annotation files here.

Dataset meta

Evaluation Data & Others

Follow this section to obtain the open evaluation resources.

VCGBench

We refer to the VideoChatGPT video question answering evaluation as VCGBench in this repo. We followed the original repo to prepare the evaluation data.

MVBench

We follow the original VideoChat2 repo to set up the MVBench evaluation. You can also find helpful resources in their Hugging Face repo.

Videoqabench

We refer to all other video question answering benchmarks as videoqabench in this repo. They are mainly prepared following the original repos, listed below:

  1. MSVD & MSRVTT

  2. Activity Net

  3. TGIF

Other fantastic repos that integrate these benchmarks were also helpful when setting up the evaluation data:

Recaptioning

Inter4K

This dataset contains 1000 high-resolution video samples. We prepare the data following the instructions on their official website.

Extending Recaptioning

The recaptioning part is designed to be extendable.

The inference script tasks/eval/recaption/pllava_recaption.py uses the dataset class RecaptionDataset. The details of each dataset are kept in its data_list_info attribute:

data_list_info = OrderedDict({
        # "Panda70M": OrderedDict(
        #     json_relpath="Panda70M/annotations.json", 
        #     prefix="DATAS/Recaption/Panda70M/videos", 
        #     data_type="video", 
        #     bound=False,
        #     key_rename_map={
        #         # 'caption': 'hint',
        #     },
        #     name_key='video_name',
        #     postfix=('mp4', 'mkv', 'webm'),
        #     recaption_type=RecaptionSample,
        # ), # has no start & end
        "Inter4K": OrderedDict(
            json_relpath="Inter4K/annotations.json", 
            prefix="DATAS/Recaption/Inter4K/60fps/UHD", 
            data_type="video", 
            bound=False,
            key_rename_map={
                # 'caption': 'hint',
            },
            name_key='video_name',
            postfix=('mp4', 'mkv', 'webm'),
            recaption_type=CaptionSample,
        ), # has no start & end
    })

Each entry contains the path to an annotation JSON file. The file holds a list, and each item of the list is a sample waiting to be captioned. For example, Inter4K/annotations.json looks like:

[
    {
        "video_name": "973"
    },
    ...
]

and the directory DATAS/Recaption/Inter4K/60fps/UHD would look like:

$ ls DATAS/Recaption/Inter4K/60fps/UHD
1.mp4 134.mp4  170.mp4 ....
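
As a rough illustration of how the three pieces fit together, the sketch below maps a "video_name" from the annotation file to a file under the "prefix" directory by trying each extension in "postfix". The helper name resolve_video_path is hypothetical, and this is an assumption about the lookup, not the actual RecaptionDataset code.

```python
from pathlib import Path


def resolve_video_path(prefix, video_name, postfix=("mp4", "mkv", "webm")):
    """Return the first existing <prefix>/<video_name>.<ext>, or None.

    Hypothetical helper: the real dataset class may resolve paths differently.
    """
    for ext in postfix:
        candidate = Path(prefix) / f"{video_name}.{ext}"
        if candidate.is_file():
            return candidate
    return None


if __name__ == "__main__":
    # Path taken from the Inter4K entry above; prints None if the file is absent.
    print(resolve_video_path("DATAS/Recaption/Inter4K/60fps/UHD", "973"))
```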

In the simplest case, captioning needs only the video itself, so the annotation file only has to list the name of each video under the "prefix" directory.
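
Because the annotation file only lists video names, it can be generated by scanning the video directory. The following is a minimal sketch under that assumption. The function names build_annotations and write_annotations are hypothetical and not part of the repo, and the name of each video is taken to be its file name without the extension, as in the Inter4K example.

```python
import json
from pathlib import Path

# Mirrors the postfix tuple used in the Inter4K entry.
VIDEO_EXTENSIONS = ("mp4", "mkv", "webm")


def build_annotations(video_dir, extensions=VIDEO_EXTENSIONS):
    """Return one {"video_name": <file stem>} record per video in video_dir."""
    allowed = {"." + ext.lower() for ext in extensions}
    # A set avoids duplicate names when one stem exists with two extensions.
    names = sorted({
        p.stem
        for p in Path(video_dir).iterdir()
        if p.is_file() and p.suffix.lower() in allowed
    })
    return [{"video_name": name} for name in names]


def write_annotations(video_dir, json_path, extensions=VIDEO_EXTENSIONS):
    """Write the records to json_path and return them."""
    records = build_annotations(video_dir, extensions)
    Path(json_path).parent.mkdir(parents=True, exist_ok=True)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=4)
    return records
```

For instance, write_annotations("DATAS/Recaption/Inter4K/60fps/UHD", "Inter4K/annotations.json") would produce a file in the format shown above. Where the JSON file must live relative to the dataset root depends on how json_relpath is resolved in the repo.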

Extending recaptioning to a new dataset consists of the following steps:

  1. Download all the videos.
  2. Construct an annotation.json file in the format shown above.
  3. Configure the recaption dataset here, where you need to set:
    • json_relpath: the relative path of the annotation file
    • prefix: the root directory of the videos
    • postfix: all the file extensions used by these videos

The other options are experimental, so stick with the default settings used for Inter4K. The recommended video length is around 5-20 seconds.
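
As an illustration, an entry for a new dataset might look like the sketch below. The dataset name "MyClips" and its paths are hypothetical. Only json_relpath, prefix and postfix differ from the Inter4K entry. The CaptionSample class here is an empty stand-in so the snippet runs on its own. In the repo, the entry would be added to the existing data_list_info alongside Inter4K and would use the repo's own CaptionSample.

```python
from collections import OrderedDict


class CaptionSample:
    """Stand-in for the repo's CaptionSample class (assumption, for illustration)."""


data_list_info = OrderedDict({
    # Hypothetical dataset name and paths.
    "MyClips": OrderedDict(
        json_relpath="MyClips/annotations.json",   # annotation file, relative path
        prefix="DATAS/Recaption/MyClips/videos",   # root directory of the videos
        data_type="video",
        bound=False,                               # keep False, as for Inter4K
        key_rename_map={},
        name_key="video_name",
        postfix=("mp4",),                          # extensions present in prefix
        recaption_type=CaptionSample,
    ),
})
```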

P.S. "bound" is meant to ensure that the video passed to the model contains no scene transitions. This part has not been tested, so set bound to False and make sure each original video file is a single clip. Feel free to explore and contribute to PLLaVA!