---
license: apache-2.0
tags:
- video LLM
---
# Tarsier Model Card
## Introduction
Tarsier2-Recap-7b is built upon [Qwen2-VL-7B-Instruct](https://huggingface.co./Qwen/Qwen2-VL-7B-Instruct) by distilling the video description capabilities of Tarsier2-7b. Specifically, we fine-tuned Qwen2-VL-7B-Instruct on [Tarsier2-Recap-585K](https://huggingface.co./datasets/omni-research/Tarsier2-Recap-585K) for 2 epochs with a learning rate of 2e-5. Tarsier2-Recap-7b has video captioning ability similar to Tarsier2-7b, reaching an overall F1 score of 40.7% on [DREAM-1K](https://tarsier-vlm.github.io/), second only to Tarsier2-7b (42.0%) and ahead of GPT-4o (39.2%). See the [Tarsier2 technical report](https://arxiv.org/abs/2501.07888) for more details.
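For illustration only, the stated recipe (2 epochs, learning rate 2e-5, starting from Qwen2-VL-7B-Instruct) could be expressed as Hugging Face `TrainingArguments`; the batch size, precision, and scheduler below are assumptions, not the actual training configuration.

```python
# Hypothetical sketch of the stated recipe; only num_train_epochs and learning_rate
# come from this card, everything else is an assumed placeholder.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="tarsier2-recap-7b",
    num_train_epochs=2,              # from this card
    learning_rate=2e-5,              # from this card
    per_device_train_batch_size=1,   # assumption
    gradient_accumulation_steps=16,  # assumption
    bf16=True,                       # assumption
    lr_scheduler_type="cosine",      # assumption
    logging_steps=10,
)
```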
## Model details
- Base Model: [Qwen2-VL-7B-Instruct](https://huggingface.co./Qwen/Qwen2-VL-7B-Instruct)
- Training Data: [Tarsier2-Recap-585K](https://huggingface.co./datasets/omni-research/Tarsier2-Recap-585K)
**Model date:**
Tarsier2-Recap-7b was trained in December 2024.
**Paper or resources for more information:**
- github repo: https://github.com/bytedance/tarsier/tree/tarsier2
- paper link: https://arxiv.org/abs/2501.07888
- leaderboard: https://tarsier-vlm.github.io/
## License
This model is released under the license of its base model, [Qwen/Qwen2-VL-7B-Instruct](https://huggingface.co./Qwen/Qwen2-VL-7B-Instruct).
## Intended use
**Primary intended uses:**
The primary use of Tarsier is research on large multimodal models, especially video description.
**Primary intended users:**
The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.
## Model Performance
### Video Description
We evaluate Tarsier2-Recap-7b on DREAM-1K, a detailed video description benchmark featuring dynamic and diverse videos, assessing the model’s ability to describe fine-grained actions and events. Here is the evaluation result:
![images](./assets/dream-1k_results.png)
_Note: The results of Tarsier2-Recap-7b differ from those reported in Table 11 of the [Tarsier2 technical report](https://arxiv.org/abs/2501.07888), as the released Tarsier2-Recap-7b is more fully trained (2 epochs vs. 1 epoch)._
### Video Question-Answering
We evaluate Tarsier2-Recap-7b on [TVBench](https://paperswithcode.com/sota/video-question-answering-on-tvbench), a multiple-choice video question-answering benchmark that requires a high level of temporal understanding. As Tarsier2-Recap-7b is trained only on video caption data, it needs an additional prompt to induce it to perform multiple-choice question answering; see the [TVBench](https://github.com/bytedance/tarsier/blob/tarsier2/data/annotations/TVBench.jsonl) samples for the format we use (a hypothetical sketch also follows the table below). Here is the evaluation result:
| Task | Tarsier2-Recap-7b (acc. %) | Tarsier2-7b (acc. %) |
| ------- | :--------: | :-------: |
| Action Antonym | 91.2 | 94.1 |
| Action Count | 43.1 | 40.5 |
| Action Localization | 42.5 | 37.5 |
| Action Sequence | 70.5 | 72.3 |
| Egocentric Sequence | 22.0 | 24.5 |
| Moving Direction | 37.1 | 33.2 |
| Object Count | 46.6 | 62.8 |
| Object Shuffle | 36.9 | 31.6 |
| Scene Transition | 85.9 | 88.1 |
| Unexpected Action | 28.0 | 41.5 |
| OVERALL | 54.0 | 54.7 |
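As a rough illustration of the "additional prompt" mentioned above, a hypothetical multiple-choice prompt might look like the sketch below; the exact prompts actually used for evaluation are those in the linked TVBench.jsonl annotations.

```python
# Hypothetical multiple-choice prompt; the actual format used for evaluation is in the
# TVBench.jsonl annotations linked above.
question = "What is the moving direction of the red car?"
options = ["A. Left", "B. Right", "C. Towards the camera", "D. Away from the camera"]

prompt = (
    "Watch the video and answer the following multiple-choice question.\n"
    f"Question: {question}\n"
    + "\n".join(options)
    + "\nAnswer with the option's letter only."
)
print(prompt)
```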
## How to Use
See https://github.com/bytedance/tarsier/tree/tarsier2?tab=readme-ov-file#usage (make sure to check out the **tarsier2** branch).
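For a quick start with plain `transformers`, a minimal sketch is shown below. It assumes the model loads through the standard Qwen2-VL interfaces (it is fine-tuned from Qwen2-VL-7B-Instruct) and that the Hub id is `omni-research/Tarsier2-Recap-7b`; the official inference code in the tarsier2 branch is the authoritative reference.

```python
# Minimal inference sketch, assuming the standard Qwen2-VL interfaces apply.
# Requires: pip install transformers qwen-vl-utils
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_id = "omni-research/Tarsier2-Recap-7b"  # assumed Hub id

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "path/to/video.mp4"},
        {"type": "text", "text": "Describe the video in detail."},
    ],
}]

# Build the chat-formatted text and extract video frames for the processor.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate and strip the prompt tokens before decoding.
generated_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```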
**Where to send questions or comments about the model:**
https://github.com/bytedance/tarsier/issues