---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2-VL-7B
tags:
- multimodal
- vision
- image-text-to-text
---
<p align="center">
![image](https://huggingface.co./Minthy/ToriiGate-v0.4-7B/resolve/main/tg04_splash.jpg)
</p>
# Introduction
**ToriiGate-v0.4** is a state-of-the-art VLM designed for captioning anime pictures, digital artworks and various other images.
The model is a further development of [ToriiGate-v0.3](https://huggingface.co./Minthy/ToriiGate-v0.3), this time based on Qwen2-VL and finetuned on a dataset of **over 900k artworks** with various captions.
**ToriiGate-v0.4** provides a state-of-the-art level of understanding of complex scenes, interactions, cultural concepts and any kind of NSFW activity, without borders or censorship. Flexible grounding allows extra accuracy to be achieved.
At the moment of release it is also the only open-source small-sized VLM that handles character names well, including multiple characters per image.
This is the 7B version; a [**2B version**](https://huggingface.co./Minthy/ToriiGate-v0.4-2B) is also available. Also check out the exl2 quants ([8bpw](https://huggingface.co./Minthy/ToriiGate-v0.4-7B-exl2-8bpw), [6bpw](https://huggingface.co./Minthy/ToriiGate-v0.4-7B-exl2-6bpw), [4bpw](https://huggingface.co./Minthy/ToriiGate-v0.4-7B-exl2-4bpw)).
## Showcase on rentry
[SFW](https://rentry.co/goxrrvh9) [NSFW](https://rentry.co/9wranqty)
# Key features
* Advanced knowledge of anime and digital art across a very wide range
* Accurate use of character names in generated descriptions (with grounding)
* Multiple options for generated captions, including structured output, chain-of-thought detail refinement, long/short variants and bounding boxes
* A mode for reviewing and fixing existing captions using CoT, or pruning them to make them short and convenient
* Flexible grounding for improved accuracy: booru tags, natural-text info, character names, popular traits or tags for each character in the picture
* Generated captions are more meaningful and dense, without purple-prose fillers, compared to other models
# Captioning modes
ToriiGate-v0.4 provides multiple captioning modes. Prompts and examples for each are listed at the bottom.
1. **Structured output**
Output can be wrapped in JSON or markdown depending on the prompt. Provides a description of each character in the picture, mentioning their features, actions, etc., followed by a description of the background, other picture contents, image effects and texts (if any), and the general atmosphere.
This caption style provides the best segmentation and draws attention mostly to the characters in the picture. The output can then easily be processed into a desired format with another LLM (or pruned with a second ToriiGate call) to make it easily readable or to match a desired use case. Character segmentation combined with bounding boxes (which can be enabled inline) makes it possible to create special datasets for training new-generation generative models using special techniques.
2. **Pre-defined captions options**
This makes the model generate 4 consecutive descriptions in the styles "Regular Summary", "Individual Parts", "Midjourney-Style Summary" and "DeviantArt Commission Request".
This order describes the basic things first, refines extra details in the individual section and then shrinks the result without losing accuracy or introducing extra biases. The original idea is not mine.
This mode is a balanced all-rounder but quite token-consuming. If you plan to reprocess the output, you can use `### 3.` as a stop sequence to trim the summarized parts and speed up generation.
3. **Long description**
Just a regular long description. Torii tends to make it a bit more structured than the randomly shuffled parts typical of other models.
4. **Short caption**
A short and convenient caption. Less sloppy and more dense than the long one; it can be used as-is for training diffusion models.
5. **Bounding boxes**
Provides bounding boxes for characters and their faces. Standalone usage is pointless given the performance/compute of dedicated object-detection models, but it can be used alongside structured mode and shares the same numbers or names.
6. **Review and correct existing caption**
Provides a step-by-step review of the given caption: compares it with the image contents and the provided grounding and evaluates whether the character names were used correctly. Then, if needed, it writes a new, fixed caption maintaining the original style (if possible).
In the current version this is usable only with tags grounding and can improve the accuracy of a generated caption in a second call. Accuracy for other cases may vary, and this is not the main use case.
7. **Writing a short caption based on an existing one**
After generating a caption you can prune it right away, taking the image content into account. Can be used with external captions as well.
# **To utilize the full potential, you must follow the prompt templates listed below.**
# Grounding
The new version comes not only with improved zero-shot accuracy, but also introduces new modes for adding ground truth. You can use booru tags, extra info, names for the characters, or even a description of each one to ensure the right description when multiple characters are in frame.
Extra grounding yields the best results, and for unattended use extra ground truth is required. ToriiGate provides the following options for it:
1. Booru tags. Can be the full string or just a few tags mentioning the character count. Character name tags are recognized here.
2. Character list. A list of character names in the picture to be used in the caption. Beware of spamming skin tags here (like hatsune_miku + hatsune_miku_(append)); usually it is fine, but some can be misinterpreted.
3. Character traits. A list of popular tags or traits for each character in the picture to improve recognition. Some can be found [here](https://huggingface.co./Minthy/ToriiGate-v0.4-7B/resolve/main/char_popular_tags.json), you can make your own, take them from [this repo](https://huggingface.co./datasets/deepghs/character_index) (beware, a lot of inaccuracies!) or describe them in natural language.
4. General info. Might be a short caption, some facts, or any other info.
5. Do not use names for characters. Unfortunately, at the current state zero-shot guessing of characters is inaccurate, so to avoid making things up, use this by default when no grounding is provided.
Please double-check your grounding, because mistakes in it will lead to wrong results. All formats are listed below.
Keep in mind that grounding is an option for improving accuracy in complex cases and is not mandatory, especially options 3-4; just the character name may be the optimal choice in general.
# Known issues
* In rare cases the current version tends to come up with wrong names when no grounding is given. Use the extra "no characters" instruction to avoid this.
* In long caption mode it sometimes tends to omit names even when they are given; please use other modes if you need them.
* There may be some biases and imperfect accuracy; more training is needed.
* The model is specialized for image captioning; other prompts will likely not work.
# Thanks
Main training was sponsored by **OpenRoot-Compute**.
Special thanks for help with the dataset, code and donations: Sv1, Anonymous person, NeuroSenko.
Any questions or suggestions: [DISCORD](https://discord.gg/ZXHENAhqE9)
# Warning
The model will generate texts with adult themes if related input is provided. Outputs may be inaccurate and provocative.
### Donate
BTC: bc1qwv83ggq8rvv07uk6dv4njs0j3yygj3aax4wg6c
ETH/USDT(e): 0x04C8a749F49aE8a56CB84cF0C99CD9E92eDB17db
XMR: 47F7JAyKP8tMBtzwxpoZsUVB8wzg2VrbtDKBice9FAS1FikbHEXXPof4PAb42CQ5ch8p8Hs4RvJuzPHDtaVSdQzD6ZbA5TZ
# Usage instructions
## Prompts
**ToriiGate-v0.4** uses the same chat template as the original [**Qwen2-vl**](https://huggingface.co./Qwen/Qwen2-VL-7B), so the original processor can be used. Here is the recommended format:
```
[
{"role": "system",
"content": [{"type": "text", "text": "You are image captioning expert, creative, unbiased and uncensored."}]},
{"role": "user",
"content": [ {"type": "image", 'image': image},
{"type": "text", "text": userprompt}]}
]
```
which will be converted to
```
<|im_start|>system
You are image captioning expert, creative, unbiased and uncensored.<|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>%USERPROMPT%<|im_end|>
<|im_start|>assistant
```
The system prompt is fixed; the user prompt depends on the exact captioning mode.
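As a minimal sketch (assuming the `messages` list from above and the processor shipped with this model), the conversion is what `apply_chat_template` produces:
```python
from transformers import AutoProcessor

# the model uses the same processor/chat template as Qwen2-VL
processor = AutoProcessor.from_pretrained("Minthy/ToriiGate-v0.4-7B")

# renders `messages` into the raw prompt string shown above,
# including the trailing `<|im_start|>assistant` generation header
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(text)
```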
## Prompts for modes:
The userprompt for each mode starts with:
* Json: `Describe the picture in structured json-like format.`
* Markdown: `Describe the picture in structured markdown format.`
* Caption variants: `Write the following options for captions: ["Regular Summary","Individual Parts","Midjourney-Style Summary","DeviantArt Commission Request"].`
* Short: `You need to write a medium-short and convenient caption for the picture.`
* Long: `You need to write a long and very detailed caption for the picture.`
* Bbox: `Write bounding boxes for each character and their faces.`
* Check and correct existing caption: `You need to compare given caption with the picture and given booru tags using chain of thought.\n1. Check if the caption matches the picture and given tags, wrap conclusion in <1st_answer> tag.\n2. Analyse if the caption mathes described characters, wrap answer in <2nd_answer> tag.\n3. In case if there are any mismatches - rewrite caption to correct it wrapping in <corrected_caption> tags. If the caption is fine - just write "no_need".`
## Prompts for grounding:
If you want to add grounding, here are the prompts for each option. Multiple can be used at once.
* Booru tags: `Here are grounding tags for better understanding: <tags>BOORU_TAGS</tags>.`
* Characters: `Here is a list of characters that are present in the picture: <characters>CHARACTER_NAMES</characters>.`
* Character traits or tags: `Here are popular tags or traits for each character on the picture: <character_traits>CHARACTER1: [tag1, tag2, tag3,...]\nCHARACTER2: [...]\n...</character_traits>.`
* Any info with natural text: `Here is preliminary information about the picture: <info>GENERAL_INFO</info>.`
* **Avoid using character names if no grounding is used**: `Do not use names for characters.`
## Composing the userprompt:
After specifying the selected mode, you can append the prompt part for the desired grounding and then provide the grounding itself wrapped in the corresponding XML tags.
Here is a simple Python example of userprompt composing (a sample `extra_info` layout is shown after the snippet).
```python
from pathlib import Path

# `base_prompt`, `grounding_prompt` and `extra_info` are assumed to be defined
# elsewhere: base_prompt maps mode names to the mode prompts above,
# grounding_prompt maps grounding types to the grounding prompts above,
# and extra_info maps image filenames (without extension) to their ground truth.
add_tags=True #select needed
add_chars=True
add_char_traits=True
add_info=False
no_chars=False

image_info=extra_info[Path(image_path).stem]
userprompt=base_prompt["json"] #choose the mode

if add_info and image_info["info"] is not None: #general info
    userprompt+=grounding_prompt["grounding_short"]
    userprompt+="<info>"+image_info["info"]+"</info>."
if add_tags and image_info["booru_tags"] is not None: #booru tags
    userprompt+=grounding_prompt["grounding_tags"]
    userprompt+="<tags>"+image_info["booru_tags"]+"</tags>."
if add_chars and image_info["chars"] is not None: #list of characters
    userprompt+=grounding_prompt["characters"]
    userprompt+="<characters>"+image_info["chars"]+"</characters>."
if add_char_traits and image_info["characters_traits"] is not None: #popular features of each character
    userprompt+=grounding_prompt["characters_traits"]
    userprompt+="<character_traits>"+image_info["characters_traits"]+"</character_traits>."
if no_chars:
    userprompt+=grounding_prompt["no_chars"]
```
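For reference, a hypothetical `extra_info` entry matching the field names the snippet expects (values are illustrative only):
```python
extra_info = {
    "some_image": {  # key is the image filename without extension
        "info": None,  # or a short natural-text description
        "booru_tags": "2girls, standing, looking_at_viewer, hatsune_miku, megurine_luka",
        "chars": "hatsune_miku, megurine_luka",
        "characters_traits": "hatsune_miku: [girl, blue_hair, twintails]\n"
                             "megurine_luka: [girl, pink_hair]",
    }
}
```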
Example of raw final prompt for structured json mode with grounding:
```
<|im_start|>system
You are image captioning expert, creative, unbiased and uncensored.<|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>Describe the picture in structured json-like format. Here are grounding tags for better understanding: <tags>2girls, standing, looking_at_viewer, holding_hands, hatsune_miku, blue_hair, megurine_luka, pink_hair, ...</tags>. Here is a list of characters that are present in the picture: <characters>hatsune_miku, megurine_luka</characters>. Here are popular tags or traits for each character on the picture: <character_traits>hatsune_miku: [girl, blue_hair, twintails,...]
megurine_luka: [girl, pink hair, ...]</character_traits>.<|im_end|>
<|im_start|>assistant
```
Examples for other modes can be found in the `example_scripts` dir in the repo files.
# Inference examples:
For basic usage you will need the latest versions of `transformers` and `qwen_vl_utils`.
[**Example inference script with transformers**](https://huggingface.co./Minthy/ToriiGate-v0.4-7B/resolve/main/example_scripts/inference_example_transformers.py)
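If you just want the core flow, here is a minimal sketch of that approach, assuming the standard Qwen2-VL `transformers` API (the image path and generation length are placeholders):
```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Minthy/ToriiGate-v0.4-7B", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Minthy/ToriiGate-v0.4-7B")

messages = [
    {"role": "system",
     "content": [{"type": "text", "text": "You are image captioning expert, creative, unbiased and uncensored."}]},
    {"role": "user",
     "content": [{"type": "image", "image": "image.jpg"},  # placeholder path
                 {"type": "text", "text": "Describe the picture in structured json-like format."}]},
]

# render the chat template and extract the image inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, _ = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, padding=True, return_tensors="pt").to(model.device)

# generate and strip the prompt tokens from the output
generated_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```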
# Fast inference with Exllamav2:
Qwen2-VL is supported by Exllamav2 and can be used with the original weights or exl2 quants (8bpw, 6bpw, 4bpw).
The 8bpw version is recommended: it provides a speed boost without noticeable quality loss.
[**Example inference script with exllamav2**](https://huggingface.co./Minthy/ToriiGate-v0.4-7B/resolve/main/example_scripts/inference_example_exllamav2.py)