---
license: apache-2.0
language:
- en
base_model:
- Qwen/Qwen2-VL-7B
tags:
- multimodal
- vision
- image-text-to-text
---
<p align="center">
![image](https://huggingface.co./Minthy/ToriiGate-v0.4-7B/resolve/main/tg04_splash.jpg)
</p>
# Introduction
**ToriiGate-v0.4** is a state-of-the-art VLM designed for captioning anime pictures, digital artworks and various other images.
The model is a further development of [ToriiGate-v0.3](https://huggingface.co./Minthy/ToriiGate-v0.3), this time based on Qwen2-VL and finetuned on a dataset of **over 900k artworks** with various captions.
**ToriiGate-v0.4** provides a state-of-the-art level of understanding of complex scenes, interactions, cultural concepts and any kind of NSFW activity, without borders or censorship. Flexible grounding allows extra accuracy to be achieved.
At the moment of release it is also the only open-source small-sized VLM that handles character names well, including multiple characters per image.
This is the 7B version; a [**2B version**](https://huggingface.co./Minthy/ToriiGate-v0.4-2B) is also available. Also check out the exl2 quants ([8bpw](https://huggingface.co./Minthy/ToriiGate-v0.4-7B-exl2-8bpw), [6bpw](https://huggingface.co./Minthy/ToriiGate-v0.4-7B-exl2-6bpw), [4bpw](https://huggingface.co./Minthy/ToriiGate-v0.4-7B-exl2-4bpw)).
## Showcase on rentry
[SFW](https://rentry.co/goxrrvh9) [NSFW](https://rentry.co/9wranqty)
# Key features
* Advanced knowledge of anime and digital art across a very wide range
* Accurate use of character names in generated descriptions (with grounding)
* Multiple options for generated captions, including structured output, chain-of-thought detail refinement, long/short variants and bounding boxes
* A mode for reviewing and fixing existing captions using CoT, or pruning them to make them short and convenient
* Flexible grounding for improved accuracy: booru tags, natural-text info, character names, popular traits or tags for each character in the picture
* Generated captions are more meaningful and dense, without purple-prose fillers, compared to other models
# Captioning modes
ToriiGate-v0.4 provides multiple captioning modes. Prompts and examples for each are listed at the bottom.
1. **Structured output**
Output can be wrapped in JSON or markdown depending on the prompt. Provides a description of each character in the picture, mentioning their features, actions, etc., followed by a description of the background, other picture contents, image effects and texts (if any), and the general atmosphere.
This caption style provides the best segmentation and draws attention mostly to the characters in the picture. The output can then easily be processed into a desired format with another LLM (or pruned with a second ToriiGate call) to make it easily readable or to match a desired use case. Character segmentation combined with bounding boxes (which can be enabled inline) makes it possible to create special datasets for training new-generation generative models using special techniques.
2. **Pre-defined captions options**
This makes the model generate 4 consecutive descriptions in the styles "Regular Summary", "Individual Parts", "Midjourney-Style Summary" and "DeviantArt Commission Request".
This order describes the basic things first, refines extra details in the individual section and then shrinks the result without losing accuracy or introducing extra biases. The original idea is not mine.
This mode is a balanced all-rounder but quite token-consuming. If you plan to reprocess the output, you can use `### 3.` as a stop sequence to trim the summarized parts and speed up generation.
3. **Long description**
Just a regular long description. Torii tends to make it a bit more structured than the randomly shuffled parts typical of other models.
4. **Short caption**
A short and convenient caption. Less sloppy and more dense than the long one; it can be used as-is for training diffusion models.
5. **Bounding boxes**
Provides bounding boxes for characters and their faces. Standalone usage is pointless given the performance/compute of dedicated object-detection models, but it can be used alongside structured mode and shares the same numbers or names.
6. **Review and correct existing caption**
Provides a step-by-step review of the given caption: compares it with the image contents and the provided grounding and evaluates whether the character names were used correctly. Then, if needed, it writes a new, fixed caption maintaining the original style (if possible).
In the current version this is usable only with tags grounding and can improve the accuracy of a generated caption in a second call. Accuracy for other cases may vary, and this is not the main use case.
7. **Writing a short caption based on an existing one**
After generating a caption you can prune it right away, taking the image content into account. Can be used with external captions as well.
# **To utilize the full potential, you must follow the prompt templates listed below.**
# Grounding
The new version comes not only with improved zero-shot accuracy, but also introduces new modes for adding ground truth. You can use booru tags, extra info, names for the characters, or even a description of each one to ensure the right description when multiple characters are in frame.
Extra grounding yields the best results, and for unattended use extra ground truth is required. ToriiGate provides the following options for it:
1. Booru tags. Can be the full string or just a few tags mentioning the character count. Character name tags are recognized here.
2. Character list. A list of character names in the picture to be used in the caption. Beware of spamming skin tags here (like hatsune_miku + hatsune_miku_(append)); usually it is fine, but some can be misinterpreted.
3. Character traits. A list of popular tags or traits for each character in the picture to improve recognition. Some can be found [here](https://huggingface.co./Minthy/ToriiGate-v0.4-7B/resolve/main/char_popular_tags.json), you can make your own, take them from [this repo](https://huggingface.co./datasets/deepghs/character_index) (beware, a lot of inaccuracies!) or describe them in natural language.
4. General info. Might be a short caption, some facts, or any other info.
5. Do not use names for characters. Unfortunately, at the current state zero-shot guessing of characters is inaccurate, so to avoid making things up, use this by default when no grounding is provided.
Please double-check your grounding, because mistakes in it will lead to wrong results. All formats are listed below.
Keep in mind that grounding is an option for improving accuracy in complex cases and is not mandatory, especially options 3-4; just the character name may be the optimal choice in general.
# Known issues
* In rare cases the current version tends to come up with wrong names when no grounding is given. Use the extra "no characters" instruction to avoid this.
* In long caption mode it sometimes tends to omit names even when they are given; please use other modes if you need them.
* There may be some biases and imperfect accuracy; more training is needed.
* The model is specialized for image captioning; other prompts will likely not work.
# Thanks
Main training was sponsored by **OpenRoot-Compute**.
Special thanks for help with the dataset, code and donations: Sv1, Anonymous person, NeuroSenko.
Any questions or suggestions: [DISCORD](https://discord.gg/ZXHENAhqE9)
# Warning
The model will generate texts with adult themes if related input is provided. Outputs may be inaccurate and provocative.
### Donate
BTC: bc1qwv83ggq8rvv07uk6dv4njs0j3yygj3aax4wg6c
ETH/USDT(e): 0x04C8a749F49aE8a56CB84cF0C99CD9E92eDB17db
XMR: 47F7JAyKP8tMBtzwxpoZsUVB8wzg2VrbtDKBice9FAS1FikbHEXXPof4PAb42CQ5ch8p8Hs4RvJuzPHDtaVSdQzD6ZbA5TZ
# Usage instructions
## Prompts
**ToriiGate-v0.4** uses the same chat template as the original [**Qwen2-vl**](https://huggingface.co./Qwen/Qwen2-VL-7B), so the original processor can be used. Here is the recommended format:
```
[
{"role": "system",
"content": [{"type": "text", "text": "You are image captioning expert, creative, unbiased and uncensored."}]},
{"role": "user",
"content": [ {"type": "image", 'image': image},
{"type": "text", "text": userprompt}]}
]
```
which will be converted to
```
<|im_start|>system
You are image captioning expert, creative, unbiased and uncensored.<|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>%USERPROMPT%<|im_end|>
<|im_start|>assistant
```
The system prompt is fixed; the user prompt depends on the exact captioning mode.
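As a minimal sketch (assuming the `messages` list from above and the processor shipped with this model), the conversion is what `apply_chat_template` produces:
```python
from transformers import AutoProcessor

# the model uses the same processor/chat template as Qwen2-VL
processor = AutoProcessor.from_pretrained("Minthy/ToriiGate-v0.4-7B")

# renders `messages` into the raw prompt string shown above,
# including the trailing `<|im_start|>assistant` generation header
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(text)
```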
## Prompts for modes:
The userprompt for each mode starts with:
* Json: `Describe the picture in structured json-like format.`
* Markdown: `Describe the picture in structured markdown format.`
* Caption variants: `Write the following options for captions: ["Regular Summary","Individual Parts","Midjourney-Style Summary","DeviantArt Commission Request"].`
* Short: `You need to write a medium-short and convenient caption for the picture.`
* Long: `You need to write a long and very detailed caption for the picture.`
* Bbox: `Write bounding boxes for each character and their faces.`
* Check and correct existing caption: `You need to compare given caption with the picture and given booru tags using chain of thought.\n1. Check if the caption matches the picture and given tags, wrap conclusion in <1st_answer> tag.\n2. Analyse if the caption mathes described characters, wrap answer in <2nd_answer> tag.\n3. In case if there are any mismatches - rewrite caption to correct it wrapping in <corrected_caption> tags. If the caption is fine - just write "no_need".`
## Prompts for grounding:
If you want to add grounding, here are the prompts for each option. Multiple can be used at once.
* Booru tags: `Here are grounding tags for better understanding: <tags>BOORU_TAGS</tags>.`
* Characters: `Here is a list of characters that are present in the picture: <characters>CHARACTER_NAMES</characters>.`
* Character traits or tags: `Here are popular tags or traits for each character on the picture: <character_traits>CHARACTER1: [tag1, tag2, tag3,...]\nCHARACTER2: [...]\n...</character_traits>.`
* Any info with natural text: `Here is preliminary information about the picture: <info>GENERAL_INFO</info>.`
* **Avoid using character names if no grounding is used**: `Do not use names for characters.`
## Composing the userprompt:
After specifying the selected mode, you can append the prompt part for the desired grounding and then provide the grounding itself wrapped in the corresponding XML tags.
Here is a simple Python example of userprompt composing (a sample `extra_info` layout is shown after the snippet).
```python
from pathlib import Path

# `base_prompt`, `grounding_prompt` and `extra_info` are assumed to be defined
# elsewhere: base_prompt maps mode names to the mode prompts above,
# grounding_prompt maps grounding types to the grounding prompts above,
# and extra_info maps image filenames (without extension) to their ground truth.
add_tags=True #select needed
add_chars=True
add_char_traits=True
add_info=False
no_chars=False

image_info=extra_info[Path(image_path).stem]
userprompt=base_prompt["json"] #choose the mode

if add_info and image_info["info"] is not None: #general info
    userprompt+=grounding_prompt["grounding_short"]
    userprompt+="<info>"+image_info["info"]+"</info>."
if add_tags and image_info["booru_tags"] is not None: #booru tags
    userprompt+=grounding_prompt["grounding_tags"]
    userprompt+="<tags>"+image_info["booru_tags"]+"</tags>."
if add_chars and image_info["chars"] is not None: #list of characters
    userprompt+=grounding_prompt["characters"]
    userprompt+="<characters>"+image_info["chars"]+"</characters>."
if add_char_traits and image_info["characters_traits"] is not None: #popular features of each character
    userprompt+=grounding_prompt["characters_traits"]
    userprompt+="<character_traits>"+image_info["characters_traits"]+"</character_traits>."
if no_chars:
    userprompt+=grounding_prompt["no_chars"]
```
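For reference, a hypothetical `extra_info` entry matching the field names the snippet expects (values are illustrative only):
```python
extra_info = {
    "some_image": {  # key is the image filename without extension
        "info": None,  # or a short natural-text description
        "booru_tags": "2girls, standing, looking_at_viewer, hatsune_miku, megurine_luka",
        "chars": "hatsune_miku, megurine_luka",
        "characters_traits": "hatsune_miku: [girl, blue_hair, twintails]\n"
                             "megurine_luka: [girl, pink_hair]",
    }
}
```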
Example of raw final prompt for structured json mode with grounding:
```
<|im_start|>system
You are image captioning expert, creative, unbiased and uncensored.<|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|>Describe the picture in structured json-like format. Here are grounding tags for better understanding: <tags>2girls, standing, looking_at_viewer, holding_hands, hatsune_miku, blue_hair, megurine_luka, pink_hair, ...</tags>. Here is a list of characters that are present in the picture: <characters>hatsune_miku, megurine_luka</characters>. Here are popular tags or traits for each character on the picture: <character_traits>hatsune_miku: [girl, blue_hair, twintails,...]
megurine_luka: [girl, pink hair, ...]</character_traits>.<|im_end|>
<|im_start|>assistant
```
Examples for other modes can be found in the `example_scripts` dir in the repo files.
# Inference examples:
For basic usage you will need the latest versions of `transformers` and `qwen_vl_utils`.
[**Example inference script with transformers**](https://huggingface.co./Minthy/ToriiGate-v0.4-7B/resolve/main/example_scripts/inference_example_transformers.py)
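If you just want the core flow, here is a minimal sketch of that approach, assuming the standard Qwen2-VL `transformers` API (the image path and generation length are placeholders):
```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Minthy/ToriiGate-v0.4-7B", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Minthy/ToriiGate-v0.4-7B")

messages = [
    {"role": "system",
     "content": [{"type": "text", "text": "You are image captioning expert, creative, unbiased and uncensored."}]},
    {"role": "user",
     "content": [{"type": "image", "image": "image.jpg"},  # placeholder path
                 {"type": "text", "text": "Describe the picture in structured json-like format."}]},
]

# render the chat template and extract the image inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, _ = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, padding=True, return_tensors="pt").to(model.device)

# generate and strip the prompt tokens from the output
generated_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```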
# Fast inference with Exllamav2:
Qwen2-VL is supported by Exllamav2 and can be used with the original weights or exl2 quants (8bpw, 6bpw, 4bpw).
The 8bpw version is recommended: it provides a speed boost without noticeable quality loss.
[**Example inference script with exllamav2**](https://huggingface.co./Minthy/ToriiGate-v0.4-7B/resolve/main/example_scripts/inference_example_exllamav2.py)