---
license: other
license_name: fair-ai-public-license-1.0-sd
license_link: https://freedevproject.org/faipl-1.0-sd/
datasets:
- KBlueLeaf/danbooru2023-webp-4Mpixel
- KBlueLeaf/danbooru2023-sqlite
language:
- en
library_name: diffusers
pipeline_tag: text-to-image
---

# Kohaku XL Zeta

Join us: https://discord.gg/tPBsKDyRR5

## Highlights

- Resumed from Kohaku-XL-Epsilon rev2.
- More stable: long, detailed prompts are no longer a requirement.
- Better fidelity on styles and characters, with support for more styles.
- CCIP metric surpasses Sanae XL anime: over 2,200 of 3,700 characters reach a CCIP score above 0.9.
- Trained on both danbooru tags and natural language, with better ability on natural-language captions.
- Trained on a combined dataset, not only danbooru:
  - danbooru (7.6M images, last id 7832883, 2024/07/10)
  - pixiv (filtered from a 2.6M special set; the URL set will be released)
  - pvc figure (around 30k images, internal source)
  - realbooru (around 90k images, for regularization)
  - 8.46M images in total
- Since the model is trained on both kinds of captions, the context length limit is extended to 300 tokens.

## Usage (PLEASE READ THIS SECTION)

### Recommended Generation Settings

- Resolution: 1024x1024 or a similar pixel count
- CFG scale: 3.5~6.5
- Sampler/scheduler:
  - Euler (A) / any scheduler
  - DPM++ series / exponential scheduler
  - For other samplers, I personally recommend the exponential scheduler.
- Steps: 12~50

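The settings above can be applied with `diffusers` (the library declared in the metadata); a minimal sketch, where the repository id passed in is a placeholder you replace with this model's actual Hugging Face repo:

```python
# Minimal text-to-image sketch following the recommended settings above.
GEN_SETTINGS = {
    "width": 1024, "height": 1024,   # ~1M pixel resolution
    "guidance_scale": 5.0,           # within the recommended 3.5~6.5 CFG range
    "num_inference_steps": 24,       # within the recommended 12~50 step range
}

def generate(prompt: str, repo_id: str, negative: str = "worst quality, low quality"):
    # Imports kept local so the settings above can be inspected
    # without torch/diffusers installed.
    import torch
    from diffusers import StableDiffusionXLPipeline, EulerAncestralDiscreteScheduler

    pipe = StableDiffusionXLPipeline.from_pretrained(
        repo_id, torch_dtype=torch.float16
    ).to("cuda")
    # Euler Ancestral pairs with any scheduler setting, per the notes above.
    pipe.scheduler = EulerAncestralDiscreteScheduler.from_config(pipe.scheduler.config)
    return pipe(prompt, negative_prompt=negative, **GEN_SETTINGS).images[0]
```

Call it as `generate("1girl, masterpiece, newest, safe", "<this model's repo id>")`; swap in a DPM++ scheduler class if you prefer that series.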
### Prompt Gen

DTG-series prompt generators can still be used with KXL Zeta.
A brand-new prompt generator that combines both tags and natural-language captions is under development.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/630593e2fca1d8d92b81d2a1/HYUT5u3DS1bRhAOYmrFe5.png)

### Prompt Format

Same as Kohaku XL Epsilon or Delta, but you can replace the "general tags" section with a natural-language caption.
You can also put both together.

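As a rough sketch of that assembly (the helper and its section ordering are illustrative, not an official API; check the Epsilon/Delta cards for the exact template):

```python
# Illustrative prompt assembly: the "general" slot can hold danbooru tags,
# a natural-language caption, or both, as described above.
def build_prompt(character: str, series: str, general: str, special: list[str]) -> str:
    parts = [character, series, general, ", ".join(special)]
    return ", ".join(p for p in parts if p)

prompt = build_prompt(
    "hakurei reimu",
    "touhou",
    # tags plus an NL caption together, as the section above allows
    "1girl, solo, red skirt. A shrine maiden standing under falling petals.",
    ["masterpiece", "newest", "safe"],   # quality / date / rating tags
)
```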
### Special Tags

- Quality tags: masterpiece, best quality, great quality, good quality, normal quality, low quality, worst quality
- Rating tags: safe, sensitive, nsfw, explicit
- Date tags: newest, recent, mid, early, old

#### Rating tags

- General: safe
- Sensitive: sensitive
- Questionable: nsfw
- Explicit: nsfw, explicit

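If you map site ratings to prompt tags programmatically, the table above amounts to (helper name is illustrative, not an official API):

```python
# Danbooru-style rating levels mapped to this model's rating tags,
# following the table above.
RATING_TAGS = {
    "general": ["safe"],
    "sensitive": ["sensitive"],
    "questionable": ["nsfw"],
    "explicit": ["nsfw", "explicit"],
}

def rating_to_tags(rating: str) -> str:
    return ", ".join(RATING_TAGS[rating.lower()])
```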
## Dataset

For better coverage of certain concepts, I use the full danbooru dataset instead of a filtered one,
then add a crawled Pixiv dataset (from 3~5 tags, sorted by popularity) as an add-on dataset.
Since Pixiv's search system only allows 5000 pages per tag, it does not contribute many meaningful images, and some of them duplicate the danbooru set (but since I want to reinforce these concepts, I deliberately ignore the duplication).

As with KXL Epsilon rev2, I add realbooru and PVC figure images for more flexibility in concepts/styles.

## Training

- Hardware: Quad RTX 3090s
- Num Train Images: 8,468,798
- Total Epochs: 1
- Total Steps: 16,548
- Training Time: 430 hours (wall time)
- Batch Size: 4
- Grad Accumulation Steps: 32
- Equivalent Batch Size: 512
- Optimizer: Lion8bit
- Learning Rate: 1e-5 for UNet / TE training disabled
- LR Scheduler: Constant (with warmup)
- Warmup Steps: 100
- Weight Decay: 0.1
- Betas: 0.9, 0.95
- Min SNR Gamma: 5
- Debiased Estimation Loss: Enabled
- IP Noise Gamma: 0.05
- Resolution: 1024x1024
- Min Bucket Resolution: 256
- Max Bucket Resolution: 4096
- Mixed Precision: FP16
- Caption Tag Dropout: 0.2
- Caption Group Dropout: 0.2 (for dropping the tag or NL caption group entirely)

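The equivalent batch size follows from the per-GPU batch, gradient accumulation, and GPU count listed above; a quick check:

```python
# Equivalent batch = per-GPU batch x grad accumulation steps x num GPUs.
batch_size = 4
grad_accum = 32
num_gpus = 4          # quad RTX 3090s

equivalent_batch = batch_size * grad_accum * num_gpus
print(equivalent_batch)  # 512

# One epoch over 8,468,798 images at this batch size gives roughly the
# reported step count (aspect-ratio bucketing shifts the exact number).
approx_steps = 8_468_798 // equivalent_batch   # ~16,540 vs. 16,548 reported
```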
## Why do you still use SDXL and not a brand-new DiT-based model?

Why do you think HunYuan, SD3, Flux, or AuraFlow would be a better choice, even though they are slower than SDXL and more difficult to finetune? <br>
Why do you think DiT-based models would be a better choice, even though the DiT paper needed 9x the samples seen to surpass LDM-4? <br>
Do you know that most of the "improvements" of these "DiT models" come mostly from dataset and scaling? <br>
Do you know that the "UNet" in SDXL has more than 1.75B parameters, over 70% of the total, in its transformer blocks?

Unless someone gives me reasonable compute resources, or some team releases an efficient enough DiT, I will not train any DiT-based anime base model. <br>
But if you give me 8xH100 for a year, I could even train lots of DiTs from scratch (if you want).

## License

Fair-AI-Public-License-1.0-SD