Update README.md
README.md
CHANGED
@@ -1,3 +1,65 @@
|
|
1 |
---
|
2 |
license: creativeml-openrail-m
|
3 |
---
4 |
+
|
5 |
+
This is the beta version of the yama-no-susume character model (ヤマノススメ, known in English as Encouragement of Climb).
|
6 |
+
Unlike most models out there, this model can generate **multi-character scenes**, not only images of a single character.
|
7 |
+
Of course, the results are still hit-or-miss, but it is possible to get **as many as 5 characters** right in one shot; otherwise, you can always rely on inpainting.
|
8 |
+
Here are two examples (the first one done with some inpainting):
|
9 |
+
|
10 |
+
_Coming soon_
|
11 |
+
|
12 |
+
|
13 |
+
### Dataset description
|
14 |
+
|
15 |
+
The dataset contains around 40K images with the following composition:
|
16 |
+
- 11424 anime screenshots from the four seasons of the anime
|
17 |
+
- 726 pieces of fan art
|
18 |
+
- ~30K customized regularization images
|
19 |
+
|
20 |
+
The model is trained with a specific weighting scheme to balance between different concepts.
|
21 |
+
For example, the three categories above have weights of 0.3, 0.2, and 0.5, respectively.
|
22 |
+
Each category is itself split into many sub-categories in a hierarchical way.
|
23 |
+
For more details on the data preparation process, please refer to https://github.com/cyber-meow/anime_screenshot_pipeline
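
To make the weighting concrete, here is a minimal sketch of how category weights could be turned into per-image repeat factors. The category sizes and weights are the ones listed above (with ~30K taken as 30000); the function name `repeat_factors` and the normalization rule are assumptions for illustration, and the real pipeline applies the weights hierarchically across sub-categories.

```python
# Sketch only: sizes and weights come from the text above; the normalization
# rule is an assumption, not the author's actual pipeline.

def repeat_factors(counts, weights):
    """Return a per-image repeat factor for each category.

    Factors are scaled so that the least-repeated category gets 1.0 and
    each category's share of the sampled images matches its weight.
    """
    # Sampling mass that a single image of each category has to carry.
    per_image = {name: weights[name] / counts[name] for name in counts}
    smallest = min(per_image.values())
    return {name: value / smallest for name, value in per_image.items()}


counts = {"screenshots": 11424, "fan_art": 726, "regularization": 30000}
weights = {"screenshots": 0.3, "fan_art": 0.2, "regularization": 0.5}

factors = repeat_factors(counts, weights)
for name, factor in factors.items():
    print(f"{name}: repeat each image about {factor:.1f} times")
```

Under these assumptions, the small fan art set has to be repeated far more often than the other two categories to reach its 0.2 share.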
|
24 |
+
|
25 |
+
|
26 |
+
### Training Details
|
27 |
+
|
28 |
+
#### Trainer
|
29 |
+
The model was trained using [EveryDream1](https://github.com/victorchall/EveryDream-trainer) as
|
30 |
+
EveryDream seems to be the only trainer out there that supports sample weighting (through the use of `multiply.txt`).
|
31 |
+
Note that for future training it makes sense to migrate to [EveryDream2](https://github.com/victorchall/EveryDream2trainer).
|
32 |
+
|
33 |
+
#### Hardware and cost
|
34 |
+
The model was trained on RunPod with an A6000 and cost me around 80 dollars.
|
35 |
+
However, I estimate that a model of similar quality can be trained for less than 20 dollars on RunPod.
|
36 |
+
|
37 |
+
#### Hyperparameter specification
|
38 |
+
|
39 |
+
- The model was first trained for 18000 steps, at batch size 8, lr 1e-6, resolution 640, and conditional dropping rate of 15%.
|
40 |
+
- After this, I slightly modified the captions and trained the model for another 22000 steps, at batch size 8, lr 1e-6, resolution 704, and conditional dropping rate of 15%.
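
As an illustration of the 15% conditional dropping rate, the sketch below assumes the term means that each caption is replaced by an empty string with probability 0.15 during training, which is the usual way to support classifier-free guidance. The helper `maybe_drop_caption` is hypothetical and is not taken from EveryDream.

```python
import random


def maybe_drop_caption(caption, drop_rate=0.15, rng=random):
    """Return an empty caption with probability drop_rate, else the caption."""
    return "" if rng.random() < drop_rate else caption


rng = random.Random(0)  # fixed seed so the sketch is reproducible
kept = [maybe_drop_caption("example caption", rng=rng) for _ in range(10_000)]
dropped_fraction = kept.count("") / len(kept)
print(f"dropped {dropped_fraction:.1%} of captions")
```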
|
41 |
+
|
42 |
+
Note that, as a consequence of the weighting scheme, which translates into a different multiply value for each image,
|
43 |
+
the repeat and epoch counts have quite a different meaning here.
|
44 |
+
For example, depending on the weighting, I have 400K~600K images (some images are used multiple times) in an epoch,
|
45 |
+
and therefore I did not even finish an entire epoch with the 40000 steps at batch size 8.
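
The arithmetic behind this statement can be checked quickly. The step counts, batch size, and epoch sizes below are the ones given in this section; nothing else is assumed.

```python
steps_stage_1 = 18_000
steps_stage_2 = 22_000
batch_size = 8

# Total number of (possibly repeated) images seen during training.
images_seen = (steps_stage_1 + steps_stage_2) * batch_size
print(images_seen)  # 320000

for epoch_size in (400_000, 600_000):
    fraction = images_seen / epoch_size
    print(f"epoch of {epoch_size} images: {fraction:.0%} of one epoch covered")
```

Even with the smaller epoch size of 400K images, the 320K images seen cover only 80% of one epoch.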
|
46 |
+
|
47 |
+
### Failures
|
48 |
+
|
49 |
+
I tried several things in this model (this is why I trained for so long), but most of them failed.
|
50 |
+
|
51 |
+
- I put the number of people at the beginning of the captions, but after 40000 steps the model still cannot count
|
52 |
+
(it may generate anywhere from 3 to 5 people when prompted with `3people`).
|
53 |
+
- I used some tokens to describe the face position within a 5x5 grid, but the model did not learn anything about these tokens.
|
54 |
+
I think this is due to either 1) face position being too abstract to learn, 2) data imbalance, as I did not balance my training for this, or 3) captions not being focused enough on these concepts (they are much longer and contain other information).
|
55 |
+
- As mentioned, the model can generate multi-character scenes, but the success rate gets lower and lower as we increase the number of characters in the scene.
|
56 |
+
Character bleeding is always a hard problem to solve.
|
57 |
+
- The model is trained with 5% weight for hand images, but I doubt it helps in any way.
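
For readers unfamiliar with the face-position idea in the list above, here is a minimal sketch of how a face center could be mapped to a cell of a 5x5 grid and turned into a caption token. The token format `face_r{row}_c{col}` and both function names are hypothetical; the tokens actually used in the captions are not given here.

```python
def face_grid_cell(cx, cy, width, height, grid=5):
    """Map a face center in pixels to a (row, col) cell of a grid x grid layout."""
    # Clamp so that a center on the right or bottom edge stays in the last cell.
    col = min(int(cx / width * grid), grid - 1)
    row = min(int(cy / height * grid), grid - 1)
    return row, col


def face_position_token(cx, cy, width, height, grid=5):
    """Build a caption token for the face position (hypothetical format)."""
    row, col = face_grid_cell(cx, cy, width, height, grid)
    return f"face_r{row}_c{col}"


# A face centered horizontally near the top of a 640x640 image.
print(face_position_token(320, 100, 640, 640))  # face_r0_c2
```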
|
58 |
+
|
59 |
+
Actually, I doubt whether the last 22000 steps really improved the model.
|
60 |
+
This is how I arrive at my $20 estimate, also taking into account that we can simply train at resolution 512 on a 3090 with ED2.
|
61 |
+
|
62 |
+
|
63 |
+
### More Example Generations
|
64 |
+
|
65 |
+
_Coming soon_
|