panopstor committed on
Commit
66e8dce
1 Parent(s): d813033

Upload ff7rv4-1.ckpt

This is a moderate-scale fine-tuned Stable Diffusion model with characters and scenery from the video game Final Fantasy 7 Remake. The novelty here is using "dreambooth" techniques with an extended training set of 1400+ images covering more than a dozen concepts, including 5 characters trained to the point of being nearly indistinguishable from the game, with the style of the game's render engine captured via scenery concepts.

The 4 main characters Cloud Strife, Tifa Lockhart, Barret Wallace, and Aerith Gainsborough are very well represented with 120-140 images each, as is Jessie Rasberry with just 90 images.

Additionally, "Biggs ff7r", "Wedge ff7r", "Shinra Security Officer" and various side characters have limited training, typically 20-40 images each. Biggs and Wedge were trained with "ff7r" as if it were a surname, since they have no canonical surname to aid in training. Sephiroth, President Shinra, and Rufus Shinra are poorly represented (<20 images each) but present, with output quality mirroring their share of the training data.

Styles and scenery such as "midgar city", "streets of midgar city business district", "midgar slums district", "seventh heaven bar", "train car", "train station in midgar", and "aerial photo of midgar city" are trained as well. Because each training image was annotated with its own caption, these concepts can be mixed. "Train station in midgar city slums district" will produce different output from "train station in midgar city business district", and a prompt such as "Iron Man standing on the rooftops of midgar slums district" also gives compelling output.
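As a rough illustration of mixing concepts, the sketch below builds prompts by pairing subjects with scenery phrases quoted above. The function name build_prompts and the prompt template are hypothetical, chosen only for this example; the resulting strings would be passed to whatever Stable Diffusion front end loads the checkpoint.

```python
from itertools import product

# Subject and scenery phrases taken from the description above.
SUBJECTS = ["cloud strife", "tifa lockhart", "Iron Man"]
SCENES = ["midgar city slums district", "midgar city business district"]


def build_prompts(subjects, scenes, template="{subject} standing in {scene}"):
    """Return one prompt string per (subject, scene) pair.

    The template is an assumption for illustration, not a required format.
    """
    return [
        template.format(subject=subject, scene=scene)
        for subject, scene in product(subjects, scenes)
    ]


prompts = build_prompts(SUBJECTS, SCENES)
for prompt in prompts:
    print(prompt)
```

Generating the full grid of combinations is a quick way to check whether a character bleeds into the scenery or the reverse.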

The data set was compiled from screenshots of the game, which were captured, cropped, resized, and annotated with the assistance of BLIP. The BLIP captions were then modified to name the new concepts that BLIP cannot detect, for example replacing "a man..." with "cloud strife...".
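The caption edit described above can be sketched as a simple text substitution. This is a minimal sketch, not the tooling actually used for this data set: the function name relabel_caption is hypothetical, and it assumes the generic subject appears as a whole phrase in the BLIP caption.

```python
import re


def relabel_caption(caption: str, generic: str, concept: str) -> str:
    """Replace the first whole-phrase occurrence of a generic subject
    (e.g. "a man") with a concept name (e.g. "cloud strife")."""
    # Word boundaries keep "a man" from matching inside "a mannequin".
    pattern = r"\b" + re.escape(generic) + r"\b"
    return re.sub(pattern, concept, caption, count=1)


blip_caption = "a man standing in a garden with a waterfall in the background"
relabeled = relabel_caption(blip_caption, "a man", "cloud strife")
print(relabeled)
```

Only the first match is replaced, so captions with several people still need manual review to assign each name to the right subject.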

Approximately 90 of the images in the data set show more than one character, such as "cloud strife and barret wallace standing in a garden with a waterfall in the background". Based on prior attempts, these additional images greatly improve the ability to draw characters in group settings at inference time, with a reduced propensity for clothing, styles, body types, etc. to bleed between characters.

Training was performed using Kane Wallmann's fork of Xavier's original DreamBooth Stable Diffusion repo on an RTX 3090 24GB. 3 epochs with 5 repeats were trained at LR 1e-6, then an additional epoch of ~4000 steps was performed at LR 5e-7 for a total of ~14,000 steps.

Regularization was used on categories man, woman, city, indoors, building, dog, sword, person, and group.

Files changed (2)
  1. .gitattributes +1 -0
  2. ff7rv4-1.ckpt +3 -0
.gitattributes CHANGED
@@ -30,3 +30,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ ff7rv4-1.ckpt filter=lfs diff=lfs merge=lfs -text
ff7rv4-1.ckpt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1ba3eb041e7cb28d403f1c4feb1744db44d3f27ae5878b84249ae121f2be4b08
+ size 2132885089
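
The three added lines are a Git LFS pointer: the repository stores this small text file, while the checkpoint itself lives in LFS storage. As a minimal sketch, the pointer can be parsed to read off the hash and file size. The helper name parse_lfs_pointer is hypothetical and handles only the three keys shown in this diff.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    # The oid line has the form "<hash algorithm>:<hex digest>".
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size_bytes": int(fields["size"]),
    }


# Pointer contents copied from the diff above.
POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:1ba3eb041e7cb28d403f1c4feb1744db44d3f27ae5878b84249ae121f2be4b08
size 2132885089
"""

info = parse_lfs_pointer(POINTER)
print(info["hash_algo"], info["digest"])
print(f"{info['size_bytes'] / 1e9:.2f} GB")
```

The size field shows the checkpoint is about 2.13 GB, and the digest can be compared against a locally computed SHA-256 of the downloaded file.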