This is a merge of NoobAI Eps 1.1 (https://huggingface.co./Laxhar/noobai-XL-1.0) and Animagine 4.0 (https://huggingface.co./cagliostrolab/animagine-xl-4.0) that uses a technique inspired by Fisher-weighted averaging (https://arxiv.org/abs/2111.09832).
The importance of each parameter is estimated from a notion of curvature of the loss landscape, and then each pair of corresponding parameters in the two models is averaged with weights proportional to the estimated importances.
Naively applying the formulas from the paper, adapted to diffusion models through the ELBO, gave poor results, so I had to innovate a little.
The Approach
The goal is to estimate which parameters of each model are very important and which are less important. I did this by first computing gradients with respect to an objective of the following form (written schematically; the reduction over pixels and batch is left implicit):
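$$L(x, t) = \frac{L_0(x, t)}{C(x, t)}$$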
where L_0 expresses what we are trying to optimize (A corresponds to NoobAI, B corresponds to Animagine):
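$$L_0(x, t) = \mathbb{E}_{\epsilon}\Big[\big\lVert \epsilon_A(x_t, t, c_A) - \epsilon_B(x_t, t, c_B) \big\rVert^2\Big]$$

Here eps_A and eps_B are the noise predictions of the two UNets given the same noised latent x_t, and c_A and c_B are the respective prompts described in the Merge Details section; the squared difference is one natural way to measure the gap between the two predictions.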
and C is a constant equal to L_0:
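$$C(x, t) = \operatorname{sg}\big[L_0(x, t)\big]$$

where sg denotes a stop-gradient: C has the value of L_0, but no gradients flow through it.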
Note: as indicated in the formulae, C(x, t) is held constant. The intent is to make the contribution to the accumulated absolute gradients (see below) constant across timesteps. Since C(x, t) = L_0(x, t), L evaluates to 1. However, writing L as simply 1 overlooks how the gradients are computed: the numerator L_0 still propagates gradients while C does not. I think the formulation with the expected value is less ambiguous.
x_t and snr_t were taken from the DDPM paper (https://arxiv.org/abs/2006.11239):
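$$x_t = \sqrt{\bar\alpha_t}\, x + \sqrt{1 - \bar\alpha_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \qquad \mathrm{snr}_t = \frac{\bar\alpha_t}{1 - \bar\alpha_t}, \qquad \bar\alpha_t = \prod_{s=1}^{t}(1 - \beta_s)$$

where x is the clean latent and beta_s is the noise schedule.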
It's important to note that L is not used to train any model here. Instead, we accumulate absolute gradients to estimate the importance of each parameter (explained below):
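$$I^{A}_{i} = \mathbb{E}_{x,\, t}\left[\left|\frac{\partial L}{\partial \theta^{A}_{i}}\right|\right], \qquad I^{B}_{i} = \mathbb{E}_{x,\, t}\left[\left|\frac{\partial L}{\partial \theta^{B}_{i}}\right|\right]$$

where theta^A_i and theta^B_i are the i-th parameters of NoobAI and Animagine, and the expectations are estimated by averaging over the sampled images and timesteps.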
To put it simply, we compare the predictions of the models given the same (or very similar) inputs in situations where we expect them to differ, and determine which parameters would contribute the most to closing the gap between the models.
In Fisher-weighted averaging, the squared gradients are typically used. However, with the budget I had, I couldn't afford to store gradients in anything other than FP16. The loss cannot be scaled much more than it already is, as doing so leads to NaNs during backpropagation inside the model, and some of the gradients are too small to be squared without underflowing. I also tried taking the square root of the expected squared gradients to see if it helped regularize their extremely flat distribution, which spans the entire range of FP16.
Overall, given the FP16 constraint, the approach that gave the best results was taking the expected value of the absolute gradients.
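Schematically, the accumulation loop looks like the sketch below. This is PyTorch-style pseudocode rather than the exact script: the model wrappers, the dataloader yielding clean latents plus pre-encoded prompts, the FP32 accumulators and the helper names are all simplifying assumptions.

```python
import torch

def accumulate_abs_grads(model_a, model_b, dataloader, alphas_cumprod, sample_t, num_steps):
    # Running sums of |dL/dtheta| for every parameter of each model.
    # Kept in FP32 here for readability; the actual run was constrained to FP16 storage.
    abs_grads_a = {n: torch.zeros_like(p) for n, p in model_a.named_parameters()}
    abs_grads_b = {n: torch.zeros_like(p) for n, p in model_b.named_parameters()}

    data_iter = iter(dataloader)
    for _ in range(num_steps):
        # Assumed dataloader output: a clean latent plus one pre-encoded prompt per model.
        latents, cond_a, cond_b = next(data_iter)

        # Sample t with p(t) proportional to 1 / snr_t, then form x_t as in DDPM.
        t = sample_t()
        noise = torch.randn_like(latents)
        a_bar = alphas_cumprod[t]
        x_t = a_bar.sqrt() * latents + (1.0 - a_bar).sqrt() * noise

        eps_a = model_a(x_t, t, cond_a)
        eps_b = model_b(x_t, t, cond_b)

        # L_0 is the squared gap between the two predictions; dividing by a detached
        # copy (C) makes L evaluate to 1 while normalizing its gradients per (x, t).
        l0 = (eps_a - eps_b).pow(2).mean()
        loss = l0 / l0.detach()

        model_a.zero_grad(set_to_none=True)
        model_b.zero_grad(set_to_none=True)
        loss.backward()

        # Accumulate absolute gradients for both models.
        for n, p in model_a.named_parameters():
            if p.grad is not None:
                abs_grads_a[n] += p.grad.abs()
        for n, p in model_b.named_parameters():
            if p.grad is not None:
                abs_grads_b[n] += p.grad.abs()

    # Expected absolute gradient = running sum / number of accumulation steps.
    return ({n: g / num_steps for n, g in abs_grads_a.items()},
            {n: g / num_steps for n, g in abs_grads_b.items()})
```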
In a sense, the parameters with higher expected absolute gradients sit on a steeper slope of the loss landscape. This means that merging these parameters with a naive weighted average changes the loss L much more than merging parameters with smaller expected absolute gradients.
In our case with NoobAI and Animagine, because the loss landscape is highly non-linear, naively merging high-slope parameters wrecks the loss instead of improving it: the merge cannot even denoise anything anymore. We therefore want to move high-slope parameters as little as possible, keeping them in place wherever we can.
To this end, we merge the models according to the weighted average equation used in the Fisher-weighted averaging paper:
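$$\theta^{\mathrm{merged}}_{i} = \frac{\lambda\, I^{A}_{i}\, \theta^{A}_{i} + (1 - \lambda)\, I^{B}_{i}\, \theta^{B}_{i}}{\lambda\, I^{A}_{i} + (1 - \lambda)\, I^{B}_{i}}$$

with the expected absolute gradients I^A and I^B standing in for the Fisher information, and lambda the model-level weight discussed in the Merge Details section.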
To put it simply, this weights each parameter of both models by its estimated importance. In a sense, for each parameter position to fill, we lean towards "picking" the more important of the two parameters from NoobAI and Animagine.
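Concretely, given the two state dicts and the two accumulated importance maps, the merge reduces to a few lines (a sketch; the small eps guard against positions where both importances underflow to zero is my addition, and lam is the model-level weight tuned in the Merge Details section below):

```python
def fisher_style_merge(theta_a, theta_b, imp_a, imp_b, lam, eps=1e-8):
    """Weighted average of two state dicts, weighting each parameter by lambda
    times its estimated importance (the expected absolute gradient)."""
    merged = {}
    for name, param_a in theta_a.items():
        w_a = lam * imp_a[name]
        w_b = (1.0 - lam) * imp_b[name]
        merged[name] = (w_a * param_a + w_b * theta_b[name]) / (w_a + w_b + eps)
    return merged
```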
The distribution of the timestep t sampled during gradient accumulation is very important. The merge is very "glitchy" (high-frequency artifacts, light dots, noise, etc.) if the timesteps do not follow a distribution exactly proportional to the inverse of the SNR. Otherwise, gradients related to low timesteps end up overwhelming those related to higher timesteps, which matter more at the end of the diffusion process (and the beginning of the denoising process), where x_t is almost pure noise.
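For reference, one way to draw timesteps with probability proportional to 1/snr_t (a sketch; the linear beta schedule below is the plain DDPM one and is only illustrative, not necessarily the schedule the two models were trained with):

```python
import torch

betas = torch.linspace(1e-4, 0.02, 1000)            # illustrative DDPM-style schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
snr = alphas_cumprod / (1.0 - alphas_cumprod)

# p(t) proportional to 1 / snr_t: noisier (higher) timesteps are drawn more often.
probs = (1.0 / snr) / (1.0 / snr).sum()

def sample_t() -> int:
    return int(torch.multinomial(probs, 1).item())
```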
Merge Details
For this merge, I gathered 10K random Danbooru images from the danbooru2023 dataset (https://huggingface.co./datasets/KBlueLeaf/danbooru2023-metadata-database/tree/main), each with at least 1024^2 pixels (not necessarily square). For NoobAI, following the guidance from the readme of the Hugging Face release, the prompts follow this format:
masterpiece, best quality, newest, absurdres, highres, safe, <ordered danbooru tags>
For Animagine, following the guidance from the readme of the Hugging Face release, the prompts follow this format:
<ordered danbooru tags>, masterpiece, high score, great score, absurdres
For half of the dataset, a random artist tag is then prepended to the prompt, following the distribution of artists on the Danbooru image board (only including artists the models were trained on; those with a larger catalogue get more weight). In particular, at each gradient accumulation step, the artist tag is prepended either to the prompt of NoobAI (on even iterations) or to that of Animagine (on odd iterations), but never to both at the same time.
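Putting the two templates and the artist rule together, each accumulation step builds its pair of prompts roughly as follows (a sketch; the artist list and its catalogue-size weights are left as abstract inputs):

```python
import random

def build_prompts(step: int, tags: str, artists: list, artist_weights: list):
    """Half of the samples get an artist tag, drawn with probability proportional
    to catalogue size, prepended to only one model's prompt."""
    noob = f"masterpiece, best quality, newest, absurdres, highres, safe, {tags}"
    anim = f"{tags}, masterpiece, high score, great score, absurdres"
    if random.random() < 0.5:
        artist = random.choices(artists, weights=artist_weights, k=1)[0]
        if step % 2 == 0:
            noob = f"{artist}, {noob}"   # even iterations: artist tag goes to NoobAI
        else:
            anim = f"{artist}, {anim}"   # odd iterations: artist tag goes to Animagine
    return noob, anim
```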
The goal of this prompting strategy is to make sure the artist distribution of the models is at least partially covered (there are too many artists to sample them all). It increases L_0, which helps reduce noise in the computed gradients, and it activates slightly different paths in the two models, which helps cover an overall wider region of the loss landscape.
Since we compare the outputs of the models directly to each other, rather than to an absolute expected epsilon noise map, this asymmetric prompting strategy does not affect the quality of the accumulated gradients too much.
The merge uses a value of lambda that gives equal weight to both models. The gradients of Animagine are on average larger than those of NoobAI, so lambda has to be tweaked if we want a merge that includes a proportional amount of parameters from both models. I am not aware of a closed-form solution for lambda, so I used a bisection algorithm (sketched after the list below) to estimate it to a reasonable precision.
- This gave a value of lambda = 0.224609375, which corresponds to the "DeadCenter" version.
- Another opinionated value of lambda that I found aesthetically pleasing with some artist tags is lambda = 0.3564453125 (more a random value than an optimized one), which corresponds to the "MadHatter" version.
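The sketch below shows one way to run that bisection. The balance criterion shown (each model dominating the weighted average at roughly half of the parameter positions) is one reasonable reading of "a proportional amount of parameters from both models":

```python
def dominance_fraction(imp_a, imp_b, lam):
    """Fraction of parameter positions where NoobAI's weighted importance wins."""
    wins, total = 0, 0
    for name in imp_a:
        wins += (lam * imp_a[name] > (1.0 - lam) * imp_b[name]).sum().item()
        total += imp_a[name].numel()
    return wins / total

def bisect_lambda(imp_a, imp_b, target=0.5, iters=10):
    # dominance_fraction is monotonically non-decreasing in lambda, so a plain
    # bisection converges; 10 halvings give a resolution of about 1/1024.
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if dominance_fraction(imp_a, imp_b, mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```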