DreamBooth Hackathon 🏆

Welcome to the DreamBooth Hackathon! In this competition, you’ll personalise a Stable Diffusion model by fine-tuning it on a handful of your own images. To do so, we’ll use a technique called DreamBooth, which allows one to implant a subject (e.g. your pet or favourite dish) into the output domain of the model such that it can be synthesized with a unique identifier in the prompt.

Let’s dive in!

Prerequisites

Before diving into this notebook, you should read the:

Unit 3 README that contains a deep dive into Stable Diffusion
DreamBooth blog post to get a sense of what’s possible with this technique
Hugging Face blog post on best practices for fine-tuning Stable Diffusion with DreamBooth

🚨 Note: the code in this notebook requires at least 14GB of GPU vRAM and is a simplified version of the official training script provided in 🤗 Diffusers. It produces decent models for most applications, but we recommend experimenting with the advanced features like class preservation loss & fine-tuning the text encoder if you have at least 24GB vRAM available. Check out the 🤗 Diffusers docs for more details.

What is DreamBooth?

DreamBooth is a technique to teach new concepts to Stable Diffusion using a specialized form of fine-tuning. If you’re on Twitter or Reddit, you may have seen people using this technique to create (often hilarious) avatars of themselves. For example, here’s what Andrej Karpathy would look like as a cowboy (you may need to run the cell to see the output):

%%html
<blockquote class="twitter-tweet"><p lang="en" dir="ltr">Stableboost auto-suggests a few hundred prompts by default but you can generate additional variations for any one prompt that seems to be giving fun/interesting results, or adjust it in any way: <a href="https://t.co/qWmadiXftP">pic.twitter.com/qWmadiXftP</a></p>&mdash; Andrej Karpathy (@karpathy) <a href="https://twitter.com/karpathy/status/1600578187141840896?ref_src=twsrc%5Etfw">December 7, 2022</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

The way DreamBooth works is as follows:

Collect around 10-20 input images of a subject (e.g., your dog) and define a unique identifier [V] that refers to the subject. This identifier is usually some made up word like flffydog which is implanted in different text prompts at inference time to place the subject in different contexts.
Fine-tune the diffusion model by providing the images together with a text prompt like “A photo of a [V] dog” that contains the unique identifier and class name (i.e., “dog” in this example).
(Optionally) Apply a special class-specific prior preservation loss, which leverages the semantic prior that the model has on the class and encourages it to generate diverse instances belong to the subject’s class by injecting the class name in the text prompt. In practice, this step is only really needed for human faces and can be skipped for the themes we’ll be exploring in this hackathon.

An overview of the DreamBooth technique is shown in the image below:

What can DreamBooth do?

Besides putting your subject in interesting locations, DreamBooth can be used for text-guided view synthesis, where the subject is viewed from different viewpoints as shown in the example below:

DreamBooth can also be used to modify properties of the subject, such as colour or mixing up animal species!

Now that we’ve seen some of the cool things DreamBooth can do, let’s start training our own models!

Step 1: Setup

If you’re running this notebook on Google Colab or Kaggle, run the cell below to install the required libraries:

%pip install -qqU diffusers transformers bitsandbytes accelerate ftfy datasets

If you’re running on Kaggle, you’ll need to install the latest PyTorch version to work with 🤗 Accelerate:

# Uncomment and run if using Kaggle's notebooks. You may need to restart the notebook afterwards
# %pip install -U torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116

To be able to push your model to the Hub and make it appear on the DreamBooth Leaderboard, there are a few more steps to follow. First you have to create an access token with write access from your Hugging Face account and then execute the following cell and input your token:

from huggingface_hub import notebook_login

notebook_login()

The final step is to install Git LFS:

%%capture
!sudo apt -qq install git-lfs
!git config --global credential.helper store

Step 2: Pick a theme

This competition is composed of 5 themes, where each theme will collect models belong to the following categories:

Animal 🐨: Use this theme to generate images of your pet or favourite animal hanging out in the Acropolis, swimming, or flying in space.
Science 🔬: Use this theme to generate cool synthetic images of galaxies, proteins, or any domain of the natural and medical sciences.
Food 🍔: Use this theme to tune Stable Diffusion on your favourite dish or cuisine.
Landscape 🏔: Use this theme to generate beautiful landscapes of your faourite mountain, lake, or garden.
Wildcard 🔥: Use this theme to go wild and create Stable Diffusion models for any category of your choosing!

We’ll be giving out prizes to the top 3 most liked models per theme, and you’re encouraged to submit as many models as you want! Run the cell below to create a dropdown widget where you can select the theme you wish to submit to:

import ipywidgets as widgets

theme = "animal"
drop_down = widgets.Dropdown(
    options=["animal", "science", "food", "landscape", "wildcard"],
    description="Pick a theme",
    disabled=False,
)


def dropdown_handler(change):
    global theme
    theme = change.new


drop_down.observe(dropdown_handler, names="value")
display(drop_down)

>>> print(f"You've selected the {theme} theme!")

You've selected the animal theme!

Step 3: Create an image dataset and upload it to the Hub

Once you’ve picked a theme, the next step is to create a dataset of images for that theme and upload it to the Hugging Face Hub:

You’ll need around 10-20 images of the subject that you wish to implant in the model. These can be photos you’ve taken or downloaded from platforms like Unsplash. Alternatively, you can take a look at any of the image datasets on the Hugging Face Hub for inspiration.
For best results, we recommend using images of your subject from different angles and perspectives.

Once you’ve collected your images in a folder, you can upload them to the Hub by using the UI to drag and drop your images. See this guide for more details, or watch the video below:

>>> from IPython.display import YouTubeVideo

>>> YouTubeVideo("HaN6qCr_Afc")

Alternatively, you can load your dataset locally using the imagefolder feature of 🤗 Datasets and then push it to the Hub:

from datasets import load_dataset

dataset = load_dataset("imagefolder", data_dir="your_folder_of_images")
# Push to Hub
dataset.push_to_hub("dreambooth-hackathon-images")
dataset = dataset['train']

Once you’ve created your dataset, you can download it by using the load_dataset() function as follows:

from datasets import load_dataset

dataset_id = "lewtun/corgi"  # CHANGE THIS TO YOUR {hub_username}/{dataset_id}
dataset = load_dataset(dataset_id, split="train")
dataset

Now that we have our dataset, let’s define a helper function to view a few of the images:

>>> from PIL import Image


>>> def image_grid(imgs, rows, cols):
...     assert len(imgs) == rows * cols
...     w, h = imgs[0].size
...     grid = Image.new("RGB", size=(cols * w, rows * h))
...     grid_w, grid_h = grid.size
...     for i, img in enumerate(imgs):
...         grid.paste(img, box=(i % cols * w, i // cols * h))
...     return grid


>>> num_samples = 4
>>> image_grid(dataset["image"][:num_samples], rows=1, cols=num_samples)

If this looks good, you can move onto the next step - creating a PyTorch dataset for training with DreamBooth.

Step 3: Create a training dataset

To create a training set for our images we need a few components:

An instance prompt that is used to prime the model at the start of training. In most cases, using “a photo of [identifier][class noun]” works quite well, e.g., “a photo of ccorgi dog” for our cute Corgi pictures.
- Note: it is recommended that you pick a unique / made up word like ccorgi to describe your subject. This will ensure a common word in the model’s vocabulary isn’t overwritten.
A tokenizer to convert the instance prompt into input IDs that can be fed to the text encoder of Stable Diffusion.
A set of image transforms, notably resizing the images to a common shape and normalizing the pixel values to a common mean and standard distribution.

With this in mind, let’s start by defining the instance prompt:

>>> name_of_your_concept = "ccorgi"  # CHANGE THIS ACCORDING TO YOUR SUBJECT
>>> type_of_thing = "dog"  # CHANGE THIS ACCORDING TO YOUR SUBJECT
>>> instance_prompt = f"a photo of {name_of_your_concept} {type_of_thing}"
>>> print(f"Instance prompt: {instance_prompt}")

Instance prompt: a photo of ccorgi dog

Next, we need to create a PyTorch Dataset object that implements the __len__ and __getitem__ dunder methods:

from torch.utils.data import Dataset
from torchvision import transforms


class DreamBoothDataset(Dataset):
    def __init__(self, dataset, instance_prompt, tokenizer, size=512):
        self.dataset = dataset
        self.instance_prompt = instance_prompt
        self.tokenizer = tokenizer
        self.size = size
        self.transforms = transforms.Compose(
            [
                transforms.Resize(size),
                transforms.CenterCrop(size),
                transforms.ToTensor(),
                transforms.Normalize([0.5], [0.5]),
            ]
        )

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        example = {}
        image = self.dataset[index]["image"]
        example["instance_images"] = self.transforms(image)
        example["instance_prompt_ids"] = self.tokenizer(
            self.instance_prompt,
            padding="do_not_pad",
            truncation=True,
            max_length=self.tokenizer.model_max_length,
        ).input_ids
        return example

Great, let’s now check this works by loading the CLIP tokenizer associated with the text encoder of the original Stable Diffusion model, and then creating the training dataset:

from transformers import CLIPTokenizer

# The Stable Diffusion checkpoint we'll fine-tune
model_id = "CompVis/stable-diffusion-v1-4"
tokenizer = CLIPTokenizer.from_pretrained(
    model_id,
    subfolder="tokenizer",
)

train_dataset = DreamBoothDataset(dataset, instance_prompt, tokenizer)
train_dataset[0]

Step 4: Define a data collator

Now that we have a training dataset, the next thing we need is to define a data collator. A data collator is a function that collects elements in a batch of data and applies some logic to form a single tensor we can provide to the model. If you’d to learn more, you can check out this video from the Hugging Face Course:

>>> YouTubeVideo("-RPeakdlHYo")

For DreamBooth, our data collator need to provide the model with the input IDs from the tokenizer and the pixel values from the images as a stacked tensor. The function below does the trick:

import torch


def collate_fn(examples):
    input_ids = [example["instance_prompt_ids"] for example in examples]
    pixel_values = [example["instance_images"] for example in examples]
    pixel_values = torch.stack(pixel_values)
    pixel_values = pixel_values.to(memory_format=torch.contiguous_format).float()

    input_ids = tokenizer.pad({"input_ids": input_ids}, padding=True, return_tensors="pt").input_ids

    batch = {
        "input_ids": input_ids,
        "pixel_values": pixel_values,
    }
    return batch

Step 5: Load the components of the Stable Diffusion pipeline

We nearly have all the pieces ready for training! As you saw in the Unit 3 notebook on Stable Diffusion, the pipeline is composed of several models:

A text encoder that converts the prompts into text embeddings. Here we’re using CLIP since it’s the encoder used to train Stable Diffusion v1-4.
A VAE or variational autoencoder that converts the images to compressed representations (i.e., latents) and decompresses them at inference time.
A UNet that applies the denoising operation on the latent of the VAE.

We can load all these components using the 🤗 Diffusers and 🤗 Transformers libraries as follows:

from diffusers import AutoencoderKL, UNet2DConditionModel
from transformers import CLIPFeatureExtractor, CLIPTextModel

text_encoder = CLIPTextModel.from_pretrained(model_id, subfolder="text_encoder")
vae = AutoencoderKL.from_pretrained(model_id, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(model_id, subfolder="unet")
feature_extractor = CLIPFeatureExtractor.from_pretrained("openai/clip-vit-base-patch32")

Step 6: Fine-tune the model

Now comes the fun part - training our model with DreamBooth! As shown in the Hugging Face’s blog post, the most essential hyperparameters to tweak are the learning rate and number of training steps.

In general, you’ll get better results with a lower learning rate at the expense of needing to increase the number of training steps. The values below are a good starting point, but you may need to adjust them according to your dataset:

learning_rate = 2e-06
max_train_steps = 400

Next, let’s wrap the other hyperparameters we need in a Namespace object to make it easier to configure the training run:

from argparse import Namespace

args = Namespace(
    pretrained_model_name_or_path=model_id,
    resolution=512,  # Reduce this if you want to save some memory
    train_dataset=train_dataset,
    instance_prompt=instance_prompt,
    learning_rate=learning_rate,
    max_train_steps=max_train_steps,
    train_batch_size=1,
    gradient_accumulation_steps=1,  # Increase this if you want to lower memory usage
    max_grad_norm=1.0,
    gradient_checkpointing=True,  # Set this to True to lower the memory usage
    use_8bit_adam=True,  # Use 8bit optimizer from bitsandbytes
    seed=3434554,
    sample_batch_size=2,
    output_dir="my-dreambooth",  # Where to save the pipeline
)

The final step is to define a training_function() function that wraps the training logic and can be passed to 🤗 Accelerate to handle training on 1 or more GPUs. If this is the first time you’re using 🤗 Accelerate, check out this video to get a quick overview of what it can do:

>>> YouTubeVideo("s7dy8QRgjJ0")

The details should look familiar to what we saw in Units 1 & 2 when we trained our own diffusion models from scratch:

import math

import torch.nn.functional as F
from accelerate import Accelerator
from accelerate.utils import set_seed
from diffusers import DDPMScheduler, PNDMScheduler, StableDiffusionPipeline
from diffusers.pipelines.stable_diffusion import StableDiffusionSafetyChecker
from torch.utils.data import DataLoader
from tqdm.auto import tqdm


def training_function(text_encoder, vae, unet):

    accelerator = Accelerator(
        gradient_accumulation_steps=args.gradient_accumulation_steps,
    )

    set_seed(args.seed)

    if args.gradient_checkpointing:
        unet.enable_gradient_checkpointing()

    # Use 8-bit Adam for lower memory usage or to fine-tune the model in 16GB GPUs
    if args.use_8bit_adam:
        import bitsandbytes as bnb

        optimizer_class = bnb.optim.AdamW8bit
    else:
        optimizer_class = torch.optim.AdamW

    optimizer = optimizer_class(
        unet.parameters(),  # Only optimize unet
        lr=args.learning_rate,
    )

    noise_scheduler = DDPMScheduler(
        beta_start=0.00085,
        beta_end=0.012,
        beta_schedule="scaled_linear",
        num_train_timesteps=1000,
    )

    train_dataloader = DataLoader(
        args.train_dataset,
        batch_size=args.train_batch_size,
        shuffle=True,
        collate_fn=collate_fn,
    )

    unet, optimizer, train_dataloader = accelerator.prepare(unet, optimizer, train_dataloader)

    # Move text_encode and vae to gpu
    text_encoder.to(accelerator.device)
    vae.to(accelerator.device)

    # We need to recalculate our total training steps as the size of the training dataloader may have changed
    num_update_steps_per_epoch = math.ceil(len(train_dataloader) / args.gradient_accumulation_steps)
    num_train_epochs = math.ceil(args.max_train_steps / num_update_steps_per_epoch)

    # Train!
    total_batch_size = args.train_batch_size * accelerator.num_processes * args.gradient_accumulation_steps
    # Only show the progress bar once on each machine
    progress_bar = tqdm(range(args.max_train_steps), disable=not accelerator.is_local_main_process)
    progress_bar.set_description("Steps")
    global_step = 0

    for epoch in range(num_train_epochs):
        unet.train()
        for step, batch in enumerate(train_dataloader):
            with accelerator.accumulate(unet):
                # Convert images to latent space
                with torch.no_grad():
                    latents = vae.encode(batch["pixel_values"]).latent_dist.sample()
                    latents = latents * 0.18215

                # Sample noise that we'll add to the latents
                noise = torch.randn(latents.shape).to(latents.device)
                bsz = latents.shape[0]
                # Sample a random timestep for each image
                timesteps = torch.randint(
                    0,
                    noise_scheduler.config.num_train_timesteps,
                    (bsz,),
                    device=latents.device,
                ).long()

                # Add noise to the latents according to the noise magnitude at each timestep
                # (this is the forward diffusion process)
                noisy_latents = noise_scheduler.add_noise(latents, noise, timesteps)

                # Get the text embedding for conditioning
                with torch.no_grad():
                    encoder_hidden_states = text_encoder(batch["input_ids"])[0]

                # Predict the noise residual
                noise_pred = unet(noisy_latents, timesteps, encoder_hidden_states).sample
                loss = F.mse_loss(noise_pred, noise, reduction="none").mean([1, 2, 3]).mean()

                accelerator.backward(loss)
                if accelerator.sync_gradients:
                    accelerator.clip_grad_norm_(unet.parameters(), args.max_grad_norm)
                optimizer.step()
                optimizer.zero_grad()

            # Checks if the accelerator has performed an optimization step behind the scenes
            if accelerator.sync_gradients:
                progress_bar.update(1)
                global_step += 1

            logs = {"loss": loss.detach().item()}
            progress_bar.set_postfix(**logs)

            if global_step >= args.max_train_steps:
                break

        accelerator.wait_for_everyone()

    # Create the pipeline using the trained modules and save it
    if accelerator.is_main_process:
        print(f"Loading pipeline and saving to {args.output_dir}...")
        scheduler = PNDMScheduler(
            beta_start=0.00085,
            beta_end=0.012,
            beta_schedule="scaled_linear",
            skip_prk_steps=True,
            steps_offset=1,
        )
        pipeline = StableDiffusionPipeline(
            text_encoder=text_encoder,
            vae=vae,
            unet=accelerator.unwrap_model(unet),
            tokenizer=tokenizer,
            scheduler=scheduler,
            safety_checker=StableDiffusionSafetyChecker.from_pretrained("CompVis/stable-diffusion-safety-checker"),
            feature_extractor=feature_extractor,
        )
        pipeline.save_pretrained(args.output_dir)

Now that we have the function defined, let’s train it! Depending on the size of your dataset and type of GPU, this can take anywhere from 5 minutes to 1 hour to run:

>>> from accelerate import notebook_launcher

>>> num_of_gpus = 1  # CHANGE THIS TO MATCH THE NUMBER OF GPUS YOU HAVE
>>> notebook_launcher(training_function, args=(text_encoder, vae, unet), num_processes=num_of_gpus)

Launching training on one GPU.

If you’re running on a single GPU, you can free up some memory for the next section by copying the code below into a new cell and running it. For multi-GPU machines, 🤗 Accelerate doesn’t allow any cell to directly access the GPU with torch.cuda, so we don’t recommend using this trick in those cases:

with torch.no_grad():
    torch.cuda.empty_cache()

Step 7: Run inference and inspect generations

Now that we’ve trained the model, let’s generate some images with it to see how it fares! First we’ll load the pipeline from the output directory we save the model to:

pipe = StableDiffusionPipeline.from_pretrained(
    args.output_dir,
    torch_dtype=torch.float16,
).to("cuda")

Next, let’s generate a few images. The prompt variable will later be used to set the default on the Hugging Face Hub widget, so experiment a bit to find a good one. You might also want to try creating elaborate prompts with CLIP Interrogator:

>>> # Pick a funny prompt here and it will be used as the widget's default
>>> # when we push to the Hub in the next section
>>> prompt = f"a photo of {name_of_your_concept} {type_of_thing} in the Acropolis"

>>> # Tune the guidance to control how closely the generations follow the prompt
>>> # Values between 7-11 usually work best
>>> guidance_scale = 7

>>> num_cols = 2
>>> all_images = []
>>> for _ in range(num_cols):
...     images = pipe(prompt, guidance_scale=guidance_scale).images
...     all_images.extend(images)

>>> image_grid(all_images, 1, num_cols)

Step 8: Push your model to the Hub

If you’re happy with you model, the final step is to push it to the Hub and view it on the DreamBooth Leaderboard!

First, you’ll need to define a name for your model repo. By default, we use the unique identifier and class name, but feel free to change this if you want:

# Create a name for your model on the Hub. No spaces allowed.
model_name = f"{name_of_your_concept}-{type_of_thing}"

Next, add a brief description on the type of model you’ve trained or any other information you’d like to share:

# Describe the theme and model you've trained
description = f"""
This is a Stable Diffusion model fine-tuned on `{type_of_thing}` images for the {theme} theme.
"""

Finally, run the cell below to create a repo on the Hub and push all our files with a nice model card to boot:

>>> # Code to upload a pipeline saved locally to the hub
>>> from huggingface_hub import HfApi, ModelCard, create_repo, get_full_repo_name

>>> # Set up repo and upload files
>>> hub_model_id = get_full_repo_name(model_name)
>>> create_repo(hub_model_id)
>>> api = HfApi()
>>> api.upload_folder(folder_path=args.output_dir, path_in_repo="", repo_id=hub_model_id)

>>> content = f"""
... ---
... license: creativeml-openrail-m
... tags:
... - pytorch
... - diffusers
... - stable-diffusion
... - text-to-image
... - diffusion-models-class
... - dreambooth-hackathon
... - {theme}
... widget:
... - text: {prompt}
... ---

... # DreamBooth model for the {name_of_your_concept} concept trained by {api.whoami()["name"]} on the {dataset_id} dataset.

... This is a Stable Diffusion model fine-tuned on the {name_of_your_concept} concept with DreamBooth. It can be used by modifying the `instance_prompt`: **{instance_prompt}**

... This model was created as part of the DreamBooth Hackathon 🔥. Visit the [organisation page](https://huggingface.co./dreambooth-hackathon) for instructions on how to take part!

... ## Description

... {description}

... ## Usage

... ```python
... from diffusers import StableDiffusionPipeline

... pipeline = StableDiffusionPipeline.from_pretrained('{hub_model_id}')
... image = pipeline().images[0]
... image
... ```
... """

>>> card = ModelCard(content)
>>> hub_url = card.push_to_hub(hub_model_id)
>>> print(f"Upload successful! Model can be found here: {hub_url}")
>>> print(
...     f"View your submission on the public leaderboard here: https://huggingface.co./spaces/dreambooth-hackathon/leaderboard"
... )

Upload successful! Model can be found here: https://huggingface.co./lewtun/test-dogs/blob/main/README.md
View your submission on the public leaderboard here: https://huggingface.co./spaces/dreambooth-hackathon/leaderboard

Step 9: Celebrate 🥳

Congratulations, you’ve trained your very first DreamBooth model! You can train as many models as you want for the competition - the important thing is that the most liked models will win prizes so don’t forget to share your creation far and wide to get the most votes!

< > Update on GitHub

Diffusion Course