Diffusers documentation

Create a dataset for training

There are many datasets on the Hub to train a model on, but if you can’t find one you’re interested in or want to use your own, you can create a dataset with the 🤗 Datasets library. The dataset structure depends on the task you want to train your model on. The most basic dataset structure is a directory of images for tasks like unconditional image generation. Another dataset structure may be a directory of images and a text file containing their corresponding text captions for tasks like text-to-image generation.

This guide will show you two ways to create a dataset to finetune on:

  • provide a folder of images to the --train_data_dir argument
  • upload a dataset to the Hub and pass the dataset repository id to the --dataset_name argument

💡 Learn more about how to create an image dataset for training in the Create an image dataset guide.
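For the captioned layout described above, the ImageFolder builder can read captions from a metadata.jsonl file stored next to the images, where each line carries a file_name key. The following is a minimal sketch of writing such a file. The helper name write_metadata is hypothetical, and the caption key "text" is an assumption: use whichever column name your training script expects.

```python
import json
from pathlib import Path


def write_metadata(image_dir, captions):
    """Write a metadata.jsonl file pairing each image with its caption.

    `captions` maps a file name relative to `image_dir` (e.g. "xxx.png")
    to its caption text. Returns the path of the file that was written.
    """
    metadata_path = Path(image_dir) / "metadata.jsonl"
    with metadata_path.open("w", encoding="utf-8") as f:
        for file_name, text in sorted(captions.items()):
            # ImageFolder links rows to images through the `file_name` key.
            # `text` is an assumed column name for the caption.
            f.write(json.dumps({"file_name": file_name, "text": text}) + "\n")
    return metadata_path
```

Calling write_metadata("data_dir", {"xxx.png": "a caption"}) writes data_dir/metadata.jsonl with one JSON object per line.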

Provide a dataset as a folder

For unconditional generation, you can provide your own dataset as a folder of images. The training script uses the ImageFolder builder from 🤗 Datasets to automatically build a dataset from the folder. Your directory structure should look like:

data_dir/xxx.png
data_dir/xxy.png
data_dir/[...]/xxz.png

Pass the path to the dataset directory to the --train_data_dir argument, and then you can start training:

accelerate launch train_unconditional.py \
    --train_data_dir <path-to-train-directory> \
    <other-arguments>
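Before launching a run, it can help to confirm that the folder actually contains the images you expect, including those in subfolders. The sketch below is not part of the training script; find_images and the IMAGE_EXTENSIONS set are hypothetical names chosen for illustration, and the extension list is an assumption you should adapt.

```python
from pathlib import Path

# Assumed set of extensions for this sketch; adjust to the formats you use.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp", ".bmp"}


def find_images(data_dir):
    """Return sorted paths of image files under `data_dir`, including subfolders."""
    root = Path(data_dir)
    return sorted(
        p
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
```

For example, len(find_images("data_dir")) gives the number of image files found with those extensions.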

Upload your data to the Hub

💡 For more details and context about creating and uploading a dataset to the Hub, take a look at the Image search with 🤗 Datasets post.

Start by creating a dataset with the ImageFolder feature, which creates an image column containing the PIL-encoded images.

You can use the data_dir or data_files parameters to specify the location of the dataset. The data_files parameter supports mapping specific files to dataset splits like train or test:

from datasets import load_dataset

# example 1: local folder
dataset = load_dataset("imagefolder", data_dir="path_to_your_folder")

# example 2: local files (supported formats are tar, gzip, zip, xz, rar, zstd)
dataset = load_dataset("imagefolder", data_files="path_to_zip_file")

# example 3: remote files (supported formats are tar, gzip, zip, xz, rar, zstd)
dataset = load_dataset(
    "imagefolder",
    data_files="https://download.microsoft.com/download/3/E/1/3E1C3F21-ECDB-4869-8368-6DEBA77B919F/kagglecatsanddogs_3367a.zip",
)

# example 4: providing several splits
dataset = load_dataset(
    "imagefolder",
    data_files={
        "train": ["path/to/file1", "path/to/file2"],
        "test": ["path/to/file3", "path/to/file4"],
    },
)
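When the files for each split live in their own subfolder, the data_files mapping from example 4 can be built programmatically instead of typed by hand. This is a rough sketch under that assumed layout (root/train/..., root/test/...); split_data_files and load_splits are hypothetical helper names, and load_splits needs the datasets package installed.

```python
from pathlib import Path


def split_data_files(root):
    """Map each non-empty immediate subfolder of `root` to a sorted list of its files."""
    data_files = {}
    for split_dir in sorted(Path(root).iterdir()):
        if split_dir.is_dir():
            files = sorted(str(p) for p in split_dir.rglob("*") if p.is_file())
            if files:
                data_files[split_dir.name] = files
    return data_files


def load_splits(root):
    """Load the splits with ImageFolder, using the mapping built above."""
    # Imported lazily so split_data_files stays usable without `datasets`.
    from datasets import load_dataset

    return load_dataset("imagefolder", data_files=split_data_files(root))
```

The subfolder names become the split names, so a layout with train and test folders yields a dataset with those two splits.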

Then use the push_to_hub method to upload the dataset to the Hub:

# assuming you have run the huggingface-cli login command in a terminal
dataset.push_to_hub("name_of_your_dataset")

# if you want to push to a private repo, simply pass private=True:
dataset.push_to_hub("name_of_your_dataset", private=True)

Now the dataset is available for training by passing the dataset name to the --dataset_name argument:

accelerate launch --mixed_precision="fp16" train_text_to_image.py \
  --pretrained_model_name_or_path="stable-diffusion-v1-5/stable-diffusion-v1-5" \
  --dataset_name="name_of_your_dataset" \
  <other-arguments>

Next steps

Now that you’ve created a dataset, you can plug it into the train_data_dir (if your dataset is local) or dataset_name (if your dataset is on the Hub) arguments of a training script.
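The choice between the two arguments can be sketched as a small helper that checks whether the value is a local directory. This is only an illustration of the rule stated above: dataset_argument is a hypothetical name, and treating every non-directory value as a Hub repository id is a simplifying assumption.

```python
import os


def dataset_argument(dataset):
    """Return the training-script flag and value for a local folder or a Hub id."""
    if os.path.isdir(dataset):
        # A local folder of images goes to --train_data_dir.
        return ["--train_data_dir", dataset]
    # Assumption: anything that is not a local directory is a Hub repository id.
    return ["--dataset_name", dataset]
```

The returned list can be appended to the rest of the command-line arguments when assembling an accelerate launch call.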

As a next step, try using your dataset to train a model for unconditional generation or text-to-image generation!
