Diffusers documentation

You are viewing v0.3.0 version. A newer version v0.31.0 is available.

Models

Diffusers contains pretrained models for popular algorithms and modules for creating the next set of diffusion models. The primary function of these models is to denoise an input sample by modeling the distribution $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$. The models are built on the base class ModelMixin, a torch.nn.Module with basic functionality for saving and loading models both locally and from the Hugging Face Hub.
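
For orientation, the reverse transition is often written as a Gaussian whose mean (and optionally covariance) is predicted by the network. This is the common DDPM-style parameterization and is an assumption here; this page does not commit to a particular form, and $\boldsymbol{\mu}_\theta$ and $\boldsymbol{\Sigma}_\theta$ are illustrative symbols:

$$
p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \mathcal{N}\big(\mathbf{x}_{t-1};\ \boldsymbol{\mu}_\theta(\mathbf{x}_t, t),\ \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)\big)
$$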

ModelMixin

class diffusers.ModelMixin

( )

Base class for all models.

ModelMixin takes care of storing the configuration of the models and handles methods for loading, downloading and saving models.

  • config_name (str) — A filename under which the model should be stored when calling save_pretrained().

from_pretrained

( pretrained_model_name_or_path: typing.Union[str, os.PathLike, NoneType], **kwargs )

Parameters

  • pretrained_model_name_or_path (str or os.PathLike, optional) — Can be either:

    • A string, the model id of a pretrained model hosted inside a model repo on huggingface.co. Valid model ids should have an organization name, like google/ddpm-celebahq-256.
    • A path to a directory containing model weights saved using save_pretrained(), e.g., ./my_model_directory/.
  • cache_dir (Union[str, os.PathLike], optional) — Path to a directory in which a downloaded pretrained model configuration should be cached if the standard cache should not be used.
  • torch_dtype (str or torch.dtype, optional) — Override the default torch.dtype and load the model under this dtype. If "auto" is passed the dtype will be automatically derived from the model’s weights.
  • force_download (bool, optional, defaults to False) — Whether or not to force the (re-)download of the model weights and configuration files, overriding the cached versions if they exist.
  • resume_download (bool, optional, defaults to False) — Whether or not to delete incompletely received files. Will attempt to resume the download if such a file exists.
  • proxies (Dict[str, str], optional) — A dictionary of proxy servers to use by protocol or endpoint, e.g., {'http': 'foo.bar:3128', 'http://hostname': 'foo.bar:4012'}. The proxies are used on each request.
  • output_loading_info (bool, optional, defaults to False) — Whether or not to also return a dictionary containing missing keys, unexpected keys and error messages.
  • local_files_only (bool, optional, defaults to False) — Whether or not to only look at local files (i.e., do not try to download the model).
  • use_auth_token (str or bool, optional) — The token to use as HTTP bearer authorization for remote files. If True, will use the token generated when running diffusers-cli login (stored in ~/.huggingface).
  • revision (str, optional, defaults to "main") — The specific model version to use. Because models and other artifacts on huggingface.co are stored in a git-based system, revision can be a branch name, a tag name, a commit id, or any other identifier allowed by git.
  • mirror (str, optional) — Mirror source to accelerate downloads in China. If you are from China and have an accessibility problem, you can set this option to resolve it. Note that we do not guarantee the timeliness or safety. Please refer to the mirror site for more information.

Instantiate a pretrained PyTorch model from a pretrained model configuration.

The model is set in evaluation mode by default using model.eval() (Dropout modules are deactivated). To train the model, you should first set it back in training mode with model.train().

The warning Weights from XXX not initialized from pretrained model means that the weights of XXX do not come pretrained with the rest of the model. It is up to you to train those weights with a downstream fine-tuning task.

The warning Weights from XXX not used in YYY means that the layer XXX is not used by YYY, therefore those weights are discarded.

Passing use_auth_token=True is required when you want to use a private model.

Activate the special “offline-mode” to use this method in a firewalled environment.

num_parameters

( only_trainable: bool = False, exclude_embeddings: bool = False ) → int

Parameters

  • only_trainable (bool, optional, defaults to False) — Whether or not to return only the number of trainable parameters.
  • exclude_embeddings (bool, optional, defaults to False) — Whether or not to return only the number of non-embedding parameters.

Returns

int

The number of parameters.

Get number of (optionally, trainable or non-embeddings) parameters in the module.
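
Conceptually, the count is a sum of element counts over the module's parameters. The sketch below uses a hypothetical helper, count_parameters, written against plain torch.nn; it is not the library's implementation and does not reproduce the exclude_embeddings option.

```python
import torch.nn as nn


def count_parameters(module: nn.Module, only_trainable: bool = False) -> int:
    """Hypothetical helper: sum the element counts of the module's parameters."""
    return sum(
        p.numel()
        for p in module.parameters()
        if p.requires_grad or not only_trainable
    )


layer = nn.Linear(4, 3)            # 4 * 3 weights + 3 biases = 15 parameters
layer.bias.requires_grad_(False)   # freeze the bias

total = count_parameters(layer)                            # all parameters
trainable = count_parameters(layer, only_trainable=True)   # weights only
```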

save_pretrained

( save_directory: typing.Union[str, os.PathLike], is_main_process: bool = True, save_function: typing.Callable = torch.save )

Parameters

  • save_directory (str or os.PathLike) — Directory to which to save. Will be created if it doesn’t exist.
  • is_main_process (bool, optional, defaults to True) — Whether the process calling this is the main process or not. Useful in distributed training (e.g., on TPUs) when this function needs to be called on all processes. In this case, set is_main_process=True only on the main process to avoid race conditions.
  • save_function (Callable) — The function to use to save the state dictionary. Useful in distributed training (e.g., on TPUs) when torch.save needs to be replaced by another method.

Save a model and its configuration file to a directory, so that it can be re-loaded using the from_pretrained() class method.

UNet2DOutput

class diffusers.models.unet_2d.UNet2DOutput

( sample: FloatTensor )

Parameters

  • sample (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Hidden states output. Output of last layer of model.

UNet2DModel

class diffusers.UNet2DModel

( sample_size: typing.Optional[int] = None, in_channels: int = 3, out_channels: int = 3, center_input_sample: bool = False, time_embedding_type: str = 'positional', freq_shift: int = 0, flip_sin_to_cos: bool = True, down_block_types: typing.Tuple[str] = ('DownBlock2D', 'AttnDownBlock2D', 'AttnDownBlock2D', 'AttnDownBlock2D'), up_block_types: typing.Tuple[str] = ('AttnUpBlock2D', 'AttnUpBlock2D', 'AttnUpBlock2D', 'UpBlock2D'), block_out_channels: typing.Tuple[int] = (224, 448, 672, 896), layers_per_block: int = 2, mid_block_scale_factor: float = 1, downsample_padding: int = 1, act_fn: str = 'silu', attention_head_dim: int = 8, norm_num_groups: int = 32, norm_eps: float = 1e-05 )

Parameters

  • sample_size (int, optional) — Input sample size.
  • in_channels (int, optional, defaults to 3) — Number of channels in the input image.
  • out_channels (int, optional, defaults to 3) — Number of channels in the output.
  • center_input_sample (bool, optional, defaults to False) — Whether to center the input sample.
  • time_embedding_type (str, optional, defaults to "positional") — Type of time embedding to use.
  • freq_shift (int, optional, defaults to 0) — Frequency shift for Fourier time embedding.
  • flip_sin_to_cos (bool, optional, defaults to True) — Whether to flip sin to cos for Fourier time embedding.
  • down_block_types (Tuple[str], optional, defaults to ("DownBlock2D", "AttnDownBlock2D", "AttnDownBlock2D", "AttnDownBlock2D")) — Tuple of downsample block types.
  • up_block_types (Tuple[str], optional, defaults to ("AttnUpBlock2D", "AttnUpBlock2D", "AttnUpBlock2D", "UpBlock2D")) — Tuple of upsample block types.
  • block_out_channels (Tuple[int], optional, defaults to (224, 448, 672, 896)) — Tuple of block output channels.
  • layers_per_block (int, optional, defaults to 2) — The number of layers per block.
  • mid_block_scale_factor (float, optional, defaults to 1) — The scale factor for the mid block.
  • downsample_padding (int, optional, defaults to 1) — The padding for the downsample convolution.
  • act_fn (str, optional, defaults to "silu") — The activation function to use.
  • attention_head_dim (int, optional, defaults to 8) — The attention head dimension.
  • norm_num_groups (int, optional, defaults to 32) — The number of groups for the normalization.
  • norm_eps (float, optional, defaults to 1e-5) — The epsilon for the normalization.

UNet2DModel is a 2D UNet model that takes in a noisy sample and a timestep and returns a sample-shaped output.

This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all models (such as downloading or saving).
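
To make the freq_shift and flip_sin_to_cos parameters more concrete, here is a sketch of a sinusoidal timestep embedding of the kind the "positional" option refers to. The helper name sinusoidal_embedding, the max_period constant, and the exact formula are assumptions for illustration and may differ from what the library does internally.

```python
import math

import torch


def sinusoidal_embedding(timesteps, dim, flip_sin_to_cos=True, freq_shift=0.0, max_period=10000.0):
    """Hypothetical helper: sin/cos features at geometrically spaced frequencies (dim must be even)."""
    half = dim // 2
    # Frequencies decay from 1 towards 1 / max_period; freq_shift adjusts the spacing.
    exponent = -math.log(max_period) * torch.arange(half, dtype=torch.float32) / (half - freq_shift)
    angles = timesteps.float()[:, None] * torch.exp(exponent)[None, :]
    sin, cos = torch.sin(angles), torch.cos(angles)
    # flip_sin_to_cos only changes the order of the two halves.
    if flip_sin_to_cos:
        return torch.cat([cos, sin], dim=-1)
    return torch.cat([sin, cos], dim=-1)


emb = sinusoidal_embedding(torch.tensor([0, 10, 500]), dim=8)   # shape (3, 8)
```

At timestep 0 every angle is zero, so the cosine half is all ones and the sine half is all zeros, which gives a quick sanity check on the ordering.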

forward

( sample: FloatTensor, timestep: typing.Union[torch.Tensor, float, int], return_dict: bool = True ) → UNet2DOutput or tuple

Parameters

  • sample (torch.FloatTensor) — (batch, channel, height, width) noisy inputs tensor
  • timestep (torch.FloatTensor or float or int) — (batch) timesteps
  • return_dict (bool, optional, defaults to True) — Whether or not to return a UNet2DOutput instead of a plain tuple.

Returns

UNet2DOutput or tuple

UNet2DOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is the sample tensor.

UNet2DConditionOutput

class diffusers.models.unet_2d_condition.UNet2DConditionOutput

( sample: FloatTensor )

Parameters

  • sample (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Hidden states conditioned on encoder_hidden_states input. Output of last layer of model.

UNet2DConditionModel

class diffusers.UNet2DConditionModel

( sample_size: typing.Optional[int] = None, in_channels: int = 4, out_channels: int = 4, center_input_sample: bool = False, flip_sin_to_cos: bool = True, freq_shift: int = 0, down_block_types: typing.Tuple[str] = ('CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'CrossAttnDownBlock2D', 'DownBlock2D'), up_block_types: typing.Tuple[str] = ('UpBlock2D', 'CrossAttnUpBlock2D', 'CrossAttnUpBlock2D', 'CrossAttnUpBlock2D'), block_out_channels: typing.Tuple[int] = (320, 640, 1280, 1280), layers_per_block: int = 2, downsample_padding: int = 1, mid_block_scale_factor: float = 1, act_fn: str = 'silu', norm_num_groups: int = 32, norm_eps: float = 1e-05, cross_attention_dim: int = 1280, attention_head_dim: int = 8 )

Parameters

  • sample_size (int, optional) — The size of the input sample.
  • in_channels (int, optional, defaults to 4) — The number of channels in the input sample.
  • out_channels (int, optional, defaults to 4) — The number of channels in the output.
  • center_input_sample (bool, optional, defaults to False) — Whether to center the input sample.
  • flip_sin_to_cos (bool, optional, defaults to True) — Whether to flip the sin to cos in the time embedding.
  • freq_shift (int, optional, defaults to 0) — The frequency shift to apply to the time embedding.
  • down_block_types (Tuple[str], optional, defaults to ("CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "DownBlock2D")) — The tuple of downsample blocks to use.
  • up_block_types (Tuple[str], optional, defaults to ("UpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D")) — The tuple of upsample blocks to use.
  • block_out_channels (Tuple[int], optional, defaults to (320, 640, 1280, 1280)) — The tuple of output channels for each block.
  • layers_per_block (int, optional, defaults to 2) — The number of layers per block.
  • downsample_padding (int, optional, defaults to 1) — The padding to use for the downsampling convolution.
  • mid_block_scale_factor (float, optional, defaults to 1.0) — The scale factor to use for the mid block.
  • act_fn (str, optional, defaults to "silu") — The activation function to use.
  • norm_num_groups (int, optional, defaults to 32) — The number of groups to use for the normalization.
  • norm_eps (float, optional, defaults to 1e-5) — The epsilon to use for the normalization.
  • cross_attention_dim (int, optional, defaults to 1280) — The dimension of the cross attention features.
  • attention_head_dim (int, optional, defaults to 8) — The dimension of the attention heads.

UNet2DConditionModel is a conditional 2D UNet model that takes in a noisy sample, a conditional state, and a timestep and returns a sample-shaped output.

This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all models (such as downloading or saving).

forward

( sample: FloatTensor, timestep: typing.Union[torch.Tensor, float, int], encoder_hidden_states: Tensor, return_dict: bool = True ) → UNet2DConditionOutput or tuple

Parameters

  • sample (torch.FloatTensor) — (batch, channel, height, width) noisy inputs tensor
  • timestep (torch.FloatTensor or float or int) — (batch) timesteps
  • encoder_hidden_states (torch.FloatTensor) — (batch, sequence_length, feature_dim) encoder hidden states
  • return_dict (bool, optional, defaults to True) — Whether or not to return a models.unet_2d_condition.UNet2DConditionOutput instead of a plain tuple.

Returns

UNet2DConditionOutput or tuple

UNet2DConditionOutput if return_dict is True, otherwise a tuple. When returning a tuple, the first element is the sample tensor.

DecoderOutput

class diffusers.models.vae.DecoderOutput

( sample: FloatTensor )

Parameters

  • sample (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Decoded output sample of the model. Output of the last layer of the model.

Output of decoding method.

VQEncoderOutput

class diffusers.models.vae.VQEncoderOutput

( latents: FloatTensor )

Parameters

  • latents (torch.FloatTensor of shape (batch_size, num_channels, height, width)) — Encoded output sample of the model. Output of the last layer of the model.

Output of VQModel encoding method.

VQModel

class diffusers.VQModel

( in_channels: int = 3, out_channels: int = 3, down_block_types: typing.Tuple[str] = ('DownEncoderBlock2D',), up_block_types: typing.Tuple[str] = ('UpDecoderBlock2D',), block_out_channels: typing.Tuple[int] = (64,), layers_per_block: int = 1, act_fn: str = 'silu', latent_channels: int = 3, sample_size: int = 32, num_vq_embeddings: int = 256 )

Parameters

  • in_channels (int, optional, defaults to 3) — Number of channels in the input image.
  • out_channels (int, optional, defaults to 3) — Number of channels in the output.
  • down_block_types (Tuple[str], optional, defaults to ("DownEncoderBlock2D",)) — Tuple of downsample block types.
  • up_block_types (Tuple[str], optional, defaults to ("UpDecoderBlock2D",)) — Tuple of upsample block types.
  • block_out_channels (Tuple[int], optional, defaults to (64,)) — Tuple of block output channels.
  • act_fn (str, optional, defaults to "silu") — The activation function to use.
  • latent_channels (int, optional, defaults to 3) — Number of channels in the latent space.
  • sample_size (int, optional, defaults to 32) — TODO
  • num_vq_embeddings (int, optional, defaults to 256) — Number of codebook vectors in the VQ-VAE.

VQ-VAE model from the paper Neural Discrete Representation Learning by Aaron van den Oord, Oriol Vinyals and Koray Kavukcuoglu.

This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all models (such as downloading or saving).

forward

( sample: FloatTensor, return_dict: bool = True )

Parameters

  • sample (torch.FloatTensor) — Input sample.
  • return_dict (bool, optional, defaults to True) — Whether or not to return a DecoderOutput instead of a plain tuple.

AutoencoderKLOutput

class diffusers.models.vae.AutoencoderKLOutput

( latent_dist: DiagonalGaussianDistribution )

Parameters

  • latent_dist (DiagonalGaussianDistribution) — Encoded outputs of Encoder represented as the mean and logvar of DiagonalGaussianDistribution. DiagonalGaussianDistribution allows for sampling latents from the distribution.

Output of AutoencoderKL encoding method.
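
Sampling latents from a diagonal Gaussian given its mean and logvar is usually done with the reparameterization z = mean + std * eps. The helper below, sample_diagonal_gaussian, is a hypothetical stand-in written in plain torch to show the idea; it is not the library's DiagonalGaussianDistribution class.

```python
import torch


def sample_diagonal_gaussian(mean, logvar, generator=None):
    """Hypothetical helper: draw z = mean + std * eps with eps ~ N(0, I)."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn(mean.shape, generator=generator)
    return mean + std * eps


mean = torch.zeros(1, 4, 8, 8)
logvar = torch.full((1, 4, 8, 8), -30.0)   # tiny variance, so samples collapse onto the mean
z = sample_diagonal_gaussian(mean, logvar, generator=torch.Generator().manual_seed(0))
```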

AutoencoderKL

class diffusers.AutoencoderKL

( in_channels: int = 3, out_channels: int = 3, down_block_types: typing.Tuple[str] = ('DownEncoderBlock2D',), up_block_types: typing.Tuple[str] = ('UpDecoderBlock2D',), block_out_channels: typing.Tuple[int] = (64,), layers_per_block: int = 1, act_fn: str = 'silu', latent_channels: int = 4, sample_size: int = 32 )

Parameters

  • in_channels (int, optional, defaults to 3) — Number of channels in the input image.
  • out_channels (int, optional, defaults to 3) — Number of channels in the output.
  • down_block_types (Tuple[str], optional, defaults to ("DownEncoderBlock2D",)) — Tuple of downsample block types.
  • up_block_types (Tuple[str], optional, defaults to ("UpDecoderBlock2D",)) — Tuple of upsample block types.
  • block_out_channels (Tuple[int], optional, defaults to (64,)) — Tuple of block output channels.
  • act_fn (str, optional, defaults to "silu") — The activation function to use.
  • latent_channels (int, optional, defaults to 4) — Number of channels in the latent space.
  • sample_size (int, optional, defaults to 32) — TODO

Variational Autoencoder (VAE) model with KL loss from the paper Auto-Encoding Variational Bayes by Diederik P. Kingma and Max Welling.

This model inherits from ModelMixin. Check the superclass documentation for the generic methods the library implements for all models (such as downloading or saving).

forward

( sample: FloatTensor, sample_posterior: bool = False, return_dict: bool = True )

Parameters

  • sample (torch.FloatTensor) — Input sample.
  • sample_posterior (bool, optional, defaults to False) — Whether to sample from the posterior.
  • return_dict (bool, optional, defaults to True) — Whether or not to return a DecoderOutput instead of a plain tuple.