EasyAnimateTransformer3DModel

A Diffusion Transformer model for 3D data from EasyAnimate, introduced by Alibaba PAI.

The model can be loaded with the following code snippet.

import torch
from diffusers import EasyAnimateTransformer3DModel

transformer = EasyAnimateTransformer3DModel.from_pretrained("alibaba-pai/EasyAnimateV5.1-12b-zh", subfolder="transformer", torch_dtype=torch.float16).to("cuda")
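
In practice the transformer is usually used inside a pipeline rather than called on its own. A minimal sketch, assuming the EasyAnimateV5.1 checkpoint layout, of passing a separately loaded transformer to EasyAnimatePipeline:

import torch
from diffusers import EasyAnimatePipeline, EasyAnimateTransformer3DModel

# Load the transformer on its own (for example, in a different dtype)...
transformer = EasyAnimateTransformer3DModel.from_pretrained(
    "alibaba-pai/EasyAnimateV5.1-12b-zh", subfolder="transformer", torch_dtype=torch.bfloat16
)

# ...then hand it to the pipeline, which loads the remaining components
# (text encoder, VAE, scheduler) from the same checkpoint.
pipe = EasyAnimatePipeline.from_pretrained(
    "alibaba-pai/EasyAnimateV5.1-12b-zh", transformer=transformer, torch_dtype=torch.bfloat16
).to("cuda")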

EasyAnimateTransformer3DModel

class diffusers.EasyAnimateTransformer3DModel

( num_attention_heads: int = 48 attention_head_dim: int = 64 in_channels: typing.Optional[int] = None out_channels: typing.Optional[int] = None patch_size: typing.Optional[int] = None sample_width: int = 90 sample_height: int = 60 activation_fn: str = 'gelu-approximate' timestep_activation_fn: str = 'silu' freq_shift: int = 0 num_layers: int = 48 mmdit_layers: int = 48 dropout: float = 0.0 time_embed_dim: int = 512 add_norm_text_encoder: bool = False text_embed_dim: int = 3584 text_embed_dim_t5: int = None norm_eps: float = 1e-05 norm_elementwise_affine: bool = True flip_sin_to_cos: bool = True time_position_encoding_type: str = '3d_rope' after_norm = False resize_inpaint_mask_directly: bool = True enable_text_attention_mask: bool = True add_noise_in_inpaint_model: bool = True )

Parameters

  • num_attention_heads (int, defaults to 48) — The number of heads to use for multi-head attention.
  • attention_head_dim (int, defaults to 64) — The number of channels in each head.
  • in_channels (int, optional) — The number of channels in the input.
  • out_channels (int, optional) — The number of channels in the output.
  • patch_size (int, optional) — The size of the patches to use in the patch embedding layer.
  • sample_width (int, defaults to 90) — The width of the input latents.
  • sample_height (int, defaults to 60) — The height of the input latents.
  • activation_fn (str, defaults to "gelu-approximate") — Activation function to use in feed-forward.
  • timestep_activation_fn (str, defaults to "silu") — Activation function to use when generating the timestep embeddings.
  • num_layers (int, defaults to 48) — The number of layers of Transformer blocks to use.
  • mmdit_layers (int, defaults to 48) — The number of layers of Multi Modal Transformer (MMDiT) blocks to use.
  • dropout (float, defaults to 0.0) — The dropout probability to use.
  • time_embed_dim (int, defaults to 512) — Output dimension of timestep embeddings.
  • text_embed_dim (int, defaults to 3584) — Input dimension of text embeddings from the text encoder.
  • norm_eps (float, defaults to 1e-5) — The epsilon value to use in normalization layers.
  • norm_elementwise_affine (bool, defaults to True) — Whether to use elementwise affine in normalization layers.
  • flip_sin_to_cos (bool, defaults to True) — Whether to flip the sin to cos in the time embedding.
  • time_position_encoding_type (str, defaults to "3d_rope") — Type of time position encoding.
  • after_norm (bool, defaults to False) — Whether to apply an additional normalization step after the transformer blocks.
  • resize_inpaint_mask_directly (bool, defaults to True) — Whether to resize the inpainting mask directly to the latent resolution.
  • enable_text_attention_mask (bool, defaults to True) — Whether to apply an attention mask to the text embeddings.
  • add_noise_in_inpaint_model (bool, defaults to True) — Whether to add noise to the masked region in the inpainting model.

A Transformer model for video-like data in EasyAnimate.
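
The configuration arguments above can also be passed directly to the constructor. A minimal sketch with a deliberately tiny, hypothetical configuration for quick experimentation (released checkpoints use the defaults documented above):

from diffusers import EasyAnimateTransformer3DModel

# Toy configuration; the sizes here are illustrative and do not
# correspond to any released EasyAnimate checkpoint.
model = EasyAnimateTransformer3DModel(
    num_attention_heads=4,
    attention_head_dim=8,
    in_channels=16,
    out_channels=16,
    patch_size=2,
    num_layers=2,
    mmdit_layers=2,
    text_embed_dim=32,
)
print(sum(p.numel() for p in model.parameters()))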

Transformer2DModelOutput

class diffusers.models.modeling_outputs.Transformer2DModelOutput

( sample: torch.Tensor )

Parameters

  • sample (torch.Tensor of shape (batch_size, num_channels, height, width) or (batch_size, num_vector_embeds - 1, num_latent_pixels) if Transformer2DModel is discrete) — The hidden states output conditioned on the encoder_hidden_states input. If discrete, returns probability distributions for the unnoised latent pixels.

The output of Transformer2DModel.
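
Downstream code reads the prediction from the sample attribute. A minimal sketch, constructing the dataclass directly to show field access (the tensor shape here is illustrative; in practice the object is returned by the transformer's forward pass):

import torch
from diffusers.models.modeling_outputs import Transformer2DModelOutput

out = Transformer2DModelOutput(sample=torch.zeros(1, 16, 60, 90))
print(out.sample.shape)  # torch.Size([1, 16, 60, 90])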
