EasyAnimate

EasyAnimate by Alibaba PAI.

The description from its GitHub page: EasyAnimate is a pipeline based on the transformer architecture, designed for generating AI images and videos, and for training baseline models and LoRA models for Diffusion Transformer. We support direct prediction from pre-trained EasyAnimate models, allowing for the generation of videos with various resolutions, approximately 6 seconds in length, at 8fps (EasyAnimateV5.1, 1 to 49 frames). Additionally, users can train their own baseline and LoRA models for specific style transformations.

This pipeline was contributed by bubbliiiing. The original codebase can be found here. The original weights can be found under hf.co/alibaba-pai.

There are two official EasyAnimate checkpoints for text-to-video and video-to-video.

| checkpoints | recommended inference dtype |
|:---|:---|
| alibaba-pai/EasyAnimateV5.1-12b-zh | torch.float16 |
| alibaba-pai/EasyAnimateV5.1-12b-zh-InP | torch.float16 |

There is one official EasyAnimate checkpoint available for image-to-video and video-to-video.

| checkpoints | recommended inference dtype |
|:---|:---|
| alibaba-pai/EasyAnimateV5.1-12b-zh-InP | torch.float16 |

There are two official EasyAnimate checkpoints available for control-to-video.

| checkpoints | recommended inference dtype |
|:---|:---|
| alibaba-pai/EasyAnimateV5.1-12b-zh-Control | torch.float16 |
| alibaba-pai/EasyAnimateV5.1-12b-zh-Control-Camera | torch.float16 |

For the EasyAnimateV5.1 series:

  • Text-to-video (T2V) and image-to-video (I2V) work at multiple resolutions. The width and height can vary from 256 to 1024.
  • Both T2V and I2V models support generation with 1 to 49 frames and work best at 49 frames. Exporting videos at 8 FPS is recommended, as in the sketch below.
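
As a quick illustration of these limits, here is a minimal text-to-video sketch (512x512, 49 frames, exported at 8 FPS). It reuses the checkpoint, prompt, and negative prompt from the examples on this page and is intended as a starting point rather than an official recipe.

import torch
from diffusers import EasyAnimatePipeline
from diffusers.utils import export_to_video

# Load the 12B text-to-video checkpoint in float16 (see the table above)
pipe = EasyAnimatePipeline.from_pretrained(
    "alibaba-pai/EasyAnimateV5.1-12b-zh", torch_dtype=torch.float16
).to("cuda")

# Width and height may range from 256 to 1024; 49 frames at 8 fps is roughly a 6-second clip
video = pipe(
    prompt="A cat walks on the grass, realistic style.",
    negative_prompt="bad detailed",
    height=512,
    width=512,
    num_frames=49,
    num_inference_steps=50,
).frames[0]
export_to_video(video, "cat.mp4", fps=8)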

Quantization

Quantization helps reduce the memory requirements of very large models by storing model weights in a lower precision data type. However, quantization may have varying impact on video quality depending on the video model.

Refer to the Quantization overview to learn more about the supported quantization backends and how to choose one that fits your use case. The example below demonstrates how to load a quantized EasyAnimatePipeline for inference with bitsandbytes.

import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, EasyAnimateTransformer3DModel, EasyAnimatePipeline
from diffusers.utils import export_to_video

# Quantize only the transformer, the largest component, to 8-bit with bitsandbytes
quant_config = DiffusersBitsAndBytesConfig(load_in_8bit=True)
transformer_8bit = EasyAnimateTransformer3DModel.from_pretrained(
    "alibaba-pai/EasyAnimateV5.1-12b-zh",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

# Load the rest of the pipeline in float16 and plug in the quantized transformer
pipeline = EasyAnimatePipeline.from_pretrained(
    "alibaba-pai/EasyAnimateV5.1-12b-zh",
    transformer=transformer_8bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)

prompt = "A cat walks on the grass, realistic style."
negative_prompt = "bad detailed"

# 49 frames exported at 8 fps gives a clip of roughly 6 seconds
video = pipeline(prompt=prompt, negative_prompt=negative_prompt, num_frames=49, num_inference_steps=30).frames[0]
export_to_video(video, "cat.mp4", fps=8)
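
If the 8-bit weights are still too large for your GPU, a 4-bit NF4 configuration is a common alternative with the same bitsandbytes backend. This is a sketch rather than an officially documented recipe; check the quality/memory trade-off for your use case.

import torch
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig, EasyAnimateTransformer3DModel, EasyAnimatePipeline

# 4-bit NF4 quantization of the transformer; compute still runs in float16
quant_config = DiffusersBitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
transformer_4bit = EasyAnimateTransformer3DModel.from_pretrained(
    "alibaba-pai/EasyAnimateV5.1-12b-zh",
    subfolder="transformer",
    quantization_config=quant_config,
    torch_dtype=torch.float16,
)

pipeline = EasyAnimatePipeline.from_pretrained(
    "alibaba-pai/EasyAnimateV5.1-12b-zh",
    transformer=transformer_4bit,
    torch_dtype=torch.float16,
    device_map="balanced",
)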

EasyAnimatePipeline

class diffusers.EasyAnimatePipeline


( vae: AutoencoderKLMagvit text_encoder: typing.Union[transformers.models.qwen2_vl.modeling_qwen2_vl.Qwen2VLForConditionalGeneration, transformers.models.bert.modeling_bert.BertModel] tokenizer: typing.Union[transformers.models.qwen2.tokenization_qwen2.Qwen2Tokenizer, transformers.models.bert.tokenization_bert.BertTokenizer] transformer: EasyAnimateTransformer3DModel scheduler: FlowMatchEulerDiscreteScheduler )

Parameters

  • vae (AutoencoderKLMagvit) — Variational Auto-Encoder (VAE) Model to encode and decode video to and from latent representations.
  • text_encoder (Optional[~transformers.Qwen2VLForConditionalGeneration, ~transformers.BertModel]) — The text encoder; EasyAnimate uses Qwen2-VL in V5.1.
  • tokenizer (Optional[~transformers.Qwen2Tokenizer, ~transformers.BertTokenizer]) — A Qwen2Tokenizer or BertTokenizer to tokenize text.
  • transformer (EasyAnimateTransformer3DModel) — The EasyAnimate transformer model designed by the EasyAnimate team.
  • scheduler (FlowMatchEulerDiscreteScheduler) — A scheduler to be used in combination with EasyAnimate to denoise the encoded image latents.

Pipeline for text-to-video generation using EasyAnimate.

This model inherits from DiffusionPipeline. Check the superclass documentation for the generic methods the library implements for all the pipelines (such as downloading or saving, running on a particular device, etc.)

EasyAnimate uses a single text encoder, Qwen2-VL, in V5.1.

__call__


( prompt: typing.Union[str, typing.List[str]] = None num_frames: typing.Optional[int] = 49 height: typing.Optional[int] = 512 width: typing.Optional[int] = 512 num_inference_steps: typing.Optional[int] = 50 guidance_scale: typing.Optional[float] = 5.0 negative_prompt: typing.Union[str, typing.List[str], NoneType] = None num_images_per_prompt: typing.Optional[int] = 1 eta: typing.Optional[float] = 0.0 generator: typing.Union[torch._C.Generator, typing.List[torch._C.Generator], NoneType] = None latents: typing.Optional[torch.Tensor] = None prompt_embeds: typing.Optional[torch.Tensor] = None timesteps: typing.Optional[typing.List[int]] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None output_type: typing.Optional[str] = 'pil' return_dict: bool = True callback_on_step_end: typing.Union[typing.Callable[[int, int, typing.Dict], NoneType], diffusers.callbacks.PipelineCallback, diffusers.callbacks.MultiPipelineCallbacks, NoneType] = None callback_on_step_end_tensor_inputs: typing.List[str] = ['latents'] guidance_rescale: float = 0.0 ) EasyAnimatePipelineOutput or tuple

Returns

EasyAnimatePipelineOutput or tuple

If return_dict is True, EasyAnimatePipelineOutput is returned; otherwise a tuple is returned whose first element is a list with the generated video frames.

Generates images or video using the EasyAnimate pipeline based on the provided prompts.

Examples:

>>> import torch
>>> from diffusers import EasyAnimatePipeline
>>> from diffusers.utils import export_to_video

>>> # Models: "alibaba-pai/EasyAnimateV5.1-12b-zh"
>>> pipe = EasyAnimatePipeline.from_pretrained(
...     "alibaba-pai/EasyAnimateV5.1-7b-zh-diffusers", torch_dtype=torch.float16
... ).to("cuda")
>>> prompt = (
...     "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
...     "The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
...     "pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
...     "casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
...     "The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
...     "atmosphere of this unique musical performance."
... )
>>> sample_size = (512, 512)
>>> video = pipe(
...     prompt=prompt,
...     guidance_scale=6,
...     negative_prompt="bad detailed",
...     height=sample_size[0],
...     width=sample_size[1],
...     num_inference_steps=50,
... ).frames[0]
>>> export_to_video(video, "output.mp4", fps=8)

Parameters

  • prompt (str or List[str], optional) — Text prompts to guide the image or video generation. If not provided, use prompt_embeds instead.
  • num_frames (int, optional) — Length of the generated video (in frames).
  • height (int, optional) — Height of the generated image in pixels.
  • width (int, optional) — Width of the generated image in pixels.
  • num_inference_steps (int, optional, defaults to 50) — Number of denoising steps during generation. More steps generally yield higher quality images but slow down inference.
  • guidance_scale (float, optional, defaults to 5.0) — Encourages the model to align outputs with prompts. A higher value may decrease image quality.
  • negative_prompt (str or List[str], optional) — Prompts indicating what to exclude in generation. If not specified, use negative_prompt_embeds instead.
  • num_images_per_prompt (int, optional, defaults to 1) — Number of images to generate for each prompt.
  • eta (float, optional, defaults to 0.0) — Applies to DDIM scheduling. Controlled by the eta parameter from the related literature.
  • generator (torch.Generator or List[torch.Generator], optional) — A generator to ensure reproducibility in generation (see the seeded sketch after this list).
  • latents (torch.Tensor, optional) — Predefined latent tensors to condition generation.
  • prompt_embeds (torch.Tensor, optional) — Text embeddings for the prompts. Overrides prompt string inputs for more flexibility.
  • negative_prompt_embeds (torch.Tensor, optional) — Embeddings for negative prompts. Overrides string inputs if defined.
  • prompt_attention_mask (torch.Tensor, optional) — Attention mask for the primary prompt embeddings.
  • negative_prompt_attention_mask (torch.Tensor, optional) — Attention mask for negative prompt embeddings.
  • output_type (str, optional, defaults to "pil") — Format of the generated output, either as a PIL image or as a NumPy array.
  • return_dict (bool, optional, defaults to True) — If True, returns a structured output. Otherwise returns a simple tuple.
  • callback_on_step_end (Callable, optional) — Functions called at the end of each denoising step.
  • callback_on_step_end_tensor_inputs (List[str], optional) — Tensor names to be included in callback function calls.
  • guidance_rescale (float, optional, defaults to 0.0) — Adjusts noise levels based on guidance scale.
  • original_size (Tuple[int, int], optional, defaults to (1024, 1024)) — Original dimensions of the output.
  • target_size (Tuple[int, int], optional) — Desired output dimensions for calculations.
  • crops_coords_top_left (Tuple[int, int], optional, defaults to (0, 0)) — Coordinates for cropping.
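
For reproducible results, pass a seeded generator along with fixed sampling settings. This is a brief sketch that reuses the pipe object loaded in the example above; the seed value is arbitrary.

import torch

# A fixed seed makes repeated calls produce the same video
generator = torch.Generator(device="cuda").manual_seed(42)

video = pipe(
    prompt="A cat walks on the grass, realistic style.",
    negative_prompt="bad detailed",
    num_frames=49,
    num_inference_steps=50,
    guidance_scale=5.0,
    generator=generator,
).frames[0]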

encode_prompt


( prompt: typing.Union[str, typing.List[str]] num_images_per_prompt: int = 1 do_classifier_free_guidance: bool = True negative_prompt: typing.Union[str, typing.List[str], NoneType] = None prompt_embeds: typing.Optional[torch.Tensor] = None negative_prompt_embeds: typing.Optional[torch.Tensor] = None prompt_attention_mask: typing.Optional[torch.Tensor] = None negative_prompt_attention_mask: typing.Optional[torch.Tensor] = None device: typing.Optional[torch.device] = None dtype: typing.Optional[torch.dtype] = None max_sequence_length: int = 256 )

Parameters

  • prompt (str or List[str], optional) — prompt to be encoded
  • device (torch.device) — torch device
  • dtype (torch.dtype) — torch dtype
  • num_images_per_prompt (int) — number of images that should be generated per prompt
  • do_classifier_free_guidance (bool) — whether to use classifier free guidance or not
  • negative_prompt (str or List[str], optional) — The prompt or prompts not to guide the image generation. If not defined, one has to pass negative_prompt_embeds instead. Ignored when not using guidance (i.e., ignored if guidance_scale is less than 1).
  • prompt_embeds (torch.Tensor, optional) — Pre-generated text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, text embeddings will be generated from prompt input argument.
  • negative_prompt_embeds (torch.Tensor, optional) — Pre-generated negative text embeddings. Can be used to easily tweak text inputs, e.g. prompt weighting. If not provided, negative_prompt_embeds will be generated from negative_prompt input argument.
  • prompt_attention_mask (torch.Tensor, optional) — Attention mask for the prompt. Required when prompt_embeds is passed directly.
  • negative_prompt_attention_mask (torch.Tensor, optional) — Attention mask for the negative prompt. Required when negative_prompt_embeds is passed directly.
  • max_sequence_length (int, optional) — maximum sequence length to use for the prompt.

Encodes the prompt into text encoder hidden states.
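
The return values are not documented above. The sketch below assumes encode_prompt follows the common diffusers convention of returning (prompt_embeds, negative_prompt_embeds, prompt_attention_mask, negative_prompt_attention_mask); verify the order against the pipeline source before relying on it. It reuses the pipe object from the earlier example.

import torch

# Assumed return order; check the pipeline source if unsure
prompt_embeds, negative_prompt_embeds, prompt_attention_mask, negative_prompt_attention_mask = pipe.encode_prompt(
    prompt="A cat walks on the grass, realistic style.",
    negative_prompt="bad detailed",
    do_classifier_free_guidance=True,
    num_images_per_prompt=1,
    device=torch.device("cuda"),
    dtype=torch.float16,
)

# Pass the precomputed embeddings instead of raw prompt strings
video = pipe(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_prompt_embeds,
    prompt_attention_mask=prompt_attention_mask,
    negative_prompt_attention_mask=negative_prompt_attention_mask,
    num_frames=49,
    num_inference_steps=50,
).frames[0]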

EasyAnimatePipelineOutput

class diffusers.pipelines.easyanimate.pipeline_output.EasyAnimatePipelineOutput


( frames: Tensor )

Parameters

  • frames (torch.Tensor, np.ndarray, or List[List[PIL.Image.Image]]) — List of video outputs - It can be a nested list of length batch_size, with each sub-list containing denoised PIL image sequences of length num_frames. It can also be a NumPy array or Torch tensor of shape (batch_size, num_frames, channels, height, width).

Output class for EasyAnimate pipelines.
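
A short example of consuming this output object, reusing pipe and export_to_video from the examples above:

from diffusers.utils import export_to_video

output = pipe(prompt="A cat walks on the grass, realistic style.", num_frames=49)

# frames is batched over prompts; index 0 selects the first (and here only) video
video = output.frames[0]
export_to_video(video, "output.mp4", fps=8)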
