Size mismatch when preparing latents.

#14
by madousho - opened

I'm encountering a RuntimeError about mismatched tensor sizes while using the diffusers pipeline_cogvideox_image2video pipeline.

import torch
from diffusers import CogVideoXImageToVideoPipeline

# model_path, prompt, image, num_inference_steps and num_frames are defined elsewhere in my script
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
)
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()
pipe.to("cuda")

video = pipe(
    prompt=prompt,
    image=image,
    num_videos_per_prompt=1,
    num_inference_steps=num_inference_steps,
    num_frames=num_frames,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]

While trying to understand this, I found that:

pipe.transformer.config.in_channels -> 16
pipe.vae.config.out_channels -> 3

Is there a fix I might be missing? And how should I go about reasoning through an error like this?
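
For anyone hitting the same thing, here is the quick check I'm doing to surface the mismatch: the I2V pipeline derives its latent channel count from the transformer config (see line 787 in the traceback below), while the image latents it concatenates come from the VAE, whose latent channel count should be in vae.config.latent_channels (assuming I'm reading the AutoencoderKLCogVideoX config correctly).

# Sketch of the check; `latent_channels` mirrors the computation
# the pipeline performs right before calling prepare_latents.
latent_channels = pipe.transformer.config.in_channels // 2  # -> 8 here
vae_latent_channels = pipe.vae.config.latent_channels       # -> 16, if I read the VAE config right

if latent_channels != vae_latent_channels:
    print(
        f"transformer expects {latent_channels} latent channels, "
        f"but the VAE produces {vae_latent_channels}"
    )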

My library versions are:

diffusers version: 0.33.0.dev0
torch version: 2.5.1+cu121

Thanks!

Leaving my traceback just in case:

File ~/.cache/pypoetry/virtualenvs/studio-v_PmMSSG-py3.12/lib/python3.12/site-packages/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py:789, in CogVideoXImageToVideoPipeline.__call__(self, image, prompt, negative_prompt, height, width, num_frames, num_inference_steps, timesteps, guidance_scale, use_dynamic_cfg, num_videos_per_prompt, eta, generator, latents, prompt_embeds, negative_prompt_embeds, output_type, return_dict, attention_kwargs, callback_on_step_end, callback_on_step_end_tensor_inputs, max_sequence_length)
    787 latent_channels = self.transformer.config.in_channels // 2
    788 print(f"[DEBUG-2] {latent_channels}")
--> 789 latents, image_latents = self.prepare_latents(
    790     image,
    791     batch_size * num_videos_per_prompt,
    792     latent_channels,
    793     num_frames,
    794     height,
    795     width,
    796     prompt_embeds.dtype,
    797     device,
    798     generator,
    799     latents,
    800 )
    802 # 6. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
    803 extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta)

File ~/.cache/pypoetry/virtualenvs/studio-v_PmMSSG-py3.12/lib/python3.12/site-packages/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py:407, in CogVideoXImageToVideoPipeline.prepare_latents(self, image, batch_size, num_channels_latents, num_frames, height, width, dtype, device, generator, latents)
    398 padding_shape = (
    399     batch_size,
    400     num_frames - 1,
   (...)
    403     width // self.vae_scale_factor_spatial,
    404 )
    406 latent_padding = torch.zeros(padding_shape, device=device, dtype=dtype)
--> 407 image_latents = torch.cat([image_latents, latent_padding], dim=1)
    409 # Select the first frame along the second dimension
    410 if self.transformer.config.patch_size_t is not None:

RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 16 but got size 8 for tensor number 1 in the list.
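
The failing torch.cat can be reproduced in isolation. The shapes below are made up except for the channel counts (16 from the encoded image, 8 from in_channels // 2), which are what seem to trigger the error for me:

import torch

# image_latents: (batch, frames, channels, height, width) as the pipeline lays them out;
# the channel count (16) comes from the VAE, everything else is an illustrative guess
image_latents = torch.zeros(1, 1, 16, 60, 90)
# latent_padding is built with num_channels_latents = transformer in_channels // 2 = 8
latent_padding = torch.zeros(1, 12, 8, 60, 90)

# Raises: RuntimeError: Sizes of tensors must match except in dimension 1.
# Expected size 16 but got size 8 for tensor number 1 in the list.
torch.cat([image_latents, latent_padding], dim=1)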
