Size mismatch when preparing latents.
#14 · opened by madousho
I'm encountering a RuntimeError caused by a size mismatch when using the diffusers pipeline_cogvideox_image2video pipeline.
```python
import torch
from diffusers import CogVideoXImageToVideoPipeline

# model_path, prompt, image, num_inference_steps and num_frames are defined elsewhere in my script
pipe = CogVideoXImageToVideoPipeline.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
)
pipe.vae.enable_tiling()
pipe.vae.enable_slicing()
pipe.to("cuda")

video = pipe(
    prompt=prompt,
    image=image,
    num_videos_per_prompt=1,
    num_inference_steps=num_inference_steps,
    num_frames=num_frames,
    guidance_scale=6,
    generator=torch.Generator(device="cuda").manual_seed(42),
).frames[0]
```
While trying to understand this, I found that:

```
pipe.transformer.config.in_channels -> 16
pipe.vae.config.out_channels -> 3
```

Is there an obvious fix I might be missing? And more generally, how should I reason about this kind of shape mismatch?
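To make the comparison concrete, this is the kind of config check I'm reasoning from (a minimal sketch based on my reading of the pipeline code; the assumption that `in_channels` should be twice the VAE's `latent_channels` for the image-to-video pipeline is mine, not something I've confirmed):

```python
# Compare the channel counts the pipeline has to reconcile.
# prepare_latents is called with latent_channels = transformer.config.in_channels // 2,
# while the encoded image latents come out of the VAE with vae.config.latent_channels
# channels, so (I assume) the two values need to agree for the concatenation to work.
transformer_in = pipe.transformer.config.in_channels
vae_latents = pipe.vae.config.latent_channels

print(f"transformer.config.in_channels = {transformer_in}")
print(f"vae.config.latent_channels     = {vae_latents}")
print(f"in_channels // 2               = {transformer_in // 2}")

if transformer_in // 2 != vae_latents:
    print("Mismatch: the zero padding will have a different channel count than the image latents.")
```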
My library versions are:
- diffusers: 0.33.0.dev0
- torch: 2.5.1+cu121
Thanks!
Leaving my traceback below just in case:
```
File ~/.cache/pypoetry/virtualenvs/studio-v_PmMSSG-py3.12/lib/python3.12/site-packages/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py:789, in CogVideoXImageToVideoPipeline.__call__(self, image, prompt, negative_prompt, height, width, num_frames, num_inference_steps, timesteps, guidance_scale, use_dynamic_cfg, num_videos_per_prompt, eta, generator, latents, prompt_embeds, negative_prompt_embeds, output_type, return_dict, attention_kwargs, callback_on_step_end, callback_on_step_end_tensor_inputs, max_sequence_length)
787 latent_channels = self.transformer.config.in_channels // 2
788 print(f"[DEBUG-2] {latent_channels}")
--> 789 latents, image_latents = self.prepare_latents(
790 image,
791 batch_size * num_videos_per_prompt,
792 latent_channels,
793 num_frames,
794 height,
795 width,
796 prompt_embeds.dtype,
797 device,
798 generator,
799 latents,
800 )
802 # 6. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
803 extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta)
File ~/.cache/pypoetry/virtualenvs/studio-v_PmMSSG-py3.12/lib/python3.12/site-packages/diffusers/pipelines/cogvideo/pipeline_cogvideox_image2video.py:407, in CogVideoXImageToVideoPipeline.prepare_latents(self, image, batch_size, num_channels_latents, num_frames, height, width, dtype, device, generator, latents)
398 padding_shape = (
399 batch_size,
400 num_frames - 1,
(...)
403 width // self.vae_scale_factor_spatial,
404 )
406 latent_padding = torch.zeros(padding_shape, device=device, dtype=dtype)
--> 407 image_latents = torch.cat([image_latents, latent_padding], dim=1)
409 # Select the first frame along the second dimension
410 if self.transformer.config.patch_size_t is not None:
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 16 but got size 8 for tensor number 1 in the list.
```
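If I'm reading `prepare_latents` correctly, the failure reduces to the shape clash below (a simplified, shape-only reproduction with illustrative sizes, not the actual diffusers code): the zero padding is built with `transformer.config.in_channels // 2` channels, while `image_latents` carry the VAE's `latent_channels`, and `torch.cat` along the frame dimension requires every other dimension to match.

```python
import torch

# Shape-only reproduction of the failing concatenation (illustrative sizes).
# Layout assumed here: (batch, frames, channels, height, width).
batch_size, num_latent_frames, h, w = 1, 13, 60, 90

vae_latent_channels = 16      # channels produced by the VAE encoder in my run
latent_channels = 16 // 2     # transformer.config.in_channels // 2 in my run

image_latents = torch.zeros(batch_size, 1, vae_latent_channels, h, w)
latent_padding = torch.zeros(batch_size, num_latent_frames - 1, latent_channels, h, w)

# Raises: "Sizes of tensors must match except in dimension 1.
#          Expected size 16 but got size 8 for tensor number 1 in the list."
torch.cat([image_latents, latent_padding], dim=1)
```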