
Different number of layers in the OpenCLIP text encoder

#20
by kamwoh - opened

I noticed that the number of layers in the SD2.1 text encoder differs from that of the text encoder in laion/CLIP-ViT-H-14-laion2B-s32B-b79K.

In SD2.1 the number of layers is 23, while in laion/CLIP-ViT-H-14-laion2B-s32B-b79K it is 24. Just wondering whether this has any impact on generation quality/text understanding.


According to your comment, laion/CLIP-ViT-H-14-laion2B-s32B-b79K has one more layer than SD 2.1. Is the additional layer a projection layer that just modifies the output dimension? Thank you! :)

Hi,
I have encountered the same issue. According to my tests, SD2.1 does indeed use laion/CLIP-ViT-H-14-laion2B-s32B-b79K as its CLIP model, but for the text encoder the differences are:

  1. SD2.1 uses only the first 23 CLIPEncoderLayers (the original has 24), so the last CLIPEncoderLayer is unused, which is really weird;
  2. SD2.1 uses only the input_ids and ignores the attention_mask created by the tokenizer;
  3. the tokenizer of SD2.1 places only ONE <|eos|> token (id: 49407) at the end of the sentence and pads the rest with ZEROS. For example, the tokenized sequence (input_ids) for the sentence "a cow" looks like this:
tensor([[49406, 320, 9706, 49407, 0, 0, ..., 0]])

In contrast, the original OpenCLIP model's tokenizer pads the sequence with <|eos|> tokens. So, the tokenized output for the same sentence will be:

tensor([[49406, 320, 9706, 49407, 49407, 49407, ..., 49407]])
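The two padding behaviors described above can be sketched in plain Python. The helper names below are hypothetical (they are not part of any library API); the token ids come from the "a cow" example, where 49406 is <|sos|> and 49407 is <|eos|>:

```python
EOS_ID = 49407   # <|endoftext|> token id in the CLIP vocabulary
MAX_LEN = 77     # CLIP's fixed context length

def pad_sd21_style(input_ids, max_len=MAX_LEN):
    """SD2.1-style padding: a single <|eos|>, then zeros (hypothetical helper)."""
    return input_ids + [0] * (max_len - len(input_ids))

def pad_openclip_style(input_ids, max_len=MAX_LEN):
    """Original OpenCLIP-style padding: repeat <|eos|> up to max_len."""
    return input_ids + [EOS_ID] * (max_len - len(input_ids))

ids = [49406, 320, 9706, 49407]  # "a cow" with <|sos|> and <|eos|> already attached
print(pad_sd21_style(ids)[:8])      # [49406, 320, 9706, 49407, 0, 0, 0, 0]
print(pad_openclip_style(ids)[:8])  # [49406, 320, 9706, 49407, 49407, 49407, 49407, 49407]
```

Both sequences end up with length 77; only the padding token differs, which is why feeding SD2.1's zero-padded input_ids into the original OpenCLIP encoder (or vice versa) produces different hidden states.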

Hope this information is helpful to you!



@SnowflakeWang No, it is a regular transformer layer, as @NCC79601 mentioned. Thank you for the analysis!!
