different number of layers in the openclip text encoder
I realize the number of layers in the text encoder of SD2.1 is different from the text encoder in `laion/CLIP-ViT-H-14-laion2B-s32B-b79K`: in SD2.1 the number of layers is 23, while in `laion/CLIP-ViT-H-14-laion2B-s32B-b79K` it is 24. Just wondering if there is any impact on the generation quality/text understanding.
According to your comment, `laion/CLIP-ViT-H-14-laion2B-s32B-b79K` has one more layer than SD 2.1. Is the additional layer a projection layer that just modifies the output dimension? Thank you! :)
Hi,
I have encountered the same problems. According to my test, SD2.1 indeed uses `laion/CLIP-ViT-H-14-laion2B-s32B-b79K` as its CLIP model, but for the text encoder, the differences are:

- SD2.1 uses only the first 23 `CLIPEncoderLayer`s (while the original has 24 layers), and the last `CLIPEncoderLayer` is not used, which is really weird;
- SD2.1 uses only the `input_ids` and does not use the `attention_mask` created by the tokenizer;
- the tokenizer of SD2.1 places only ONE `<|eos|>` token (id: 49407) at the end of the sentence and fills the rest with ZERO. For example, the tokenized sequence (`input_ids`) of the sentence `"a cow"` will look like this:

```
tensor([[49406, 320, 9706, 49407, 0, 0, ..., 0]])
```

In contrast, the original OpenCLIP model's tokenizer pads the sequence with `<|eos|>` tokens, so the tokenized output for the same sentence will be:

```
tensor([[49406, 320, 9706, 49407, 49407, 49407, ..., 49407]])
```

Hope this information is helpful to you!
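To make the padding difference concrete, here is a minimal illustrative sketch of the two conventions, using the token ids from the example above (49406 = `<|bos|>`, 49407 = `<|eos|>`, and `[320, 9706]` assumed to be the encoding of `"a cow"`); this mimics the behavior rather than calling the actual tokenizers:

```python
# Illustrative sketch of the two padding conventions (not the real tokenizers).
# Token ids taken from the example above: 49406 = <|bos|>, 49407 = <|eos|>.
BOS, EOS, PAD, MAX_LEN = 49406, 49407, 0, 77

def pad_sd21_style(token_ids):
    """SD2.1 (HF CLIPTokenizer) style: one <|eos|>, then zero padding."""
    seq = [BOS] + token_ids + [EOS]
    return seq + [PAD] * (MAX_LEN - len(seq))

def pad_openclip_style(token_ids):
    """Original OpenCLIP style: pad the remainder with <|eos|> tokens."""
    seq = [BOS] + token_ids + [EOS]
    return seq + [EOS] * (MAX_LEN - len(seq))

a_cow = [320, 9706]  # assumed encoding of "a cow"
print(pad_sd21_style(a_cow)[:6])      # [49406, 320, 9706, 49407, 0, 0]
print(pad_openclip_style(a_cow)[:6])  # [49406, 320, 9706, 49407, 49407, 49407]
```

Since SD2.1 ignores the `attention_mask`, the zero padding is what the text encoder actually attends over, which is why the padding convention matters at all.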
@SnowflakeWang No, it is a transformer layer as @NCC79601 mentioned. Thank you for the analysis!!
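The 23-vs-24 difference can be sketched with randomly initialized configs (no checkpoint download), assuming the Hugging Face `transformers` package; the default `CLIPTextConfig` dimensions here are placeholders, not the ViT-H/14 text-tower sizes:

```python
# Sketch of the depth difference only; configs use default (placeholder) widths.
from transformers import CLIPTextConfig, CLIPTextModel

full = CLIPTextModel(CLIPTextConfig(num_hidden_layers=24))  # OpenCLIP ViT-H text tower depth
sd21 = CLIPTextModel(CLIPTextConfig(num_hidden_layers=23))  # depth SD2.1 actually runs

print(len(full.text_model.encoder.layers))  # 24
print(len(sd21.text_model.encoder.layers))  # 23
```

In practice the SD2.1 checkpoint simply drops the 24th transformer layer, so the last layer it executes corresponds to the penultimate layer of the original OpenCLIP model.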