Are there any differences between the Full-model and TE-only versions?
Hello, thank you for sharing this amazing CLIP!
I compared the Full-model and TE-only model in Flux.1, and I noticed subtle differences in the illustrations.
I understand that the TE-only model is extracted from the Full-model, containing only the text encoder part. Are there any other differences?
Or does the part other than the text encoder also contribute to image generation?
For reference, I always use the LongCLIP-SAE-ViT-L-14-FP32 model.
I featured this model in my blog:
Is Your Illustration High-Quality? Comparing T5xxl and CLIP-L with Real Data! | AI Image Journey
Thank you!
That's very interesting, thank you for sharing these results! And, to be honest, it is quite unexpected. Both 'ViT-L-14-GmP-SAE-FULL-model.safetensors' and 'ViT-L-14-GmP-SAE-TE-only.safetensors' are partially converted to half precision (as per OpenAI's original CLIP code), so there should be no difference. What are you using to generate these? The only thing I can imagine is that the code somehow handles a full model differently than a TE-only model, i.e. as it extracts the TE from the full model.

Technically, you could use different outputs from the Text Encoder; SDXL famously switched from using the final layer output to the penultimate layer output. After the final layer, you can also apply a final layer norm, or not, or use the projection. This has led to issues in the past, because some CLIP models shipped for specific diffusion models simply lack the layers that particular guidance doesn't need, which then makes them incompatible with other diffusion models; see here for an example. For that very reason, even my "TE-only" models contain absolutely everything: the full Text Encoder up to and including the projection. So there should be no difference, as whatever you are using to generate the images should find the same "extraction point" for embeddings and doesn't have to fall back to another one due to something missing from "TE-only".
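To illustrate what I mean by different "extraction points" (a generic sketch using HuggingFace transformers; ComfyUI / Forge have their own code paths for this):

```python
# Sketch: different places you *could* take text embeddings from a CLIP Text Encoder.
import torch
from transformers import CLIPTokenizer, CLIPTextModelWithProjection

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
model = CLIPTextModelWithProjection.from_pretrained("openai/clip-vit-large-patch14")

tokens = tokenizer(["a photo of a cat"], padding="max_length", max_length=77, return_tensors="pt")
with torch.no_grad():
    out = model(**tokens, output_hidden_states=True)

final_hidden = out.hidden_states[-1]   # final layer output, before the final layer norm
penultimate  = out.hidden_states[-2]   # penultimate layer output (what SDXL switched to)
normed       = out.last_hidden_state   # final layer output with the final layer norm applied
projected    = out.text_embeds         # pooled (EOS) token passed through the text projection
```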
Judging from your (very nice! 👍) blog post, you're using ComfyUI or Forge; I'd open an issue on GitHub and ask about it there (feel free to include my response).
I'd also be curious whether there's a difference with the pickle files, which are not converted to HuggingFace format (as model.safetensors is). I've seen plenty of research model repos shared with the statement that "the safetensors will produce slightly different results than shown in the paper; to reproduce the results in the paper, please use the PyTorch model". Now, I don't expect you to download my pickles (after all, I am the only person who truly knows they are safe; you, on the other hand, can't be sure of that!). But if you did this in ComfyUI AND you'd be willing to share your full workflow, I'd be curious to try it myself!
Thank You for Your Reply
Thank you for your reply. I’m using ComfyUI.
First, I’ve attached the workflow I used for this comparison.
This PNG image includes the ComfyUI workflow.
Since I’m using the T5xxl in the Flan-T5xxl-FP32 format, I’ve configured ComfyUI to start with the --fp32-text-enc option.
Although the CLIP-SAE-ViT-L-14 model is processed in FP32 format here, its weights have already been rounded to FP16, so that alone shouldn’t cause any difference in the results.
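(Just to show how one could verify that assumption; a minimal sketch, and the filename is a placeholder for whichever checkpoint you want to test:)

```python
# Sketch: check whether the FP32-stored weights are already exactly representable in FP16,
# i.e. whether they survive an FP32 -> FP16 -> FP32 round trip unchanged.
import torch
from safetensors.torch import load_file

state_dict = load_file("ViT-L-14-GmP-SAE-FULL-model.safetensors")  # placeholder filename

already_fp16_exact = True
for name, tensor in state_dict.items():
    if tensor.dtype == torch.float32 and not torch.equal(tensor, tensor.half().float()):
        already_fp16_exact = False
        print(f"{name} changes under an FP16 round trip")

print("All FP32 tensors FP16-exact:", already_fp16_exact)
```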
Flux.1[dev]
The phenomenon might be specific to the anime model I’m using, blue_pencil-flux1_v0.0.1-FP16. To check, I also compared the results with the original Flux.1[dev] model.
The workflow and conditions, except for the model, were exactly the same as before, including the prompts.
The results were still different.
Flux.1[schnell]
Additionally, I conducted a comparison using Flux.1[schnell].
Once again, the results were different. Upon closer inspection, the details generated with the ViT-L-14-GmP-SAE-FULL-model seem more refined, and the overall quality appears higher!
Original CLIP-L
Finally, I wondered whether the same phenomenon occurs with the original CLIP-L.
- openai/clip-vit-large-patch14/model.safetensors (FP32)
- comfyanonymous/flux_text_encoders/clip_l.safetensors (FP16)
For this test, I configured ComfyUI to start with the --fp16-text-enc option so both are processed in FP16 format.
Here, the results were completely identical!
Observations
Your CLIP may separate the text encoder differently from how ComfyUI separates it.
Also, I’m just a hobbyist image generation user, not a technical expert, so it’s possible I’m misunderstanding something. However, if your CLIP creates some form of crosstalk between the text encoder and the vision transformer, I find that concept quite fascinating.
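For illustration, "separating the text encoder" at the file level might look something like this (a purely hypothetical sketch of dropping the vision tower from a full checkpoint; the actual key prefixes depend on whether the file is in OpenAI or HuggingFace layout, and ComfyUI's own extraction logic will differ):

```python
# Hypothetical sketch: build a TE-only checkpoint by dropping the vision tower
# from a full CLIP checkpoint. Filenames and key prefixes are assumptions.
from safetensors.torch import load_file, save_file

full = load_file("ViT-L-14-GmP-SAE-FULL-model.safetensors")

# OpenAI layout keeps vision weights under "visual."; HuggingFace under "vision_model." / "visual_projection"
VISION_PREFIXES = ("visual.", "vision_model.", "visual_projection")
te_only = {k: v for k, v in full.items() if not k.startswith(VISION_PREFIXES)}

save_file(te_only, "TE-only-extracted.safetensors")  # hypothetical output name
```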
Thank you again for sharing this incredible CLIP, and also for providing it in FP32 format!
Additional Comments
I asked ChatGPT about this.
Is Crosstalk Between the Text Encoder and Vision Transformer in CLIP Theoretically Possible?
CLIP’s design separates the Text Encoder and Vision Transformer as independent modules. However, there are theoretical possibilities for "crosstalk" between them due to the following reasons:
1. Shared Embedding Space
- CLIP maps both text and images into a shared embedding space (latent space).
- Since this space is shared, optimization of one module during training could indirectly influence the other.
2. Gradient Propagation During Training
- While the Text Encoder and Vision Transformer process separate data, the contrastive loss function aligns their outputs in the shared embedding space.
- This process might induce a form of interaction (crosstalk) between the two modules.
3. Effects of Layer Regularization
- Some implementations of CLIP employ layer normalization or weight sharing across modules.
- These mechanisms could lead to unintended interactions between the Text Encoder and Vision Transformer.
4. Customization During Fine-Tuning
- During fine-tuning for specific tasks, the Text Encoder and Vision Transformer may exhibit behaviors resembling crosstalk.
- This is especially true when adapting to specific datasets or optimization objectives.
Conclusion
While CLIP’s architecture does not explicitly design for crosstalk, indirect interactions can theoretically occur through the shared embedding space and training dynamics. Customized CLIP models or task-specific fine-tuning could make such effects more noticeable.
Wah, I didn't have time to look at this during weekdays and then nearly forgot - sorry about that! /o
Thank you for sharing the workflow - indeed, it has the exif data / the workflow. Good to know HF doesn't strip exif data from uploaded images. :)
Your results are indeed most interesting. Perhaps it is because of how I fine-tune CLIP? CLIP has linear .weight layers - matrices of weights. To fine-tune and achieve the improved benchmark results using just 1 GPU, I actually "split up" these weight vectors into direction and magnitude and optimize them separately - which is the (not-so-secret, open-source) 'secret sauce' of my model. This stabilizes training with small batch sizes.

This causes a slight numerical instability: it introduces non-determinism in things that should be deterministic. The GPU is partly to blame for that as well (inherent non-determinism due to parallelism), and the backend (the CUDA kernels) is not optimized to guarantee determinism for a .weight matrix that is stored as its separate components. A model with Geometric Parametrization will only be deterministic on CPU. But after putting the components back together into .weight, I observed that (post-training) determinism is restored for the benchmark / testing code.
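Roughly, the idea looks like this (a much-simplified sketch of the parametrization, not my actual implementation):

```python
# Simplified sketch of Geometric Parametrization (GmP): a linear layer's .weight matrix
# is stored as a per-row magnitude 'r' and a direction 'theta', optimized as separate
# parameters, and recombined into an ordinary .weight matrix after training.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricLinear(nn.Module):
    def __init__(self, weight, bias=None):
        super().__init__()
        self.r = nn.Parameter(weight.norm(dim=1, keepdim=True))   # magnitude per output row
        self.theta = nn.Parameter(weight.clone())                 # direction (unnormalized)
        self.bias = nn.Parameter(bias.clone()) if bias is not None else None

    def recombined_weight(self):
        # Put the components back together into a normal .weight matrix
        return self.r * self.theta / self.theta.norm(dim=1, keepdim=True)

    def forward(self, x):
        return F.linear(x, self.recombined_weight(), self.bias)
```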
I'm curious whether what you observed is a residual side effect of that. It could also be an instability in the diffusion model itself; it is aligned to exactly CLIP-L's embeddings (the original OpenAI model's embeddings).
PS: What ChatGPT told you doesn't apply when you use CLIP as the Text Encoder for a diffusion model, as it's just the Text Encoder (not the Vision Encoder) in that scenario.
However, you could indeed imagine the Text Encoder as one person and the Vision Encoder as a second person. One is holding a long broomstick, the other is holding a small ring, and they're leaning over a giant gaping fault line (the modality gap) and cannot touch or reach each other - but their task is to stick the broomstick into the ring. There's instability in that, and it's hard to get it just right, but that's basically what CLIP is doing (albeit my mathematics professor would be very angry with me for this oversimplification, haha): aligning image and text so they meet inside this "gap" and thus end up being similar (dot product). So an image influences the text and a text influences the image. Kind of a "cross-talk", if you will.
In fact, CLIP's contrastive loss is a modified version of cross-entropy loss. So, it is a form of "cross-talk loss".
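If you're curious, it looks roughly like this (a generic sketch of the standard CLIP-style contrastive loss, not copied from any particular repo):

```python
# Sketch of CLIP's contrastive loss: cross-entropy over the image-text similarity
# matrix, in both directions (image -> text and text -> image).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, logit_scale):
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Similarity matrix: every image against every text in the batch
    logits_per_image = logit_scale * image_embeds @ text_embeds.t()
    logits_per_text = logits_per_image.t()

    # The matching pairs sit on the diagonal
    labels = torch.arange(image_embeds.size(0), device=image_embeds.device)

    loss_i = F.cross_entropy(logits_per_image, labels)
    loss_t = F.cross_entropy(logits_per_text, labels)
    return (loss_i + loss_t) / 2
```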
When I saw just how large the models you are using are, I decided to try flux.1-dev first, as that's what I have. Nothing. No diff. Until I used a quantized flux - and there, indeed:
So, after this initial test, it seems to have to do with quantization of the diffusion model, I guess? The inaccuracies introduced by quantization, on top of being slightly 'misaligned' to the fine-tuned CLIP (because the diffusion model wasn't trained on it), might then be just enough to tip some numbers over and cause a slightly different result?
Lots of question marks, because this is not a statement, but a speculation. Fascinating, indeed!
Thanks for taking the time to reply!
Honestly, I was really surprised to hear that using --fp32-text-enc could cause differences, considering I thought FP32 could fully encompass all FP16 values.
Let me ask a few questions first to make sure I understand your setup.
Questions
1. The flux.1-dev you used - is it the model available from Black Forest Labs on Hugging Face?
Specifically:
- FP16 Transformer
- FP16 CLIP-L
- FP16 t5xxl_v1.1
- FP32 ae.safetensors (which ComfyUI typically processes as BF16)
Does that sound correct?
2. When you first measured no differences, were you swapping out CLIP-L in the above setup for:
- ViT-L-14-GmP-SAE-FULL-model.safetensors
- ViT-L-14-GmP-SAE-TE-only.safetensors
and running generation with --fp16-text-enc (or no specific flag)? Do I have that right?
3. When you say the quantized Flux model showed differences, does that mean you ran the same test as in #2, but with --fp32-text-enc, and observed the differences in the images you shared?
If I misunderstood anything, let me know!
Additional Tests I Ran
I also ran a couple of tests to dig into this further.
1. Running the Original Workflow with FP16
First, I ran the workflow I initially shared with you using --fp16-text-enc.
This time, the two images were identical, meaning I was able to reproduce your results.
2. Comparing the Original CLIP-L in FP32
I took the original CLIP-L:
And converted it to FP16 using simple Python code. Then, I compared it with:
Using --fp32-text-enc. The result? They were completely identical.
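For reference, a straightforward conversion of that sort might look like this (a sketch; the filenames are placeholders and the exact script I used may have differed slightly):

```python
# Sketch of a plain FP32 -> FP16 conversion of a safetensors checkpoint.
# Non-floating-point tensors (e.g. position_ids) are kept as-is.
from safetensors.torch import load_file, save_file

state_dict = load_file("clip-l-original-fp32.safetensors")          # placeholder input
fp16 = {k: (v.half() if v.is_floating_point() else v) for k, v in state_dict.items()}
save_file(fp16, "clip-l-converted-fp16.safetensors")                # placeholder output
```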
From this, I concluded two things:
- The CLIP-L distributed on ComfyUI's page is likely just a straightforward FP16 conversion of the original CLIP-L with the text encoder portion extracted.
- For the original CLIP-L, running inference in FP32 doesn’t seem to produce differences based on the presence or absence of the vision encoder.
Questions About Your CLIP
Your fine-tuned CLIP is fascinating! Here’s what I currently understand:
- You fine-tuned the model.safetensors of ViT-L-14-GmP-SAE in FP32.
- When converting this to ViT-L-14-GmP-SAE-FULL-model.safetensors in FP16, you may have applied some optimization for FP16 rather than just a straightforward conversion.

If that’s the case, it might explain why running --fp32-text-enc on ViT-L-14-GmP-SAE-FULL-model.safetensors could behave differently compared to the original model.safetensors.

However, even if that’s true, I’m still curious why differences would arise when comparing ViT-L-14-GmP-SAE-FULL-model.safetensors and ViT-L-14-GmP-SAE-TE-only.safetensors with --fp32-text-enc.
Thanks for answering all these questions—I really appreciate it! Feel free to reply whenever you have time.
- t5xxl_fp8_e4m3fn.safetensors : https://huggingface.co./comfyanonymous/flux_text_encoders/blob/main/t5xxl_fp8_e4m3fn.safetensors
- VAE, flux-1-dev: yes, from Black Forest Labs
- My CLIP (like you mentioned)
- Quantization: The setting in the loader node in ComfyUI (instead of "default")
- No args, just "python main.py"
And yes, the model is set to FP32 during fine-tuning, although I let the CUDA algorithms decide with 'automatic mixed precision' (they usually do a good job, though). I'm using OpenAI's CLIP ViT-L/14 with the original OpenAI code, i.e. "import clip" (not HuggingFace transformers).
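In PyTorch terms, the AMP part of a fine-tuning step looks roughly like this (a generic sketch, not my literal training loop; see the repo linked below for the actual code):

```python
# Generic sketch of one fine-tuning step with PyTorch automatic mixed precision (AMP):
# the weights stay FP32, autocast lets CUDA run eligible ops in lower precision, and
# GradScaler prevents FP16 gradient underflow.
import torch

def amp_train_step(model, optimizer, scaler, images, texts, loss_fn):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        image_features = model.encode_image(images)  # OpenAI "import clip" model API
        text_features = model.encode_text(texts)
        loss = loss_fn(image_features, text_features)
    scaler.scale(loss).backward()
    scaler.step(optimizer)   # unscales gradients, then applies the FP32 optimizer step
    scaler.update()
    return loss.item()

# The scaler is created once, outside the training loop:
# scaler = torch.cuda.amp.GradScaler()
```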
I have saved the model in full precision, as a PyTorch pickle .pt (see my fine-tuning code here: https://github.com/zer0int/CLIP-fine-tune ), and used that original precision (as saved from fine-tuning) for the largest model .safetensors, albeit converted from .pt to .safetensors via the HuggingFace script:
https://github.com/huggingface/transformers/blob/main/src/transformers/models/clip/convert_clip_original_pytorch_to_hf.py
For the smaller versions, I have applied the fp16 conversion as defined in OpenAI's original code -> still saving it as a pickle / .pt in the original format used by OpenAI, and THEN converted it to .safetensors.
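For illustration, that partial fp16 conversion is roughly the following (a sketch; the filenames are placeholders, and it assumes the fine-tuned weights are a standard CLIP state_dict with the GmP components already merged back into .weight):

```python
# Sketch: apply OpenAI's partial fp16 conversion to a fine-tuned checkpoint and save it
# again as a .pt pickle. convert_weights() only converts applicable layers (Linear / Conv /
# attention), which is why the result is only "partially" half precision.
import torch
import clip
from clip.model import convert_weights

model, _ = clip.load("ViT-L/14", device="cpu", jit=False)
model.load_state_dict(torch.load("finetuned-ViT-L-14-fp32.pt", map_location="cpu"))  # placeholder

convert_weights(model)  # OpenAI's in-place conversion of applicable layers to fp16
torch.save(model.state_dict(), "finetuned-ViT-L-14-fp16.pt")                         # placeholder
```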
I still have no idea what is happening / why this is happening, but I hope this helps! =)