There does not seem to be any support for pre-training.
When I try, there seems to be some instability with the Connector. How did you initialize your weights?
can you say more about the instability you are seeing?
our initialization scheme for newly initialized parameters is rather standard. the code snippet below should give you a good idea:
if isinstance(module, MLP):
    for sub_module_name, sub_module in module.named_modules():
        if isinstance(sub_module, nn.Linear):
            factor = 1.0
            if "down_proj" in sub_module_name:
                factor = 2.0
            init_a_linear(sub_module, std=(0.4 / (self.config.hidden_size * factor)) ** 0.5)
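To make the scale of that initialization concrete, here is a small sketch that only evaluates the std formula from the snippet above. The helper name `connector_init_std` and the value `hidden_size=4096` are assumptions chosen for illustration; they are not taken from the thread or the library.

```python
def connector_init_std(hidden_size: int, is_down_proj: bool) -> float:
    """Std used for a Linear layer in the MLP, following the formula in the snippet above."""
    # "down_proj" layers get factor 2.0, every other Linear layer gets 1.0.
    factor = 2.0 if is_down_proj else 1.0
    return (0.4 / (hidden_size * factor)) ** 0.5


hidden_size = 4096  # assumed value, for illustration only
std_default = connector_init_std(hidden_size, is_down_proj=False)
std_down = connector_init_std(hidden_size, is_down_proj=True)

print(f"std for gate/up projections: {std_default:.6f}")
print(f"std for down_proj:           {std_down:.6f}")
```

Because the factor sits inside the square root, "down_proj" ends up with a std that is smaller by a factor of sqrt(2), not 2.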
Hi Victor,
Thank you for your response!
What I am seeing is that the loss initially decreases, but then NaNs are detected after the "connector" (MLP + Perceiver Pooler). I have tried xavier_uniform_ and kaiming_uniform_ for all the connector weights, but without success.
I have tried the obvious: varying the batch size (from 2 to 1000) and the learning rate (from 1e-3 down to 1e-6).
It is extremely regular: for a given batch size it seems to happen at the same iteration, no matter the learning rate. The only time it does not occur is with a batch size of 1.
Have you ever experienced something similar, and if so, how did you debug it?
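One generic way to narrow down where NaNs first appear is to register forward hooks that record every module whose output contains a non-finite value. The sketch below is only an illustration of that idea on a toy module: `attach_nonfinite_hooks` and `ToyMLP` are hypothetical names, and the toy is a stand-in, not the actual connector.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def attach_nonfinite_hooks(model: nn.Module, report: list) -> list:
    """Record the name of every module whose tensor output has NaN/Inf values."""
    handles = []
    for name, module in model.named_modules():
        # Bind `name` as a default argument so each hook keeps its own module name.
        def hook(mod, inputs, output, name=name):
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                report.append(name)
        handles.append(module.register_forward_hook(hook))
    return handles


class ToyMLP(nn.Module):
    """Hypothetical stand-in for an MLP with gate_proj and down_proj layers."""

    def __init__(self, dim: int = 8):
        super().__init__()
        self.gate_proj = nn.Linear(dim, dim, bias=False)
        self.down_proj = nn.Linear(dim, dim, bias=False)

    def forward(self, x):
        return self.down_proj(F.silu(self.gate_proj(x)))


toy = ToyMLP()
with torch.no_grad():
    toy.gate_proj.weight.fill_(float("inf"))  # deliberately break one layer

report = []
handles = attach_nonfinite_hooks(toy, report)
toy(torch.ones(2, 8))
for handle in handles:
    handle.remove()  # always remove hooks after debugging

# Hooks fire in execution order, so the first entry is the earliest offender.
first_bad = report[0] if report else None
print("first module with non-finite output:", first_bad)
```

The same hooks can be attached to a real model for a single forward pass on the failing batch. For the backward pass, `torch.autograd.set_detect_anomaly(True)` is the usual companion tool, at a noticeable speed cost.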
indeed, NaNs are never a good sign...
before I answer, a few questions:
- are you fine-tuning or training from scratch?
- what data?
- mixed precision? what precision?
- is it specifically after the connector? any details as to where in the connector?
Hi @VictorSanh,
- I am training from scratch
- LLaVA 1.5
- BF16
- It is usually in the MLP of the Idefics2PerceiverLayer, usually after "gate_proj", very rarely after "down_proj".
I tried your initialization code, increasing the batch size to 4096 and reducing the learning rate to 1e-6, but with no luck. When investigating the issue further, I noticed that the 'latents' remain all-ones even after a few hundred iterations of training. I am sure that the parameters are added to the optimizer. I tried randomly initializing them instead, but that did not solve the issue.
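A quick sanity check for parameters that never move is to confirm three things separately: the parameter is in the optimizer, it receives a non-zero gradient, and its value changes after a step. The sketch below does this on a toy pooler; `ToyPooler` is a hypothetical stand-in, not the actual Perceiver implementation. The last lines illustrate one possible explanation, under the assumption that the parameters themselves are stored in BF16 rather than as FP32 master weights: near 1.0, BF16 cannot represent a change as small as 1e-4, so such an update is rounded away.

```python
import torch
import torch.nn as nn


class ToyPooler(nn.Module):
    """Hypothetical stand-in: learned latents attend over an input sequence."""

    def __init__(self, n_latents: int = 4, dim: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.ones(n_latents, dim))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (batch, seq, dim)
        attn = torch.softmax(self.latents @ x.transpose(-1, -2), dim=-1)
        return self.proj(attn @ x)  # (batch, n_latents, dim)


torch.manual_seed(0)
model = ToyPooler()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# 1) Is the parameter really in the optimizer?
in_optimizer = any(
    p is model.latents for group in optimizer.param_groups for p in group["params"]
)

# 2) Does it get a gradient, and 3) does its value change after one step?
before = model.latents.detach().clone()
loss = model(torch.randn(2, 5, 8)).pow(2).mean()
loss.backward()
grad_norm = model.latents.grad.norm().item()
optimizer.step()
max_change = (model.latents.detach() - before).abs().max().item()
print(f"in optimizer: {in_optimizer}, grad norm: {grad_norm:.3e}, max change: {max_change:.3e}")

# Assumed scenario: weights stored directly in BF16. A tiny update to a
# value of 1.0 is lost to rounding, so the tensor still compares equal to ones.
bf16_ones = torch.ones(4, dtype=torch.bfloat16)
update_lost = torch.equal(bf16_ones - 1e-4, bf16_ones)
print("small update lost in bf16:", update_lost)
```

If the gradient is non-zero but the values never change, the precision of the stored weights is worth checking before anything else. If the gradient is `None` or zero, the problem lies upstream in the graph instead.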