WOW

#1
by nicolollo - opened

wow ...what the f. did you train this on? The captions are slightly shorter, but they have way less hallucination 0.o. Is this a fine-tune or a pretraining? And this is also 224x224, damn.

Also, any prompt suggestions? And the outputs look like Florence2 and DOCCI.

Hello! This is the 224x224 model trained on the full DOCCI dataset (entire dataset, no splits) and “enriched” with ~30 images of gay porn with DOCCI-style captions (not joking, the model is meant for captioning gay porn). The “enrichment” dataset was combined with DOCCI using `concatenate_datasets` and then shuffled, roughly as sketched below.
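
A minimal sketch of that combination step with the Hugging Face `datasets` API, assuming the enrichment set shares DOCCI's columns (the enrichment repo name below is a placeholder):

```python
from datasets import load_dataset, concatenate_datasets

# "google/docci" is the DOCCI dataset on the Hub; the enrichment repo
# name is a placeholder for the ~30-image dataset.
docci = load_dataset("google/docci")
enrichment = load_dataset("your-username/enrichment-set", split="train")

# Use every DOCCI split ("entire dataset, no splits"), stack the rows
# of both datasets (their features must match), then shuffle.
full_docci = concatenate_datasets([docci[split] for split in docci])
combined = concatenate_datasets([full_docci, enrichment]).shuffle(seed=42)
```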

The model was trained for 10 epochs with a per-device batch size of 4 and 24 gradient accumulation steps (an effective batch size of 4 × 24 = 96).
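
In `transformers` terms that corresponds to roughly these `TrainingArguments`; only the three numbers above are from the run, the rest are assumptions:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="paligemma-docci-ft",  # placeholder
    num_train_epochs=10,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=24,   # effective batch size: 4 * 24 = 96
    bf16=True,                        # assumption: bf16 on the A100
)
```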

The model’s language tower and multimodal projector were trained, while the vision tower was frozen.
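
Freezing just the vision tower looks something like this in `transformers` (the base checkpoint here is an assumption; PaliGemma exposes `vision_tower`, `multi_modal_projector`, and `language_model` as submodules):

```python
from transformers import PaliGemmaForConditionalGeneration

model = PaliGemmaForConditionalGeneration.from_pretrained(
    "google/paligemma-3b-pt-224"  # assumed base checkpoint
)

# Freeze the vision tower; the language model and multimodal
# projector keep requires_grad=True and are updated during training.
for param in model.vision_tower.parameters():
    param.requires_grad = False
```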

The model was trained in Colab on an A100 and took a little over 5 hours. Loss was ~1.89 when training completed.

As for the prompt, this was trained on the PaliGemma “caption” prompt. You might still be able to use the model for VQA or open-vocabulary detection, though the model was not trained for that.
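
Captioning with it would look something like this (the model id is a placeholder, and I'm assuming the stock "caption en" task prefix):

```python
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "your-username/paligemma-docci-224"  # placeholder
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).eval()
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")
inputs = processor(text="caption en", images=image, return_tensors="pt")

output = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens before decoding the caption.
input_len = inputs["input_ids"].shape[-1]
print(processor.decode(output[0][input_len:], skip_special_tokens=True))
```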

That's weird, I would swear you made something with Florence2 because the output really resembles it. 30 images of gay porn enhancing the descriptions of people is weird though XD. Just using DOCCI I got, for example, leg positions that were incorrect, but not with this one or the 1.5 one lol. Such a small amount making such a difference? What is `concatenate_datasets`? You mean you joined the 2 datasets together using the `concatenate_datasets` function?


nicolollo changed discussion status to closed
nicolollo changed discussion status to open
