Style vs Content via random rotation of crops?
Perhaps instead of 5 random crops, there could be 1 encode of the whole image and then 4 randomly rotated and flipped crops.
The idea is that the single encode of the whole image would capture the layout, while the 4 randomly rotated and flipped crops would capture additional detail and style, where "style" means anything that isn't subject to change under rotation.
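As a rough sketch of that augmentation scheme (the crop size, number of crops, and the idea of restricting rotations to 90-degree steps are all assumptions, not anything specified above):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotated_flipped_crop(img, size, rng):
    """Take a random square crop, then apply a random 90-degree rotation
    and an optional horizontal flip."""
    h, w = img.shape[:2]
    y = int(rng.integers(0, h - size + 1))
    x = int(rng.integers(0, w - size + 1))
    crop = img[y:y + size, x:x + size]
    crop = np.rot90(crop, k=int(rng.integers(0, 4)))  # 0/90/180/270 degrees
    if rng.integers(0, 2):
        crop = np.fliplr(crop)
    return crop

def build_views(img, n_crops=4, crop_size=64, rng=rng):
    """One whole-image view (for layout) plus n rotated/flipped crops
    (for detail/style). Each view would then go through the image encoder."""
    return [img] + [random_rotated_flipped_crop(img, crop_size, rng)
                    for _ in range(n_crops)]

img = rng.random((128, 128, 3))
views = build_views(img)
print(len(views), views[1].shape)  # 5 (64, 64, 3)
```

Restricting to 90-degree rotations keeps the crops square without interpolation; free-angle rotation would also work but needs padding or a larger source crop.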
The reason I believe content/style isn't preserved is that when given paintings, like Disco Elysium portraits, it returns paintings in a different style. When given two similar portraits of faces facing the same direction, it can generate a distorted face facing the opposite direction.
Perhaps this plus the original method of 5 random crops would allow for more expression of style. Maybe only some crops get randomly rotated. Maybe a few randomly rotated crops get their embeddings averaged. Maybe encodings of the whole image get a learned scale/bias applied, encodings of crops get their own learned scale/bias, and if the input is only ever 5 random crops it gets the average of the two scale/bias pairs applied, like a sort of mixup augmentation. Maybe when using randomly rotated crops plus the whole image, the strength of the rotated crops could be scaled down, so the model learns to be more sensitive to the parts of the embedding linked to style.
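A minimal sketch of that conditioning idea, assuming per-source affine parameters that would in practice be learned (the names `whole_scale`, `crop_scale`, `crop_strength`, and the embedding dimension are all hypothetical):

```python
import numpy as np

D = 8  # toy embedding dimension

# Hypothetical learned parameters: one affine per embedding source.
whole_scale, whole_bias = np.full(D, 1.0), np.zeros(D)
crop_scale, crop_bias = np.full(D, 1.0), np.zeros(D)

def condition(whole_emb, crop_embs, crop_strength=0.5):
    """Whole-image + rotated-crops mode: apply each source's affine,
    then down-weight and average the crop embeddings into one
    'style' vector alongside the 'layout' vector."""
    whole = whole_emb * whole_scale + whole_bias
    crops = np.stack([e * crop_scale + crop_bias for e in crop_embs])
    style = crop_strength * crops.mean(axis=0)
    return np.stack([whole, style])

def condition_crops_only(crop_embs):
    """Original 5-random-crop mode: apply the average of the two
    affines, as a mixup-style middle ground between the sources."""
    scale = (whole_scale + crop_scale) / 2
    bias = (whole_bias + crop_bias) / 2
    return np.stack([e * scale + bias for e in crop_embs])

rng = np.random.default_rng(0)
whole = rng.random(D)
crops = [rng.random(D) for _ in range(4)]
cond = condition(whole, crops)
print(cond.shape)  # (2, 8)
```

The `crop_strength` knob is the "scale down the rotated crops" idea from above: with a weaker crop signal, the model would have to rely on the whole-image embedding for layout and could learn to read the crop embeddings mainly for style.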