About training code

#8
by CaptainZZZ - opened

Hi,
Thanks for the excellent work! Have you considered making the training code open source?

alimama-creative org

Sorry, we have no plans to open-source the training code, but most of it is based on small modifications to the SD3 DreamBooth LoRA training script in the diffusers library; the script can be found here: https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/train_dreambooth_lora_sd3.py

Thanks a lot for the response!
I'm also curious: how did you generate the "controlnet-inpainting" dataset from the 12M laion2B images?

alimama-creative org

It is obtained by filtering with high thresholds for resolution, aesthetics, and CLIP scores; the masks are randomly generated.
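For readers wondering what "randomly generated" masks might look like in practice, here is a minimal sketch of one common approach (a single random rectangle). The authors' exact masking strategy is not described in this thread, so the function below is only an illustrative assumption.

# Minimal sketch of random mask generation for inpainting training data.
# The rectangle sizing and the overall strategy are assumptions, not the authors' recipe.
import numpy as np

def random_rect_mask(height=1024, width=1024, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    mask = np.zeros((height, width), dtype=np.uint8)
    # Pick a rectangle covering roughly 10-60% of each side.
    h = rng.integers(height // 10, int(height * 0.6))
    w = rng.integers(width // 10, int(width * 0.6))
    top = rng.integers(0, height - h)
    left = rng.integers(0, width - w)
    mask[top:top + h, left:left + w] = 1  # 1 marks the region to inpaint
    return mask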

Thanks so much for the reply!

Hi,
I have another question: what devices did you use for training, how many, and how many days did the training take?
Thanks a lot for the reply!

alimama-creative org

16 x A100 used for a week.

How long does it take to get a decent result? Maybe 1-3 days for the model to roughly converge? @ljp

Thanks so much for the reply. Another question: what high threshold did you set for resolution?
For example, above 512x512?

alimama-creative org

All images are cropped to 1024x1024.

Sorry, maybe my question wasn't clear. I wanted to ask whether you selected high-resolution images from LAION and then resized them to 1024x1024, as I found that LAION has a lot of low-resolution images.
I tried training from scratch with ~50k LAION images; after training, the model's output is semantically correct, but there is a lot of noise and it looks fuzzy. I think it may be related to training on low-resolution images?
Once again, thanks so much for the reply and guidance!

alimama-creative org

Select high-resolution images and then crop and resize them to 1024x1024.
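For concreteness, a minimal sketch of that preprocessing step, assuming a simple center crop (the thread does not say whether a center or random crop was used):

# Center-crop to a square and resize to 1024x1024 (assumed preprocessing, not the authors' exact code).
from PIL import Image

def to_1024_square(path):
    img = Image.open(path).convert("RGB")
    w, h = img.size
    side = min(w, h)
    left, top = (w - side) // 2, (h - side) // 2
    img = img.crop((left, top, left + side, top + side))
    return img.resize((1024, 1024), Image.LANCZOS)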

Can you demonstrate some of your experimental results? I feel like I've encountered issues similar to yours, and I'm not sure if it's a problem with SD3 ControlNet itself.

Thanks!

Sure.
I trained with nearly 30k LAION images; the ControlNet has 23 layers, and the command line is here:

accelerate launch train_controlnet_sd3.py \
  --pretrained_model_name_or_path=$MODEL_DIR \
  --output_dir=$OUTPUT_DIR \
  --train_data_dir="../CustomDataset" \
  --resolution=1024 \
  --learning_rate=1e-5 \
  --mixed_precision="fp16" \
  --max_train_steps=100000 \
  --train_batch_size=1 \
  --gradient_checkpointing \
  --dataloader_num_workers=4 \
  --gradient_accumulation_steps=4 \
  --checkpointing_steps 2000 \
  --report_to=tensorboard \
  --logging_dir="./tensorboard_log" \
  --resume_from_checkpoint=latest

I only have 1 A6000 48G, so I set the batch size to 1.
The model I tested was the checkpoint after 10k training steps, but it seems strange that after 5k steps the training loss oscillates and does not converge.
Here is a result; as I just explained, the output is low quality and looks blurry and blocky. What about your results?
Prompt: A cat is sitting next to a puppy.

1.png

I would be grateful if the author could give us some suggestions @ljp

Yeah, I encounter exactly the same issue as you: blocky lines in the background. I can find them in every single image I generate, so I really don't know why.

Hi,
Could you please share your training details (batch size, accelerate command line, training dataset, and number of training images)?
Also, which checkpoint step did you test, and had the model converged at that point?

May I have your email address or WeChat for further discussion?

email

Hi author,
May I ask whether the base SD3 model you used was the fp16 checkpoint or the fp32 checkpoint? I trained ControlNet with the SD3 fp16 checkpoint and found that the results were really bad, with images like the ones I mentioned above.
Looking forward to your reply. Thanks so much!

+1, I have the same issue with SD3 + ControlNet for background generation. But after switching to SD3 with a 33-channel transformer, the results can be good. I guess ControlNet is not supported well for SD3 yet.

Could you please explain what "SD3 (with 33-channel transformer)" is? Thanks!

16 x A100 used for a week.

You say that your batch_size is 192, so per A100 that would be a batch_size of 12, but I can only fit batch_size=6 on an A100 at 1024x1024 resolution with 23 transformer layers.
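As a side note, a per-GPU batch of 6 does not rule out an effective batch of 192, since gradient accumulation multiplies with the GPU count. A rough, purely illustrative calculation (these exact numbers are not confirmed by the authors):

# Effective (global) batch size = per-GPU batch x number of GPUs x gradient accumulation steps.
per_gpu_batch = 6        # what reportedly fits on one A100 at 1024x1024 with 23 ControlNet layers
num_gpus = 16
grad_accum_steps = 2     # assumed value needed to reach 192
effective_batch = per_gpu_batch * num_gpus * grad_accum_steps
print(effective_batch)   # 192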

Regarding the "33-channel transformer": like the sdxl-inpainting model, which uses 4+4+1 input channels (4 channels for noise, 4 channels for the masked image, and 1 channel for the 0-1 mask tensor). For SD3 the latent channel count is 16, so it is 16+16+1=33 channels.
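To make that layout concrete, here is a minimal PyTorch sketch of assembling such a 33-channel input; the tensor names and shapes are hypothetical and this is not the authors' actual code.

# Hypothetical sketch: concatenate noisy latents, masked-image latents, and the mask along channels.
import torch
import torch.nn.functional as F

batch = 1
noisy_latents = torch.randn(batch, 16, 128, 128)            # 16-channel SD3 latents of the noised image
masked_image_latents = torch.randn(batch, 16, 128, 128)     # 16-channel latents of the masked image
mask = torch.randint(0, 2, (batch, 1, 1024, 1024)).float()  # 0-1 mask at pixel resolution

# Downsample the mask to the latent resolution, then concatenate along the channel dimension.
mask_latent = F.interpolate(mask, size=noisy_latents.shape[-2:], mode="nearest")
model_input = torch.cat([noisy_latents, masked_image_latents, mask_latent], dim=1)
print(model_input.shape)  # torch.Size([1, 33, 128, 128])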

Could you please contact me by email at [email protected]? We can discuss SD3 ControlNet training, thanks.

Same for me, only a batch of 6 on one A100 80G.

Maybe you can try DeepSpeed to reduce the VRAM usage.
