lilelife committed
Commit 3ddb680 • 1 Parent(s): 65d5e00

Update README.md

Files changed (1): README.md (+215 -3)

README.md CHANGED
@@ -1,3 +1,215 @@
- ---
- license: mit
- ---

# OmniBooth

> OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction <br>
> [Leheng Li](https://len-li.github.io), Weichao Qiu, Xu Yan, Jing He, Kaiqiang Zhou, Yingjie CAI, Qing LIAN, Bingbing Liu, Ying-Cong Chen

OmniBooth is a project focused on synthesizing image data following multi-modal instructions. Users can use text or images to control instance generation. This repository provides tools and scripts to process data, train models, and generate synthetic images from the COCO dataset or from self-designed data.

#### [Project Page](https://len-li.github.io/omnibooth-web) | [Paper](https://arxiv.org/) | [Video](https://len-li.github.io/omnibooth-web/videos/teaser-user-draw.mp4) | [Checkpoint](https://huggingface.co/lilelife/Omnibooth)

Code: https://github.com/Len-Li/OmniBooth

## Table of Contents

- [Installation](#installation)
- [Prepare Dataset](#prepare-dataset)
- [Prepare Checkpoint](#prepare-checkpoint)
- [Train](#train)
- [Inference](#inference)
- [Behavior analysis](#behavior-analysis)
- [Instance data structure](#instance-data-structure)

## Installation

To get started with OmniBooth, follow these steps:

1. **Clone the repository:**
   ```bash
   git clone https://github.com/Len-Li/OmniBooth.git
   cd OmniBooth
   ```

2. **Set up an environment:**
   ```bash
   pip install torch torchvision transformers
   pip install diffusers==0.26.0.dev0
   # We use an old version of diffusers; please keep this exact version.

   pip install albumentations pycocotools
   pip install git+https://github.com/cocodataset/panopticapi.git
   ```
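
To confirm that the pinned diffusers build is the one actually picked up by your environment, a quick check such as the following may help (a minimal sketch; the printed versions will vary by setup):

```bash
# Optional sanity check: print the installed versions of the core dependencies.
python -c "import torch, diffusers, transformers; print(torch.__version__, diffusers.__version__, transformers.__version__)"
```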

## Prepare Dataset

You can skip this step if you just want to run a demo generation. I've prepared demo masks in `data/instance_dataset` for generation. Please see [Inference](#inference).

To train OmniBooth, follow the steps below:

1. **Download the [COCONut](https://github.com/bytedance/coconut_cvpr2024/blob/main/preparing_datasets.md) dataset:**

   We use the COCONut-S split.
   Please download the COCONut-S file and relabeled-COCO-val from [here](https://github.com/bytedance/coconut_cvpr2024?tab=readme-ov-file#dataset-splits) and put them in the `data/coconut_dataset` folder. I recommend using the [Kaggle](https://www.kaggle.com/datasets/xueqingdeng/coconut) link; a download sketch with the Kaggle CLI follows this list.

2. **Download the COCO dataset:**
   ```bash
   cd data/coconut_dataset
   mkdir coco && cd coco

   wget http://images.cocodataset.org/zips/train2017.zip
   wget http://images.cocodataset.org/zips/val2017.zip
   wget http://images.cocodataset.org/annotations/annotations_trainval2017.zip

   unzip train2017.zip && unzip val2017.zip
   unzip annotations_trainval2017.zip
   ```
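
For step 1, one possible route is the official Kaggle CLI. This is only a sketch under a few assumptions: the `kaggle` package is installed, an API token is configured in `~/.kaggle/kaggle.json`, and the dataset slug matches the Kaggle link above; you may still need to rearrange the extracted folders afterwards.

```bash
# Hedged sketch: pull the COCONut dataset from Kaggle into data/coconut_dataset.
pip install kaggle
kaggle datasets download -d xueqingdeng/coconut -p data/coconut_dataset --unzip
# Move/rename the extracted COCONut-S and relabeled-COCO-val folders if needed so
# they match the directory structure shown below.
```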

After preparation, you will be able to see the following directory structure:

```
OmniBooth/
├── data/
│   ├── instance_dataset/
│   ├── coconut_dataset/
│   │   ├── coco/
│   │   ├── coconut_s/
│   │   ├── relabeled_coco_val/
│   │   ├── annotations/
│   │   │   ├── coconut_s.json
│   │   │   ├── relabeled_coco_val.json
│   │   │   ├── my-train.json
│   │   │   ├── my-val.json
```

## Prepare Checkpoint

Our model is based on [stable-diffusion-xl-base-1.0](https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0). We additionally use [sdxl-vae-fp16-fix](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix) to avoid numerical issues in VAE decoding. Please download the two models and put them at `./OmniBooth/ckp/`.

Our OmniBooth checkpoint is released on [Hugging Face](https://huggingface.co/lilelife/OmniBooth). If you want to use our model to run inference, please also put it at `./OmniBooth/ckp/`.
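
One way to fetch all three model repositories is the `huggingface-cli` tool from `huggingface_hub`. Treat this as a sketch under assumptions (the CLI is installed and your version supports `--local-dir`); cloning the repositories with git and Git LFS works just as well.

```bash
# Hedged sketch: download the base model, the fixed VAE, and the OmniBooth weights into ./ckp/.
pip install -U "huggingface_hub[cli]"
huggingface-cli download stabilityai/stable-diffusion-xl-base-1.0 --local-dir ckp/stable-diffusion-xl-base-1.0
huggingface-cli download madebyollin/sdxl-vae-fp16-fix --local-dir ckp/sdxl-vae-fp16-fix
# Adjust the target folder below if the inference scripts expect a different layout.
huggingface-cli download lilelife/OmniBooth --local-dir ckp/omnibooth
```

The first two target folders match `MODEL_DIR` and `VAE_DIR` in `train.sh` below.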

## Train

```bash
bash train.sh
```
The details of the script are as follows:
```bash
export MODEL_DIR="./ckp/stable-diffusion-xl-base-1.0"
export VAE_DIR="./ckp/sdxl-vae-fp16-fix"

export EXP_NAME="omnibooth_train"
export OUTPUT_DIR="./ckp/$EXP_NAME"

accelerate launch --gpu_ids 0, --num_processes 1 --main_process_port 3226 train.py \
    --pretrained_model_name_or_path=$MODEL_DIR \
    --pretrained_vae_model_name_or_path=$VAE_DIR \
    --output_dir=$OUTPUT_DIR \
    --width=1024 \
    --height=1024 \
    --patch_size=364 \
    --learning_rate=4e-5 \
    --num_train_epochs=12 \
    --train_batch_size=1 \
    --mulscale_batch_size=2 \
    --mixed_precision="fp16" \
    --num_validation_images=2 \
    --validation_steps=500 \
    --checkpointing_steps=5000 \
    --checkpoints_total_limit=10 \
    --ctrl_channel=1024 \
    --use_sdxl=True \
    --enable_xformers_memory_efficient_attention \
    --report_to='wandb' \
    --resume_from_checkpoint="latest" \
    --tracker_project_name="omnibooth-demo"
```

Training takes about 3 days on 8 NVIDIA A100 GPUs. We use a batch size of 2; the image height is set to 1024 and the image width follows the aspect ratio of the ground-truth image. Each GPU uses about 65 GB of memory.
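
The script above launches a single process on GPU 0, while the 3-day schedule refers to 8 GPUs. A possible multi-GPU invocation is sketched below; only the `accelerate` launcher flags change, and whether to also retune the batch size or learning rate is left to you.

```bash
# Hedged sketch: launch the same train.py arguments as in train.sh across 8 GPUs.
accelerate launch --multi_gpu --num_processes 8 --main_process_port 3226 train.py \
    --pretrained_model_name_or_path=$MODEL_DIR \
    --pretrained_vae_model_name_or_path=$VAE_DIR \
    --output_dir=$OUTPUT_DIR \
    --width=1024 --height=1024 --patch_size=364 \
    --learning_rate=4e-5 --num_train_epochs=12 \
    --train_batch_size=1 --mulscale_batch_size=2 \
    --mixed_precision="fp16" \
    --num_validation_images=2 --validation_steps=500 \
    --checkpointing_steps=5000 --checkpoints_total_limit=10 \
    --ctrl_channel=1024 --use_sdxl=True \
    --enable_xformers_memory_efficient_attention \
    --report_to='wandb' --resume_from_checkpoint="latest" \
    --tracker_project_name="omnibooth-demo"
```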

## Inference

```bash
bash infer.sh
```
You will find the generated images at `./vis_dir/`. An example is shown below:
![image](./ckp/plane.jpg)


## Behavior analysis

1. The text instruction is not perfect: it works well for attribute descriptions such as color, but it struggles with more fine-grained descriptions. Scaling the data and the model should help with this problem.
2. The image instruction may result in generated images with washed-out colors, possibly due to brightness augmentation. This can be mitigated by editing the global prompt, e.g. appending 'a brighter image' (a sketch follows this list).
3. Video datasets. Ideally, we would use video datasets to train image-instructed generation, similar to AnyDoor. However, in our multi-modal setting, the cost of obtaining video datasets + tracking annotations + panoptic annotations is relatively high, so we only trained our model on the single-view COCO dataset. If you plan to expand the training data to video datasets, please let me know.
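
As a concrete, hypothetical example of the adjustment in point 2, the snippet below appends a brightness hint to the `global_prompt` of one of the demo folders; the path and the exact wording are assumptions, not a fixed recipe.

```bash
# Hedged sketch: nudge the global prompt of a demo instance folder toward a brighter result.
python - <<'EOF'
import json

path = "data/instance_dataset/plane/prompt_dict.json"  # demo folder shipped with the repo

with open(path) as f:
    prompts = json.load(f)

# Append a brightness hint to counteract washed-out colors from image instructions.
prompts["global_prompt"] += ", a brighter image"

with open(path, "w") as f:
    json.dump(prompts, f, indent=4)
EOF
```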

## Instance data structure

I provide several instance mask datasets for inference in `data/instance_dataset`. This data is converted from the COCO dataset. The data structure is as follows:

```
# use data/instance_dataset/plane as an example
0000_mask.png
0000.png
0001_mask.png
0001.png
0002_mask.png
0002.png
...
prompt_dict.json
```
The mask files are binary masks that indicate the instance locations. The image files are optional image references; pass `--text_or_img=img` to use them.

The `prompt_dict.json` file is a dictionary containing the per-instance prompts and the global prompt. Each prompt is a string that describes an instance or the whole image. For example, `prompt_dict.json` looks as follows:

```json
{
    "prompt_0": "a plane is silhouetted against a cloudy sky",
    "prompt_1": "a road",
    "prompt_2": "a pavement of merged",
    "global_prompt": "large mustard yellow commercial airplane parked in the airport"
}
```
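
To build your own instance folder, you can mirror this layout. The sketch below is an example under assumptions: the folder name, the copied mask files, and the prompts are all placeholders. After creating the folder, run inference the same way as with the demo folders (see `infer.sh`), using `--text_or_img=img` if you also supply image references.

```bash
# Hedged sketch: create a custom instance folder that mirrors data/instance_dataset/plane.
mkdir -p data/instance_dataset/my_scene
cp /path/to/instance0_mask.png data/instance_dataset/my_scene/0000_mask.png
cp /path/to/instance1_mask.png data/instance_dataset/my_scene/0001_mask.png

cat > data/instance_dataset/my_scene/prompt_dict.json <<'EOF'
{
    "prompt_0": "a red sports car parked on the street",
    "prompt_1": "a tree-lined sidewalk",
    "global_prompt": "a sunny street scene in the early morning"
}
EOF
```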

## Acknowledgment

Additionally, we express our gratitude to the authors of the following open-source projects:

- [Diffusers controlnet example](https://github.com/huggingface/diffusers/tree/main/examples/controlnet) (ControlNet training script)
- [COCONut](https://github.com/bytedance/coconut_cvpr2024) (Panoptic mask annotation)
- [SyntheOcc](https://len-li.github.io/syntheocc-web/) (Network structure)

## BibTeX

```bibtex
@inproceedings{li2024OmniBooth,
  title={OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction},
  author={Li, Leheng and Qiu, Weichao and Chen, Ying-Cong and others},
  booktitle={arXiv preprint},
  year={2024}
}
```

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

---
license: mit
---