patrickvonplaten committed
Commit b2b0c4e · 1 Parent(s): e3e102c

Update README.md

Files changed (1)
  1. README.md +38 -47
README.md CHANGED
@@ -6,8 +6,14 @@ tags:
  inference: false
  ---

- # Stable Diffusion v1 Model Card
- This model card focuses on the model associated with the Stable Diffusion model, available [here](https://github.com/CompVis/stable-diffusion).

  ## Model Details
  - **Developed by:** Robin Rombach, Patrick Esser
@@ -27,28 +33,36 @@ This model card focuses on the model associated with the Stable Diffusion model,
  pages = {10684-10695}
  }

- ## Usage examples

  ```bash
  pip install --upgrade diffusers transformers scipy
  ```

  Run this command to log in with your HF Hub token if you haven't before:
  ```bash
  huggingface-cli login
  ```

  Running the pipeline with the default PLMS scheduler:
  ```python
  from torch import autocast
  from diffusers import StableDiffusionPipeline

- model_id = "CompVis/stable-diffusion-v1-3-diffusers"
- pipe = StableDiffusionPipeline.from_pretrained(model_id, use_auth_token=True).to("cuda")

  prompt = "a photograph of an astronaut riding a horse"
  with autocast("cuda"):
-     image = pipe(prompt, guidance_scale=7)["sample"][0] # image here is in PIL format

  image.save(f"astronaut_rides_horse.png")
  ```
@@ -58,10 +72,11 @@ To swap out the noise scheduler, pass it to `from_pretrained`:
  ```python
  from diffusers import StableDiffusionPipeline, LMSDiscreteScheduler

- model_id = "CompVis/stable-diffusion-v1-3-diffusers"
  # Use the K-LMS scheduler here instead
  scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000)
- pipe = StableDiffusionPipeline.from_pretrained(model_id, scheduler=scheduler, use_auth_token=True).to("cuda")
  ```

  # Uses
@@ -83,8 +98,10 @@ _Note: This section is taken from the [DALLE-MINI model card](https://huggingfac


  The model should not be used to intentionally create or disseminate images that create hostile or alienating environments for people. This includes generating images that people would foreseeably find disturbing, distressing, or offensive; or content that propagates historical or current stereotypes.
  #### Out-of-Scope Use
  The model was not trained to be factual or true representations of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model.
  #### Misuse and Malicious Use
  Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to:
@@ -113,6 +130,7 @@ Using the model to generate content that is cruel to individuals is a misuse of
  considerations.

  ### Bias
  While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases.
  Stable Diffusion v1 was trained on subsets of [LAION-2B(en)](https://laion.ai/blog/laion-5b/),
  which consists of images that are primarily limited to English descriptions.
@@ -123,29 +141,29 @@ ability of the model to generate content with non-English prompts is significant

  ## Training

- **Training Data**
  The model developers used the following dataset for training the model:

  - LAION-2B (en) and subsets thereof (see next section)

- **Training Procedure**
- Stable Diffusion v1 is a latent diffusion model which combines an autoencoder with a diffusion model that is trained in the latent space of the autoencoder. During training,

  - Images are encoded through an encoder, which turns images into latent representations. The autoencoder uses a relative downsampling factor of 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4
  - Text prompts are encoded through a ViT-L/14 text-encoder.
  - The non-pooled output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
  - The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet.

- We currently provide three checkpoints, `sd-v1-1.ckpt`, `sd-v1-2.ckpt` and `sd-v1-3.ckpt`,
- which were trained as follows,
-
- - `sd-v1-1.ckpt`: 237k steps at resolution `256x256` on [laion2B-en](https://huggingface.co/datasets/laion/laion2B-en).
- 194k steps at resolution `512x512` on [laion-high-resolution](https://huggingface.co/datasets/laion/laion-high-resolution) (170M examples from LAION-5B with resolution `>= 1024x1024`).
- - `sd-v1-2.ckpt`: Resumed from `sd-v1-1.ckpt`.
- 515k steps at resolution `512x512` on "laion-improved-aesthetics" (a subset of laion2B-en,
  filtered to images with an original size `>= 512x512`, estimated aesthetics score `> 5.0`, and an estimated watermark probability `< 0.5`. The watermark estimate is from the LAION-5B metadata, the aesthetics score is estimated using an [improved aesthetics estimator](https://github.com/christophschuhmann/improved-aesthetic-predictor)).
- - `sd-v1-3.ckpt`: Resumed from `sd-v1-2.ckpt`. 195k steps at resolution `512x512` on "laion-improved-aesthetics" and 10\% dropping of the text-conditioning to improve [classifier-free guidance sampling](https://arxiv.org/abs/2207.12598).


  - **Hardware:** 32 x 8 x A100 GPUs
  - **Optimizer:** AdamW
@@ -172,33 +190,6 @@ Based on that information, we estimate the following CO2 emissions using the [Ma
  - **Compute Region:** US-east
  - **Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid):** 11250 kg CO2 eq.

- ## Usage
-
- ### Setup
-
- - Install `diffusers` with
-
- `pip install -U git+https://github.com/huggingface/diffusers.git`
- - Install `transformers` with
-
- `pip install transformers`
-
- ```python
- import torch
- from diffusers import StableDiffusionPipeline
-
- pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-3-diffusers")
-
- prompt = "19th Century wooden engraving of Elon musk"
-
- seed = torch.manual_seed(1024)
- images = pipe([prompt], num_inference_steps=50, guidance_scale=7.5, generator=seed)["sample"]
-
- # save images
- for idx, image in enumerate(images):
-     image.save(f"image-{idx}.png")
- ```
-

  ## Citation
@@ -213,4 +204,4 @@ for idx, image in enumerate(images):
  }
  ```

- *This model card was written by: Robin Rombach and Patrick Esser and is based on the [DALL-E Mini model card](https://huggingface.co/dalle-mini/dalle-mini).*
 
  inference: false
  ---

+ # Stable Diffusion v1-3 Model Card
+
+ Stable Diffusion is a latent text-to-image diffusion model capable of generating photo-realistic images given any text input.
+ For more information about how Stable Diffusion functions, please have a look at [🤗's Stable Diffusion with D🧨iffusers blog](https://huggingface.co/blog/stable_diffusion).
+
+ The **Stable-Diffusion-v1-3** checkpoint was initialized with the weights of the [Stable-Diffusion-v1-2](https://huggingface.co/CompVis/stable-diffusion-v1-2)
+ checkpoint and subsequently fine-tuned for 195,000 steps at resolution `512x512` on "laion-improved-aesthetics" with 10% dropping of the text-conditioning to improve [classifier-free guidance sampling](https://arxiv.org/abs/2207.12598).
+ For more information, please refer to [Training](#training).

  ## Model Details
  - **Developed by:** Robin Rombach, Patrick Esser
 
  pages = {10684-10695}
  }

+ ## Examples
+
+ We recommend using [🤗's Diffusers library](https://github.com/huggingface/diffusers) to run Stable Diffusion.

  ```bash
  pip install --upgrade diffusers transformers scipy
  ```

  Run this command to log in with your HF Hub token if you haven't before:
+
  ```bash
  huggingface-cli login
  ```

  Running the pipeline with the default PLMS scheduler:
  ```python
+ import torch
  from torch import autocast
  from diffusers import StableDiffusionPipeline

+ model_id = "CompVis/stable-diffusion-v1-3"
+ device = "cuda"
+
+ generator = torch.Generator(device=device).manual_seed(0)
+ pipe = StableDiffusionPipeline.from_pretrained(model_id, use_auth_token=True)
+ pipe = pipe.to(device)

  prompt = "a photograph of an astronaut riding a horse"
  with autocast("cuda"):
+     image = pipe(prompt, generator=generator)["sample"][0] # seeded generator makes the result reproducible; image here is in PIL format

  image.save(f"astronaut_rides_horse.png")
  ```
 
  ```python
  from diffusers import StableDiffusionPipeline, LMSDiscreteScheduler

+ model_id = "CompVis/stable-diffusion-v1-3"
  # Use the K-LMS scheduler here instead
  scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000)
+ pipe = StableDiffusionPipeline.from_pretrained(model_id, scheduler=scheduler, use_auth_token=True)
+ pipe = pipe.to("cuda")
  ```
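
For readers unfamiliar with the `beta_schedule="scaled_linear"` argument used above, here is a minimal pure-Python sketch of how such a schedule is commonly computed: the betas are spaced linearly in square-root space between `beta_start` and `beta_end`, then squared. The helper names `scaled_linear_betas` and `alphas_cumprod` are hypothetical and are not part of `diffusers`; the sketch only illustrates the values those arguments imply.

```python
import math


def scaled_linear_betas(beta_start=0.00085, beta_end=0.012, num_train_timesteps=1000):
    """Return betas spaced linearly in sqrt-space, then squared (hypothetical helper)."""
    lo, hi = math.sqrt(beta_start), math.sqrt(beta_end)
    step = (hi - lo) / (num_train_timesteps - 1)
    return [(lo + i * step) ** 2 for i in range(num_train_timesteps)]


def alphas_cumprod(betas):
    """Cumulative product of (1 - beta_t): the share of signal variance kept at step t."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out


betas = scaled_linear_betas()
signal = alphas_cumprod(betas)
print(f"beta[0]={betas[0]:.5f}, beta[-1]={betas[-1]:.5f}, alphas_cumprod[-1]={signal[-1]:.4f}")
```

The schedule increases monotonically, so later timesteps add more noise and retain less of the original latent.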
 
  # Uses
 
  The model should not be used to intentionally create or disseminate images that create hostile or alienating environments for people. This includes generating images that people would foreseeably find disturbing, distressing, or offensive; or content that propagates historical or current stereotypes.
+
  #### Out-of-Scope Use
  The model was not trained to be factual or true representations of people or events, and therefore using the model to generate such content is out-of-scope for the abilities of this model.
+
  #### Misuse and Malicious Use
  Using the model to generate content that is cruel to individuals is a misuse of this model. This includes, but is not limited to:
 
  considerations.

  ### Bias
+
  While the capabilities of image generation models are impressive, they can also reinforce or exacerbate social biases.
  Stable Diffusion v1 was trained on subsets of [LAION-2B(en)](https://laion.ai/blog/laion-5b/),
  which consists of images that are primarily limited to English descriptions.
 
  ## Training

+ ### Training Data
  The model developers used the following dataset for training the model:

  - LAION-2B (en) and subsets thereof (see next section)

+ ### Training Procedure
+ Stable Diffusion v1-3 is a latent diffusion model which combines an autoencoder with a diffusion model that is trained in the latent space of the autoencoder. During training,

  - Images are encoded through an encoder, which turns images into latent representations. The autoencoder uses a relative downsampling factor of 8 and maps images of shape H x W x 3 to latents of shape H/f x W/f x 4
  - Text prompts are encoded through a ViT-L/14 text-encoder.
  - The non-pooled output of the text encoder is fed into the UNet backbone of the latent diffusion model via cross-attention.
  - The loss is a reconstruction objective between the noise that was added to the latent and the prediction made by the UNet.
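
The shape arithmetic and the training objective in the list above can be sketched in a few lines. This is an illustration only: it assumes a mean-squared-error form for the "reconstruction objective" and uses plain Python lists in place of tensors. `latent_shape` and `noise_prediction_loss` are hypothetical names, not functions from the Stable Diffusion code base.

```python
def latent_shape(height, width, f=8, latent_channels=4):
    """Map an H x W x 3 image shape to the H/f x W/f x 4 latent shape."""
    if height % f or width % f:
        raise ValueError("height and width must be divisible by the downsampling factor")
    return (height // f, width // f, latent_channels)


def noise_prediction_loss(true_noise, predicted_noise):
    """Mean squared error between the added noise and the UNet's prediction (assumed form)."""
    if len(true_noise) != len(predicted_noise):
        raise ValueError("noise vectors must have the same length")
    squared = [(t - p) ** 2 for t, p in zip(true_noise, predicted_noise)]
    return sum(squared) / len(squared)


print(latent_shape(512, 512))  # (64, 64, 4)
print(noise_prediction_loss([0.5, -1.0, 0.25], [0.5, -0.5, 0.0]))
```

With the downsampling factor of 8, a `512x512` training image corresponds to a `64x64` latent with 4 channels, which is the space the UNet operates in.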
 
+ We currently provide four checkpoints, which were trained as follows.
+ - [`stable-diffusion-v1-1`](https://huggingface.co/CompVis/stable-diffusion-v1-1): 237,000 steps at resolution `256x256` on [laion2B-en](https://huggingface.co/datasets/laion/laion2B-en).
+ 194,000 steps at resolution `512x512` on [laion-high-resolution](https://huggingface.co/datasets/laion/laion-high-resolution) (170M examples from LAION-5B with resolution `>= 1024x1024`).
+ - [`stable-diffusion-v1-2`](https://huggingface.co/CompVis/stable-diffusion-v1-2): Resumed from `stable-diffusion-v1-1`.
+ 515,000 steps at resolution `512x512` on "laion-improved-aesthetics" (a subset of laion2B-en,
  filtered to images with an original size `>= 512x512`, estimated aesthetics score `> 5.0`, and an estimated watermark probability `< 0.5`. The watermark estimate is from the LAION-5B metadata, the aesthetics score is estimated using an [improved aesthetics estimator](https://github.com/christophschuhmann/improved-aesthetic-predictor)).
+ - [`stable-diffusion-v1-3`](https://huggingface.co/CompVis/stable-diffusion-v1-3): Resumed from `stable-diffusion-v1-2`. 195,000 steps at resolution `512x512` on "laion-improved-aesthetics" with 10% dropping of the text-conditioning to improve [classifier-free guidance sampling](https://arxiv.org/abs/2207.12598).
+ - [**`stable-diffusion-v1-4`**](https://huggingface.co/CompVis/stable-diffusion-v1-4) *To-fill-here*
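
The 10% text-conditioning dropout listed for `stable-diffusion-v1-3` pairs with classifier-free guidance at sampling time. The sketch below is a rough illustration under two assumptions: that a dropped prompt is replaced by an empty string, and that guidance uses the usual combination of unconditional and text-conditioned noise predictions. `maybe_drop_prompt` and `guided_noise` are hypothetical helpers operating on plain lists, not part of any released training code.

```python
import random


def maybe_drop_prompt(prompt, drop_prob=0.1, rng=random):
    """During training, replace the prompt with an empty one with probability drop_prob."""
    return "" if rng.random() < drop_prob else prompt


def guided_noise(noise_uncond, noise_text, guidance_scale=7.5):
    """Classifier-free guidance: move from the unconditional prediction toward the text-conditioned one."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


rng = random.Random(0)
prompts = [maybe_drop_prompt("an astronaut riding a horse", rng=rng) for _ in range(10_000)]
drop_rate = prompts.count("") / len(prompts)
print(f"empirical drop rate: {drop_rate:.3f}")
print(guided_noise([0.0, 1.0], [1.0, 1.0], guidance_scale=7.5))
```

Because the model sees empty prompts during training, it can produce both predictions at inference; a `guidance_scale` of 0 returns the unconditional prediction, and larger values weight the prompt more heavily.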
 
+ ### Training details

  - **Hardware:** 32 x 8 x A100 GPUs
  - **Optimizer:** AdamW
 
  - **Compute Region:** US-east
  - **Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid):** 11250 kg CO2 eq.

  ## Citation
 
 
  }
  ```

+ *This model card was written by: Robin Rombach and Patrick Esser and is based on the [DALL-E Mini model card](https://huggingface.co/dalle-mini/dalle-mini).*