thomwolf committed
Commit 4c79552 · 1 Parent(s): 49d9c37
Files changed (1)
  1. dist/index.html +9 -3
dist/index.html CHANGED
@@ -48,7 +48,7 @@
 </div>
 </d-title>
 <d-byline></d-byline>
-<d-article>
+<d-article>
 <d-contents>
 </d-contents>
 
@@ -232,7 +232,10 @@
 <aside>As we’ll see later, these steps may be repeated or intertwined but for now we’ll start simple.</aside>
 
 <p>It looks generally like this: </p>
-<p><img alt="image.png" src="assets/images/placeholder.png" /></p>
+
+<div class="svg-container" id="svg-first_steps_simple_training"> </div>
+<div class="info" id="info">Hover over the network elements to see their details</div>
+<script src="../assets/images/first_steps_simple_training.js"></script>
 
 <p>In this figure, the boxes on the top line can be seen as successive layers inside a model (same for the last line). The red boxes are the associated gradients for each of these layers, computed during the backward pass.</p>
 
@@ -294,7 +297,10 @@
 
 <p>Using this snippet [TODO: link to appendix A5], we can understand how memory is allocated throughout training. We can see that memory utilization is not a static thing but varies a lot during training and during a training step:</p>
 
-<p><img alt="llama-1b-memory.png" src="assets/images/placeholder.png" /></p>
+<div class="svg-container l-body-outset" id="svg-first_steps_memory_profile"> </div>
+<script src="../assets/images/first_steps_memory_profile.js"></script>
+
+<!-- <p><img alt="llama-1b-memory.png" src="assets/images/placeholder.png" /></p> -->
 
 <p>Clearly the first step looks very different from the subsequent ones, but let’s first have a look at the general anatomy of a step: first the activations increase quickly as we do the forward pass, then during the backward pass the gradients build up and as the backward pass propagates, the stored activations used to compute the gradients are progressively cleared. Finally, we perform the optimization step during which we need all the gradients and then update the optimizer states before we start the next forward pass. </p>
 
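The memory-profile hunk above refers to a profiling snippet whose appendix link is still a TODO. Purely as an illustration of the behaviour the surrounding paragraphs describe (activations growing during the forward pass, gradients accumulating and activations being freed during the backward pass, optimizer states appearing at the first optimizer step), here is a minimal sketch that tracks allocated and peak CUDA memory around each phase; the model, optimizer and tensor sizes are hypothetical stand-ins, not taken from the article:

# A minimal sketch (assumed setup, not the article's appendix A5 snippet) of how the
# per-phase memory behaviour can be observed with PyTorch's public CUDA memory counters.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
batch = torch.randn(64, 4096, device="cuda")
target = torch.randn(64, 4096, device="cuda")

def report(tag: str) -> None:
    torch.cuda.synchronize()
    print(f"{tag:>16}: {torch.cuda.memory_allocated() / 2**20:8.1f} MiB allocated / "
          f"{torch.cuda.max_memory_allocated() / 2**20:8.1f} MiB peak")

for step in range(3):
    torch.cuda.reset_peak_memory_stats()
    report(f"step {step} start")
    loss = nn.functional.mse_loss(model(batch), target)  # forward pass: activations accumulate
    report("after forward")
    loss.backward()                                       # backward pass: gradients build up while
    report("after backward")                              # stored activations are progressively freed
    optimizer.step()                                      # optimizer states are created lazily on the
    optimizer.zero_grad(set_to_none=True)                 # first step, so step 0 peaks differently
    report("after optimizer")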