update

dist/index.html CHANGED (+9 -3)
@@ -48,7 +48,7 @@
 </div>
 </d-title>
 <d-byline></d-byline>
-
+<d-article>
 <d-contents>
 </d-contents>
 
@@ -232,7 +232,10 @@
 <aside>As we’ll see later, these steps may be repeated or intertwined but for now we’ll start simple.</aside>
 
 <p>It looks generally like this: </p>
-
+
+<div class="svg-container" id="svg-first_steps_simple_training"> </div>
+<div class="info" id="info">Hover over the network elements to see their details</div>
+<script src="../assets/images/first_steps_simple_training.js"></script>
 
 <p>In this figure, the boxes on the top line can be seen as successive layers inside a model (same for the last line). The red boxes are the associated gradients for each of these layers, computed during the backward pass.</p>
 
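The figure paragraph in the hunk above describes successive layers whose gradients (the red boxes) only come into existence during the backward pass. A minimal PyTorch sketch of that idea, with layer sizes and a dummy loss that are purely illustrative and not taken from the article's model:

import torch
import torch.nn as nn

# Three successive layers, like the top row of boxes in the figure.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))

x = torch.randn(8, 16)           # dummy input batch (illustrative size)
loss = model(x).pow(2).mean()    # forward pass: activations are computed layer by layer

# Before the backward pass no gradients exist yet (the "red boxes" are empty).
assert all(p.grad is None for p in model.parameters())

loss.backward()                  # backward pass: a gradient is produced for each layer

for name, p in model.named_parameters():
    # One gradient tensor per parameter, with the same shape as the parameter.
    print(name, tuple(p.grad.shape))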
@@ -294,7 +297,10 @@
 
 <p>Using this snippet [TODO: link to appendix A5], we can understand how memory is allocated throughout training. We can see that memory utilization is not a static thing but varies a lot during training and during a training step:</p>
 
-<p><img alt="llama-1b-memory.png" src="assets/images/placeholder.png" /></p>
+<div class="svg-container l-body-outset" id="svg-first_steps_memory_profile"> </div>
+<script src="../assets/images/first_steps_memory_profile.js"></script>
+
+<!-- <p><img alt="llama-1b-memory.png" src="assets/images/placeholder.png" /></p> -->
 
 <p>Clearly the first step looks very different from the subsequent ones, but let’s first have a look at the general anatomy of a step: first the activations increase quickly as we do the forward pass, then during the backward pass the gradients build up and as the backward pass propagates, the stored activations used to compute the gradients are progressively cleared. Finally, we perform the optimization step during which we need all the gradients and then update the optimizer states before we start the next forward pass. </p>
 
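The memory-profile figure wired in above is produced by a snippet the article links from its appendix (not reproduced in this diff). As a rough sketch of the same idea, not the article's actual code, one can sample torch.cuda.memory_allocated() at the phase boundaries of each training step; this needs a CUDA device, and the model and batch sizes below are only illustrative:

import torch
import torch.nn as nn

def log_mem(tag):
    # Current allocated CUDA memory, in MiB, at this point of the step.
    print(f"{tag:<16} {torch.cuda.memory_allocated() / 2**20:8.1f} MiB")

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024)).cuda()
optimizer = torch.optim.AdamW(model.parameters())
batch = torch.randn(64, 1024, device="cuda")

for step in range(3):
    log_mem("start of step")
    loss = model(batch).square().mean()   # forward: activations pile up
    log_mem("after forward")
    loss.backward()                       # backward: gradients build up, activations are freed
    log_mem("after backward")
    optimizer.step()                      # optimizer states are created on the very first step
    optimizer.zero_grad(set_to_none=True)
    log_mem("after optimizer")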
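One reason the first step can look different from the later ones is that optimizer states are allocated lazily: AdamW only creates its moment buffers the first time step() runs, after which roughly two extra parameter-sized buffers stay resident. A small sketch of that effect (sizes illustrative, runs on CPU):

import torch
import torch.nn as nn

model = nn.Linear(1024, 1024)
optimizer = torch.optim.AdamW(model.parameters())

print(len(optimizer.state))   # 0: no optimizer state exists before the first step

model(torch.randn(8, 1024)).sum().backward()
optimizer.step()

# After the first step each parameter carries extra state, including the two
# moment buffers exp_avg and exp_avg_sq kept in memory from now on.
for p, s in optimizer.state.items():
    print(tuple(p.shape), sorted(s.keys()))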