Commit 9d6d725
Elie Bakouch committed
1 parent: 9f7c465

fix typo seq len

Files changed:
- dist/index.html +1 -1
- src/index.html +1 -1
dist/index.html CHANGED
@@ -518,7 +518,7 @@

 <p>For the exact derivation of the numbers, you can follow this original NVIDIA paper on recomputation <d-cite bibtex-key="korthikanti2022recomputation"></d-cite>, it essentially requires you to do some accounting of all the sizes of intermediate activations between each operation in a transformer layer.</p>

-<p>An interesting observation here is
+<p>An interesting observation here is that memory usage is not static for a given model; rather, it scales linearly with the batch size and quadratically with the sequence length. This means the activation memory is the part which will blow up when we increase our batch size or train with longer sequences. We can use this equation to look at how memory usage changes for various sequence lengths for example for Llama models (<code>bs=1</code>):</p>

 <div class="l-body-outset" id="fragment-memusage_activations"></div>
 <!-- <script>
src/index.html CHANGED
@@ -518,7 +518,7 @@

 <p>For the exact derivation of the numbers, you can follow this original NVIDIA paper on recomputation <d-cite bibtex-key="korthikanti2022recomputation"></d-cite>, it essentially requires you to do some accounting of all the sizes of intermediate activations between each operation in a transformer layer.</p>

-<p>An interesting observation here is
+<p>An interesting observation here is that memory usage is not static for a given model; rather, it scales linearly with the batch size and quadratically with the sequence length. This means the activation memory is the part which will blow up when we increase our batch size or train with longer sequences. We can use this equation to look at how memory usage changes for various sequence lengths for example for Llama models (<code>bs=1</code>):</p>

 <div class="l-body-outset" id="fragment-memusage_activations"></div>
 <!-- <script>
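The paragraph added by this commit refers to the activation-memory equation from the cited NVIDIA recomputation paper (Korthikanti et al., 2022). As a rough illustration of the scaling it describes (linear in batch size, quadratic in sequence length), here is a minimal Python sketch of that per-layer estimate; the activation_memory_gib helper and the Llama-style config (hidden=4096, 32 heads, 32 layers, bs=1) are illustrative assumptions, not values taken from the commit.

# Rough activation-memory estimate (mixed precision, no recomputation),
# following the per-layer accounting in Korthikanti et al. (2022):
#   bytes_per_layer ~= seq * bs * hidden * (34 + 5 * n_heads * seq / hidden)
# The model config used below is an illustrative assumption, not an exact model card.

def activation_memory_gib(seq_len: int, batch_size: int,
                          hidden: int, n_heads: int, n_layers: int) -> float:
    """Estimated activation memory (GiB) for one forward pass over all layers."""
    per_layer = seq_len * batch_size * hidden * (34 + 5 * n_heads * seq_len / hidden)
    return n_layers * per_layer / 2**30

if __name__ == "__main__":
    # Hypothetical 8B-class Llama-style config: hidden=4096, 32 heads, 32 layers, bs=1.
    for seq in (2048, 4096, 8192, 16384):
        gib = activation_memory_gib(seq, batch_size=1,
                                    hidden=4096, n_heads=32, n_layers=32)
        print(f"seq={seq:6d}  ~{gib:8.1f} GiB")

Under these assumptions, doubling the sequence length doubles the linear 34*seq*bs*hidden term but quadruples the attention term 5*n_heads*seq^2*bs, which is the blow-up with longer sequences that the new paragraph describes.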