add picotron code snippets
- dist/index.html +69 -12
- dist/style.css +0 -1
- src/index.html +69 -12
- src/style.css +0 -1
dist/index.html
CHANGED
@@ -474,13 +474,9 @@
 
 <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
 
-<p>This involves our first “distributed communication” primitive: <em><strong>all-reduce</em></strong> which handles the synchronization and communication between GPU instances and nodes.</p>
-
 <aside>If you are not familiar with distributed communications patterns like broadcast, gather or all-reduce we put together a small crash course in the Appendix [TODO Link].</aside>
 
-<p>
-
-<p>TODO: embed bucket DP: <a href="https://github.com/huggingface/picotron/blob/0035cce0e04afd6192763b11efe50010d8ad0f71/picotron/data_parallel/data_parallel.py#L62-L171">https://github.com/huggingface/picotron/blob/0035cce0e04afd6192763b11efe50010d8ad0f71/picotron/data_parallel/data_parallel.py#L62-L171</a></p>
+<p>This involves our first “distributed communication” primitive: <em><strong>all-reduce</em></strong> which handles the synchronization and communication between GPU instances and nodes.</p>
 
 <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
 
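The all-reduce primitive that the re-ordered paragraph above introduces is, at its core, just an averaging of each gradient tensor across all data-parallel ranks. A minimal sketch of that step using plain `torch.distributed` (simplified for illustration; this is not the picotron code referenced in the diff):

```python
# Minimal illustration of the all-reduce gradient synchronization used in data parallelism.
# Assumes torch.distributed has already been initialized (e.g. via init_process_group).
import torch
import torch.distributed as dist

def sync_gradients(model: torch.nn.Module) -> None:
    """Average every parameter's gradient across all data-parallel ranks after backward()."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum the gradient tensors contributed by every rank, in place...
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            # ...then divide so all ranks end up with the same averaged gradient.
            param.grad /= world_size
```

Called once per step between `loss.backward()` and `optimizer.step()`, this is the naive version that the overlap and bucketing changes further down in this diff improve upon.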
@@ -510,7 +506,18 @@
 
 <p><img alt="image.png" src="/assets/images/placeholder.png"/></p>
 
-<p>Overlapping computation and communication reduces the time spent waiting for gradient synchronization across the entire model. Gradient synchronization can occur (at least partially) in parallel with backward pass, significantly speeding up data parallelism.
+<p>Overlapping computation and communication reduces the time spent waiting for gradient synchronization across the entire model. Gradient synchronization can occur (at least partially) in parallel with backward pass, significantly speeding up data parallelism. Here's a full implementation of naive DP with synchronization overlap:</p>
+
+<details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
+<summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
+👉 Naive DP implementation with overlap in Picotron (Click to expand)
+</summary>
+<div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
+<script
+src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F0035cce0e04afd6192763b11efe50010d8ad0f71%2Fpicotron%2Fdata_parallel%2Fdata_parallel.py%23L10-L60&style=github&type=code&showBorder=off&showLineNumbers=on&showFileMeta=on&showCopy=on&showFullPath=on">
+</script>
+</div>
+</details>
 
 <p>This is our first example of “<em>overlapping computation and communication</em>” which we will discuss several times in this blog post and is an essential technique to maximal scaling efficiency. Let's have a look how we can further improve the DP efficiency!</p>
 
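The `data_parallel.py#L10-L60` snippet embedded above is picotron's take on this; as a rough sketch of the same pattern (assuming PyTorch 2.1+ for `register_post_accumulate_grad_hook`, and not mirroring the actual picotron class), the overlap comes from launching an asynchronous all-reduce from a hook that fires as soon as each parameter's gradient is ready:

```python
# Simplified sketch of overlapping gradient communication with the backward pass.
# Assumes torch.distributed is initialized and PyTorch >= 2.1; not the picotron implementation.
import torch
import torch.distributed as dist

class NaiveOverlapDP:
    def __init__(self, model: torch.nn.Module):
        self.model = model
        self.handles = []  # pending async all-reduce work handles
        for param in model.parameters():
            if param.requires_grad:
                # Fires as soon as this parameter's gradient has been accumulated,
                # while backward() is still computing gradients for earlier layers.
                param.register_post_accumulate_grad_hook(self._sync_grad)

    def _sync_grad(self, param: torch.Tensor) -> None:
        param.grad /= dist.get_world_size()
        # async_op=True returns immediately, so communication overlaps with compute.
        handle = dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, async_op=True)
        self.handles.append(handle)

    def wait_for_grad_sync(self) -> None:
        # Call after loss.backward() and before optimizer.step().
        for handle in self.handles:
            handle.wait()
        self.handles.clear()
```

While the backward pass is still producing gradients for earlier layers, the collectives for later layers are already in flight; `wait_for_grad_sync()` is the only point where we block.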
@@ -519,6 +526,18 @@
 
 <p>We can even go further with optimizing DP. For a given number of parameters to synchronize, GPU operations like collective communications are often more efficient when performing few calls on large tensors rather than many calls on smaller tensors. Therefore, instead of performing independent all-reduce for each gradient, we can group gradients into buckets and launch a single all-reduce for all the gradients within the same bucket. Think of it like packing items into boxes before shipping—it's more efficient to send a few big boxes than many small ones. By performing a single all-reduce operation for each bucket, we can significantly reduce communication overhead and speed up the communication operation.</p>
 
+<p>Here's the code implementation with bucketing:</p>
+
+<details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
+<summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
+👉 Bucket DP implementation in Picotron (Click to expand)
+</summary>
+<div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
+<script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F0035cce0e04afd6192763b11efe50010d8ad0f71%2Fpicotron%2Fdata_parallel%2Fdata_parallel.py%23L62-L171&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on">
+</script>
+</div>
+</details>
+
 <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
 
 <h4><strong>Third optimization: </strong>Interplay with gradient accumulation</h4>
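For a concrete picture of what the embedded bucket-DP code is doing, here is a stripped-down sketch of the bucketing idea (helper names are made up for this example, the calls are kept synchronous for clarity, and picotron additionally overlaps these collectives with the backward pass):

```python
# Bare-bones sketch of gradient bucketing: pack many small gradients into one flat
# buffer and launch a single all-reduce per bucket instead of one per parameter.
import torch
import torch.distributed as dist

def build_buckets(params, bucket_size_mb: float = 25.0):
    """Greedily group parameters into buckets of roughly bucket_size_mb each."""
    buckets, current, current_bytes = [], [], 0
    max_bytes = int(bucket_size_mb * 1024 * 1024)
    for p in params:
        if p.requires_grad:
            current.append(p)
            current_bytes += p.numel() * p.element_size()
            if current_bytes >= max_bytes:
                buckets.append(current)
                current, current_bytes = [], 0
    if current:
        buckets.append(current)
    return buckets

def all_reduce_bucketed(buckets) -> None:
    world_size = dist.get_world_size()
    for bucket in buckets:
        # One flat tensor -> one collective call instead of len(bucket) calls.
        flat = torch.cat([p.grad.flatten() for p in bucket])
        dist.all_reduce(flat, op=dist.ReduceOp.SUM)
        flat /= world_size
        # Copy the averaged gradients back into each parameter's .grad.
        offset = 0
        for p in bucket:
            n = p.grad.numel()
            p.grad.copy_(flat[offset:offset + n].view_as(p.grad))
            offset += n
```

The bucket size is the knob to tune: larger buckets mean fewer collective calls, but each bucket can only start communicating once all of its gradients are ready.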
@@ -749,12 +768,36 @@
 
 <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
 
+<p>Here's the code implementation of column wise tensor parallelism:</p>
+
+<details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
+<summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
+👉 Column parallel TP implementation in Picotron (Click to expand)
+</summary>
+<div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
+<script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F1004ae37b87887cde597c9060fb067faa060bafe%2Fpicotron%2Ftensor_parallel%2Ftensor_parallel.py%23L54-L123&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script>
+</div>
+</details>
+
 <p>The second option is called row-wise sharding (also called <strong><em>row-linear</em></strong>): As the attentive reader might guess, row-linear means that we split the weight matrix into chunks of rows. However, this also requires us to split the inputs, which needs a <strong><em>scatter</em></strong> operation rather than a broadcast as used in column-linear sharding. The results on each worker are already in the right shape but need to be summed for the final result, thus requiring an all-reduce operation in this scenario.</p>
 
 <p>We see here our fourth distributed primitive: <strong><em>scatter</em></strong>!</p>
 
 <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
 
+<p>Here's the implementation for row-wise tensor parallelism:</p>
+
+<details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
+<summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
+👉 Row parallel TP implementation in Picotron (Click to expand)
+</summary>
+<div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
+<script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F1004ae37b87887cde597c9060fb067faa060bafe%2Fpicotron%2Ftensor_parallel%2Ftensor_parallel.py%23L125-L189&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script>
+</div>
+</details>
+
+<p>Now that we have the basic building blocks of TP, let's have a look at how we can effectively combine them inside a transformer layer!</p>
+
 <h3>Tensor Parallelism in a Transformer Block</h3>
 
 <p>To come up with a strategy to follow, let’s move from a toy example to a real model building block. A Transformer model is made of two main building blocks : Feedforward layers (MLP) and Multi-Head Attention (MHA). We can apply tensor parallelism to both.</p>
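As a toy illustration of the two sharding schemes the embedded `tensor_parallel.py` code implements, the forward paths can be sketched as below (conceptual only: process groups, biases, output gathering, and the backward-pass communication that a real implementation wraps in custom autograd functions are all omitted):

```python
# Forward-only sketch of column-linear and row-linear sharding for a Linear layer.
# Assumes the default torch.distributed process group spans the tensor-parallel ranks.
import torch
import torch.nn as nn
import torch.distributed as dist

class ColumnParallelLinear(nn.Module):
    """Each rank holds a slice of the weight's output columns; the input is replicated."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        tp = dist.get_world_size()
        assert out_features % tp == 0
        self.weight = nn.Parameter(torch.empty(out_features // tp, in_features))
        nn.init.normal_(self.weight, std=0.02)

    def forward(self, x):
        # x is broadcast/replicated on every rank; each rank computes its slice of the output.
        return x @ self.weight.t()

class RowParallelLinear(nn.Module):
    """Each rank holds a slice of the weight's input rows; the input arrives already sharded."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        tp = dist.get_world_size()
        assert in_features % tp == 0
        self.weight = nn.Parameter(torch.empty(out_features, in_features // tp))
        nn.init.normal_(self.weight, std=0.02)

    def forward(self, x_shard):
        # x_shard is the scattered slice of the input held by this rank.
        partial = x_shard @ self.weight.t()
        # Partial results already have the right shape but must be summed across ranks.
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)
        return partial
```

Column-linear keeps the input replicated and shards the output, while row-linear consumes an already-sharded input and needs the final all-reduce to sum the partial results.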
@@ -924,10 +967,6 @@
 </tr>
 </tbody>
 </table>
-
-<p>You can find an example of implementation of both column and row linear TP in picotron:
-
-<a href="https://github.com/huggingface/picotron/blob/main/picotron/tensor_parallel/tensor_parallel.py">https://github.com/huggingface/picotron/blob/main/picotron/tensor_parallel/tensor_parallel.py</a> </p>
 
 <p>By using sequence parallelism, we can achieve even greater activation memory savings, allowing us to push our batch size and sequence length further than what would be possible with tensor parallelism alone. Let's see what that means for our previous 70B model example:</p>
 
@@ -1102,8 +1141,17 @@
 
 <p>The above schedule is called the <strong><em>all-forward-all-backward (AFAB)</em></strong> schedule as we first do all forward passes and then only all-backward passes. The advantage is that forward and backward steps are still generally sequential and so preserving the general order of model training. This make this option rather simple to implement.</p>
 
-<p>You can find the full implementation of the AFAB pipeline in picotron
+<p>You can find the full implementation of the AFAB pipeline in picotron:</p>
 
+<details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
+<summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
+👉 AFAB PP implementation in Picotron (Click to expand)
+</summary>
+<div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
+<script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F0035cce0e04afd6192763b11efe50010d8ad0f71%2Fpicotron%2Fpipeline_parallel%2Fpipeline_parallel.py%23L54-L83&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script>
+</div>
+</details>
+
 <p>Let’s estimate the bubble in this example. The difference with our first example is that the ideal time to process <d-math>m</d-math> microbatches is now <d-math>t_{id} = m*(t_f+t_b)</d-math>:</p>
 
 <d-math block>
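The hunk above ends right where the bubble estimate begins, so a quick numerical check may help. Assuming (as in the post's earlier example) that the bubble still costs (p - 1)*(t_f + t_b) while the ideal time is t_id = m*(t_f + t_b), the relative overhead reduces to (p - 1)/m:

```python
# Quick numeric check of the AFAB bubble estimate (illustrative values only).
# Assumes bubble time = (p - 1) * (t_f + t_b) and ideal time = m * (t_f + t_b),
# so the ratio of wasted to useful time is (p - 1) / m.
def afab_bubble_fraction(p_stages: int, m_microbatches: int) -> float:
    return (p_stages - 1) / m_microbatches

for m in (4, 8, 32):
    print(f"p=8 stages, m={m:>2} microbatches -> bubble / ideal = {afab_bubble_fraction(8, m):.3f}")
# p=8 stages, m= 4 microbatches -> bubble / ideal = 1.750
# p=8 stages, m= 8 microbatches -> bubble / ideal = 0.875
# p=8 stages, m=32 microbatches -> bubble / ideal = 0.219
```

Raising the number of microbatches is therefore the main lever for shrinking the AFAB bubble, at the cost of keeping more activations alive, which is what the 1F1B schedule below addresses.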
@@ -1132,8 +1180,17 @@
 
 <p>Here is the example training loop from the above gist:</p>
 
-<p>You can find the full implementation in picotron as well
+<p>You can find the full implementation in picotron as well:</p>
 
+<details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
+<summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
+👉 1F1B PP implementation in Picotron (Click to expand)
+</summary>
+<div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
+<script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F0035cce0e04afd6192763b11efe50010d8ad0f71%2Fpicotron%2Fpipeline_parallel%2Fpipeline_parallel.py%23L85-L145&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script>
+</div>
+</details>
+
 <p>So reordering a bit the computations helped a lot improving the memory pressure from activations. Could we get even better performance with more intricate schedules? Yes!</p>
 
 <h3>Interleaving stages</h3>
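To make the 1F1B ordering that the embedded `pipeline_parallel.py#L85-L145` code implements more tangible, here is a schedule-only sketch (no model or communication, just the order of forward and backward steps a given stage runs):

```python
# Sketch of the per-stage operation order under the 1F1B (one-forward-one-backward) schedule:
# a few warmup forwards, a steady state alternating forward and backward, then cooldown backwards.
def one_f_one_b_schedule(rank: int, num_stages: int, num_microbatches: int) -> list[str]:
    warmup = min(num_stages - rank - 1, num_microbatches)
    remaining = num_microbatches - warmup
    ops = [f"F{i}" for i in range(warmup)]                       # warmup forwards
    for i in range(remaining):                                   # steady state: 1F1B pairs
        ops.append(f"F{warmup + i}")
        ops.append(f"B{i}")
    ops += [f"B{remaining + i}" for i in range(warmup)]          # cooldown backwards
    return ops

for rank in range(4):
    schedule = " ".join(one_f_one_b_schedule(rank, num_stages=4, num_microbatches=6))
    print(f"stage {rank}: {schedule}")
# e.g. stage 0: F0 F1 F2 F3 B0 F4 B1 F5 B2 B3 B4 B5
#      stage 3: F0 B0 F1 B1 F2 B2 F3 B3 F4 B4 F5 B5
```

Each stage now has at most `num_stages - rank` microbatches in flight instead of holding activations for all of them as in AFAB, which is where the memory relief mentioned in the surrounding text comes from.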
dist/style.css
CHANGED
@@ -20,7 +20,6 @@
   margin-top: 0px;
   padding: 0px;
 }
-
 .plotly_caption {
   font-style: italic;
   margin-top: 10px;
src/index.html
CHANGED
(Identical changes to those shown above for dist/index.html.)
src/style.css
CHANGED
(Identical change to the one shown above for dist/style.css.)