lvwerra HF staff committed on
Commit
2a9ca3d
·
1 Parent(s): f0752ea

add picotron code snippets

Files changed (4)
  1. dist/index.html +69 -12
  2. dist/style.css +0 -1
  3. src/index.html +69 -12
  4. src/style.css +0 -1
dist/index.html CHANGED
@@ -474,13 +474,9 @@
474
 
475
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
476
 
477
- <p>This involves our first “distributed communication” primitive: <em><strong>all-reduce</em></strong> which handles the synchronization and communication between GPU instances and nodes.</p>
478
-
479
 <aside>If you are not familiar with distributed communication patterns like broadcast, gather or all-reduce, we put together a small crash course in the Appendix [TODO Link].</aside>
480
 
481
- <p>TODO: embed naive DP: <a href="https://github.com/huggingface/picotron/blob/0035cce0e04afd6192763b11efe50010d8ad0f71/picotron/data_parallel/data_parallel.py#L10-L60">https://github.com/huggingface/picotron/blob/0035cce0e04afd6192763b11efe50010d8ad0f71/picotron/data_parallel/data_parallel.py#L10-L60</a></p>
482
-
483
- <p>TODO: embed bucket DP: <a href="https://github.com/huggingface/picotron/blob/0035cce0e04afd6192763b11efe50010d8ad0f71/picotron/data_parallel/data_parallel.py#L62-L171">https://github.com/huggingface/picotron/blob/0035cce0e04afd6192763b11efe50010d8ad0f71/picotron/data_parallel/data_parallel.py#L62-L171</a></p>
484
 
485
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
486
 
@@ -510,7 +506,18 @@
510
 
511
  <p><img alt="image.png" src="/assets/images/placeholder.png"/></p>
512
 
513
- <p>Overlapping computation and communication reduces the time spent waiting for gradient synchronization across the entire model. Gradient synchronization can occur (at least partially) in parallel with backward pass, significantly speeding up data parallelism. </p>
 
 
 
 
 
 
 
 
 
 
 
514
 
515
 <p>This is our first example of “<em>overlapping computation and communication</em>”, which we will discuss several times in this blog post and which is an essential technique for maximizing scaling efficiency. Let's have a look at how we can further improve DP efficiency!</p>
516
 
@@ -519,6 +526,18 @@
519
 
520
 <p>We can even go further with optimizing DP. For a given number of parameters to synchronize, GPU operations like collective communications are often more efficient when performing a few calls on large tensors rather than many calls on smaller tensors. Therefore, instead of performing an independent all-reduce for each gradient, we can group gradients into buckets and launch a single all-reduce for all the gradients within the same bucket. Think of it like packing items into boxes before shipping: it's more efficient to send a few big boxes than many small ones. By performing a single all-reduce operation per bucket, we can significantly reduce communication overhead and speed up gradient synchronization.</p>
521
 
 
 
 
 
 
 
 
 
 
 
 
 
522
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
523
 
524
  <h4><strong>Third optimization: </strong>Interplay with gradient accumulation</h4>
@@ -749,12 +768,36 @@
749
 
750
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
751
 
 
 
 
 
 
 
 
 
 
 
 
752
  <p>The second option is called row-wise sharding (also called <strong><em>row-linear</em></strong>): As the attentive reader might guess, row-linear means that we split the weight matrix into chunks of rows. However, this also requires us to split the inputs, which needs a <strong><em>scatter</em></strong> operation rather than a broadcast as used in column-linear sharding. The results on each worker are already in the right shape but need to be summed for the final result, thus requiring an all-reduce operation in this scenario.</p>
753
 
754
  <p>We see here our fourth distributed primitive: <strong><em>scatter</em></strong>!</p>
755
 
756
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
757
 
 
 
 
 
 
 
 
 
 
 
 
 
 
758
  <h3>Tensor Parallelism in a Transformer Block</h3>
759
 
760
 <p>To come up with a strategy to follow, let’s move from a toy example to a real model building block. A Transformer model is made of two main building blocks: Feedforward layers (MLP) and Multi-Head Attention (MHA). We can apply tensor parallelism to both.</p>
@@ -924,10 +967,6 @@
924
  </tr>
925
  </tbody>
926
  </table>
927
-
928
- <p>You can find an example of implementation of both column and row linear TP in picotron:
929
-
930
- <a href="https://github.com/huggingface/picotron/blob/main/picotron/tensor_parallel/tensor_parallel.py">https://github.com/huggingface/picotron/blob/main/picotron/tensor_parallel/tensor_parallel.py</a> </p>
931
 
932
  <p>By using sequence parallelism, we can achieve even greater activation memory savings, allowing us to push our batch size and sequence length further than what would be possible with tensor parallelism alone. Let's see what that means for our previous 70B model example:</p>
933
 
@@ -1102,8 +1141,17 @@
1102
 
1103
 <p>The above schedule is called the <strong><em>all-forward-all-backward (AFAB)</em></strong> schedule as we first do all the forward passes and then all the backward passes. The advantage is that forward and backward steps remain generally sequential, which preserves the general order of model training. This makes this option rather simple to implement.</p>
1104
 
1105
- <p>You can find the full implementation of the AFAB pipeline in picotron: https://github.com/huggingface/picotron/blob/0035cce0e04afd6192763b11efe50010d8ad0f71/picotron/pipeline_parallel/pipeline_parallel.py#L54-L83</p>
1106
 
 
 
 
 
 
 
 
 
 
1107
  <p>Let’s estimate the bubble in this example. The difference with our first example is that the ideal time to process <d-math>m</d-math> microbatches is now <d-math>t_{id} = m*(t_f+t_b)</d-math>:</p>
1108
 
1109
  <d-math block>
@@ -1132,8 +1180,17 @@
1132
 
1133
  <p>Here is the example training loop from the above gist:</p>
1134
 
1135
- <p>You can find the full implementation in picotron as well: https://github.com/huggingface/picotron/blob/0035cce0e04afd6192763b11efe50010d8ad0f71/picotron/pipeline_parallel/pipeline_parallel.py#L85-L145</p>
1136
 
 
 
 
 
 
 
 
 
 
1137
 <p>So reordering the computations a bit helped a lot in reducing the memory pressure from activations. Could we get even better performance with more intricate schedules? Yes!</p>
1138
 
1139
  <h3>Interleaving stages</h3>
 
474
 
475
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
476
 
 
 
477
 <aside>If you are not familiar with distributed communication patterns like broadcast, gather or all-reduce, we put together a small crash course in the Appendix [TODO Link].</aside>
478
 
479
+ <p>This involves our first “distributed communication” primitive: <em><strong>all-reduce</strong></em>, which handles the synchronization and communication between GPU instances and nodes.</p>
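<p>To make the primitive concrete, here is a minimal, self-contained example of an all-reduce with <code>torch.distributed</code> (an illustrative sketch, not picotron code): every rank contributes its own tensor and, after the call, every rank holds the element-wise sum.</p>

<pre><code class="language-python">
# Minimal all-reduce demo (illustrative only, not picotron code).
# Run with: torchrun --nproc_per_node=2 all_reduce_demo.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="gloo")  # use "nccl" when running on GPUs
rank = dist.get_rank()

# Each rank starts with a different tensor ...
x = torch.full((4,), float(rank + 1))

# ... and after the all-reduce every rank holds the sum over all ranks.
dist.all_reduce(x, op=dist.ReduceOp.SUM)
print(f"rank {rank}: {x.tolist()}")

dist.destroy_process_group()
</code></pre>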
 
 
480
 
481
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
482
 
 
506
 
507
  <p><img alt="image.png" src="/assets/images/placeholder.png"/></p>
508
 
509
+ <p>Overlapping computation and communication reduces the time spent waiting for gradient synchronization across the entire model. Gradient synchronization can occur (at least partially) in parallel with the backward pass, significantly speeding up data parallelism. Here's a full implementation of naive DP with synchronization overlap:</p>
510
+
511
+ <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
512
+ <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
513
+ 👉 Naive DP implementation with overlap in Picotron (Click to expand)
514
+ </summary>
515
+ <div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
516
+ <script
517
+ src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F0035cce0e04afd6192763b11efe50010d8ad0f71%2Fpicotron%2Fdata_parallel%2Fdata_parallel.py%23L10-L60&style=github&type=code&showBorder=off&showLineNumbers=on&showFileMeta=on&showCopy=on&showFullPath=on">
518
+ </script>
519
+ </div>
520
+ </details>
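<p>The core trick in the implementation above can be summarized in a short, simplified sketch (ours, not the picotron code; it assumes PyTorch 2.1 or newer for <code>register_post_accumulate_grad_hook</code>, and the class and method names are hypothetical): attach a hook to each parameter that launches an asynchronous all-reduce as soon as its gradient is accumulated, and only wait on the handles right before the optimizer step.</p>

<pre><code class="language-python">
# Simplified sketch of overlapping gradient all-reduce with the backward pass.
# Illustrative only; requires PyTorch 2.1+ for register_post_accumulate_grad_hook.
import torch
import torch.distributed as dist

class NaiveOverlapDP(torch.nn.Module):  # hypothetical name, not a picotron class
    def __init__(self, module):
        super().__init__()
        self.module = module
        self.handles = []
        for p in self.module.parameters():
            if p.requires_grad:
                # Launch communication as soon as this parameter's gradient is ready.
                p.register_post_accumulate_grad_hook(self._all_reduce_grad)

    def _all_reduce_grad(self, param):
        param.grad.div_(dist.get_world_size())  # pre-divide so the sum is an average
        handle = dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, async_op=True)
        self.handles.append(handle)

    def forward(self, *args, **kwargs):
        return self.module(*args, **kwargs)

    def wait_for_grad_sync(self):
        # Call after loss.backward() and before optimizer.step().
        for handle in self.handles:
            handle.wait()
        self.handles.clear()
</code></pre>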
521
 
522
 <p>This is our first example of “<em>overlapping computation and communication</em>”, which we will discuss several times in this blog post and which is an essential technique for maximizing scaling efficiency. Let's have a look at how we can further improve DP efficiency!</p>
523
 
 
526
 
527
  <p>We can even go further with optimizing DP. For a given number of parameters to synchronize, GPU operations like collective communications are often more efficient when performing few calls on large tensors rather than many calls on smaller tensors. Therefore, instead of performing independent all-reduce for each gradient, we can group gradients into buckets and launch a single all-reduce for all the gradients within the same bucket. Think of it like packing items into boxes before shipping—it's more efficient to send a few big boxes than many small ones. By performing a single all-reduce operation for each bucket, we can significantly reduce communication overhead and speed up the communication operation.</p>
528
 
529
+ <p>Here's the code implementation with bucketing:</p>
530
+
531
+ <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
532
+ <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
533
+ 👉 Bucket DP implementation in Picotron (Click to expand)
534
+ </summary>
535
+ <div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
536
+ <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F0035cce0e04afd6192763b11efe50010d8ad0f71%2Fpicotron%2Fdata_parallel%2Fdata_parallel.py%23L62-L171&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on">
537
+ </script>
538
+ </div>
539
+ </details>
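<p>The packing idea itself fits in a few lines; here is a toy helper (ours, not the picotron code) that flattens a group of gradients into one buffer, all-reduces that single buffer, and copies the averaged values back:</p>

<pre><code class="language-python">
# Toy illustration of gradient bucketing (not the picotron implementation).
import torch
import torch.distributed as dist

def all_reduce_bucket(grads):
    """All-reduce a list of gradient tensors with a single collective call."""
    # Pack: flatten every gradient into one contiguous buffer.
    flat = torch.cat([g.reshape(-1) for g in grads])
    # One all-reduce for the whole bucket instead of one call per gradient.
    dist.all_reduce(flat, op=dist.ReduceOp.SUM)
    flat.div_(dist.get_world_size())
    # Unpack: copy the averaged values back into the original gradient tensors.
    offset = 0
    for g in grads:
        g.copy_(flat[offset : offset + g.numel()].view_as(g))
        offset += g.numel()
</code></pre>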
540
+
541
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
542
 
543
  <h4><strong>Third optimization: </strong>Interplay with gradient accumulation</h4>
 
768
 
769
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
770
 
771
+ <p>Here's the code implementation of column-wise tensor parallelism:</p>
772
+
773
+ <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
774
+ <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
775
+ 👉 Column parallel TP implementation in Picotron (Click to expand)
776
+ </summary>
777
+ <div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
778
+ <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F1004ae37b87887cde597c9060fb067faa060bafe%2Fpicotron%2Ftensor_parallel%2Ftensor_parallel.py%23L54-L123&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script>
779
+ </div>
780
+ </details>
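<p>To make the pattern explicit, here is a stripped-down, forward-only sketch of a column-parallel linear layer (illustrative only; the class is hypothetical and a real implementation additionally wraps the collectives in custom autograd functions so gradients flow correctly):</p>

<pre><code class="language-python">
# Simplified forward-only sketch of a column-parallel linear layer (illustrative).
import torch
import torch.distributed as dist

class ColumnParallelLinear(torch.nn.Module):  # hypothetical minimal version
    def __init__(self, in_features, out_features):
        super().__init__()
        self.world_size = dist.get_world_size()
        assert out_features % self.world_size == 0
        # Each rank only stores its own slice of the weight's output columns.
        self.out_per_rank = out_features // self.world_size
        self.weight = torch.nn.Parameter(torch.empty(self.out_per_rank, in_features))
        torch.nn.init.normal_(self.weight, std=0.02)

    def forward(self, x):
        # The input is replicated on every rank; each rank computes its output shard.
        local_out = torch.nn.functional.linear(x, self.weight)
        # Gather the shards along the hidden dimension to rebuild the full output
        # (in practice the gather is often skipped when a row-parallel layer follows).
        shards = [torch.empty_like(local_out) for _ in range(self.world_size)]
        dist.all_gather(shards, local_out)
        return torch.cat(shards, dim=-1)
</code></pre>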
781
+
782
  <p>The second option is called row-wise sharding (also called <strong><em>row-linear</em></strong>): As the attentive reader might guess, row-linear means that we split the weight matrix into chunks of rows. However, this also requires us to split the inputs, which needs a <strong><em>scatter</em></strong> operation rather than a broadcast as used in column-linear sharding. The results on each worker are already in the right shape but need to be summed for the final result, thus requiring an all-reduce operation in this scenario.</p>
783
 
784
  <p>We see here our fourth distributed primitive: <strong><em>scatter</em></strong>!</p>
785
 
786
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
787
 
788
+ <p>Here's the implementation for row-wise tensor parallelism:</p>
789
+
790
+ <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
791
+ <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
792
+ 👉 Row parallel TP implementation in Picotron (Click to expand)
793
+ </summary>
794
+ <div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
795
+ <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F1004ae37b87887cde597c9060fb067faa060bafe%2Fpicotron%2Ftensor_parallel%2Ftensor_parallel.py%23L125-L189&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script>
796
+ </div>
797
+ </details>
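<p>And the mirror-image sketch for the row-parallel case (again illustrative only and forward-only; the real version above also takes care of the backward pass):</p>

<pre><code class="language-python">
# Simplified forward-only sketch of a row-parallel linear layer (illustrative).
import torch
import torch.distributed as dist

class RowParallelLinear(torch.nn.Module):  # hypothetical minimal version
    def __init__(self, in_features, out_features):
        super().__init__()
        self.world_size = dist.get_world_size()
        assert in_features % self.world_size == 0
        # Each rank only stores the weight slice matching its shard of input features.
        self.in_per_rank = in_features // self.world_size
        self.weight = torch.nn.Parameter(torch.empty(out_features, self.in_per_rank))
        torch.nn.init.normal_(self.weight, std=0.02)

    def forward(self, x_shard):
        # x_shard holds this rank's slice of the input features, e.g. produced by a
        # scatter or by a preceding column-parallel layer that skipped its gather.
        partial = torch.nn.functional.linear(x_shard, self.weight)
        # Partial results must be summed across ranks to obtain the final output.
        dist.all_reduce(partial, op=dist.ReduceOp.SUM)
        return partial
</code></pre>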
798
+
799
+ <p>Now that we have the basic building blocks of TP, let's have a look at how we can effectively combine them inside a transformer layer!</p>
800
+
801
  <h3>Tensor Parallelism in a Transformer Block</h3>
802
 
803
 <p>To come up with a strategy to follow, let’s move from a toy example to a real model building block. A Transformer model is made of two main building blocks: Feedforward layers (MLP) and Multi-Head Attention (MHA). We can apply tensor parallelism to both.</p>
 
967
  </tr>
968
  </tbody>
969
  </table>
 
 
 
 
970
 
971
  <p>By using sequence parallelism, we can achieve even greater activation memory savings, allowing us to push our batch size and sequence length further than what would be possible with tensor parallelism alone. Let's see what that means for our previous 70B model example:</p>
972
 
 
1141
 
1142
 <p>The above schedule is called the <strong><em>all-forward-all-backward (AFAB)</em></strong> schedule as we first do all the forward passes and then all the backward passes. The advantage is that forward and backward steps remain generally sequential, which preserves the general order of model training. This makes this option rather simple to implement.</p>
1143
 
1144
+ <p>You can find the full implementation of the AFAB pipeline in picotron:</p>
1145
 
1146
+ <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
1147
+ <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
1148
+ 👉 AFAB PP implementation in Picotron (Click to expand)
1149
+ </summary>
1150
+ <div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
1151
+ <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F0035cce0e04afd6192763b11efe50010d8ad0f71%2Fpicotron%2Fpipeline_parallel%2Fpipeline_parallel.py%23L54-L83&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script>
1152
+ </div>
1153
+ </details>
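<p>The schedule itself boils down to two loops; here is a schematic skeleton (ours, not the picotron code) where <code>forward_step</code> and <code>backward_step</code> are caller-supplied callables wrapping the per-microbatch compute and the pipeline send/recv logic:</p>

<pre><code class="language-python">
# Schematic AFAB skeleton (illustrative only): run every forward pass first while
# keeping the activations around, then run every backward pass in the same order.
def train_step_afab(microbatches, forward_step, backward_step):
    saved_states = []

    # Phase 1: all forward passes.
    for microbatch in microbatches:
        saved_states.append(forward_step(microbatch))

    # Phase 2: all backward passes. Note that every microbatch's activations stay
    # alive until here, which is exactly the memory cost of the AFAB schedule.
    for state in saved_states:
        backward_step(state)
</code></pre>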
1154
+
1155
  <p>Let’s estimate the bubble in this example. The difference with our first example is that the ideal time to process <d-math>m</d-math> microbatches is now <d-math>t_{id} = m*(t_f+t_b)</d-math>:</p>
1156
 
1157
  <d-math block>
 
1180
 
1181
  <p>Here is the example training loop from the above gist:</p>
1182
 
1183
+ <p>You can find the full 1F1B implementation in picotron as well:</p>
1184
 
1185
+ <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
1186
+ <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
1187
+ 👉 1F1B PP implementation in Picotron (Click to expand)
1188
+ </summary>
1189
+ <div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
1190
+ <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F0035cce0e04afd6192763b11efe50010d8ad0f71%2Fpicotron%2Fpipeline_parallel%2Fpipeline_parallel.py%23L85-L145&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script>
1191
+ </div>
1192
+ </details>
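<p>Structurally, 1F1B is a warmup of forward passes, a steady state alternating one forward and one backward, and a cooldown of the remaining backward passes. Here is a schematic skeleton (ours, not the picotron code, with the same caller-supplied <code>forward_step</code>/<code>backward_step</code> convention as above):</p>

<pre><code class="language-python">
# Schematic 1F1B skeleton (illustrative only).
def train_step_1f1b(microbatches, forward_step, backward_step, pp_rank, pp_size):
    num_microbatches = len(microbatches)
    # Usual warmup count: later pipeline stages need fewer warmup forwards.
    num_warmup = min(pp_size - pp_rank - 1, num_microbatches)
    pending = []  # activations waiting for their backward pass

    # Warmup: forward passes only, to fill the pipeline.
    for microbatch in microbatches[:num_warmup]:
        pending.append(forward_step(microbatch))

    # Steady state: one forward immediately followed by one backward.
    for microbatch in microbatches[num_warmup:]:
        pending.append(forward_step(microbatch))
        backward_step(pending.pop(0))

    # Cooldown: drain the remaining backward passes.
    while pending:
        backward_step(pending.pop(0))
</code></pre>

<p>Compared to AFAB, only a handful of microbatch activations (roughly the warmup count plus one) are alive at any moment, instead of all <d-math>m</d-math> of them.</p>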
1193
+
1194
 <p>So reordering the computations a bit helped a lot in reducing the memory pressure from activations. Could we get even better performance with more intricate schedules? Yes!</p>
1195
 
1196
  <h3>Interleaving stages</h3>
dist/style.css CHANGED
@@ -20,7 +20,6 @@
20
  margin-top: 0px;
21
  padding: 0px;
22
  }
23
-
24
  .plotly_caption {
25
  font-style: italic;
26
  margin-top: 10px;
 
20
  margin-top: 0px;
21
  padding: 0px;
22
  }
 
23
  .plotly_caption {
24
  font-style: italic;
25
  margin-top: 10px;
src/index.html CHANGED
@@ -474,13 +474,9 @@
474
 
475
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
476
 
477
- <p>This involves our first “distributed communication” primitive: <em><strong>all-reduce</em></strong> which handles the synchronization and communication between GPU instances and nodes.</p>
478
-
479
 <aside>If you are not familiar with distributed communication patterns like broadcast, gather or all-reduce, we put together a small crash course in the Appendix [TODO Link].</aside>
480
 
481
- <p>TODO: embed naive DP: <a href="https://github.com/huggingface/picotron/blob/0035cce0e04afd6192763b11efe50010d8ad0f71/picotron/data_parallel/data_parallel.py#L10-L60">https://github.com/huggingface/picotron/blob/0035cce0e04afd6192763b11efe50010d8ad0f71/picotron/data_parallel/data_parallel.py#L10-L60</a></p>
482
-
483
- <p>TODO: embed bucket DP: <a href="https://github.com/huggingface/picotron/blob/0035cce0e04afd6192763b11efe50010d8ad0f71/picotron/data_parallel/data_parallel.py#L62-L171">https://github.com/huggingface/picotron/blob/0035cce0e04afd6192763b11efe50010d8ad0f71/picotron/data_parallel/data_parallel.py#L62-L171</a></p>
484
 
485
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
486
 
@@ -510,7 +506,18 @@
510
 
511
  <p><img alt="image.png" src="/assets/images/placeholder.png"/></p>
512
 
513
- <p>Overlapping computation and communication reduces the time spent waiting for gradient synchronization across the entire model. Gradient synchronization can occur (at least partially) in parallel with backward pass, significantly speeding up data parallelism. </p>
 
 
 
 
 
 
 
 
 
 
 
514
 
515
 <p>This is our first example of “<em>overlapping computation and communication</em>”, which we will discuss several times in this blog post and which is an essential technique for maximizing scaling efficiency. Let's have a look at how we can further improve DP efficiency!</p>
516
 
@@ -519,6 +526,18 @@
519
 
520
 <p>We can even go further with optimizing DP. For a given number of parameters to synchronize, GPU operations like collective communications are often more efficient when performing a few calls on large tensors rather than many calls on smaller tensors. Therefore, instead of performing an independent all-reduce for each gradient, we can group gradients into buckets and launch a single all-reduce for all the gradients within the same bucket. Think of it like packing items into boxes before shipping: it's more efficient to send a few big boxes than many small ones. By performing a single all-reduce operation per bucket, we can significantly reduce communication overhead and speed up gradient synchronization.</p>
521
 
 
 
 
 
 
 
 
 
 
 
 
 
522
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
523
 
524
  <h4><strong>Third optimization: </strong>Interplay with gradient accumulation</h4>
@@ -749,12 +768,36 @@
749
 
750
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
751
 
 
 
 
 
 
 
 
 
 
 
 
752
  <p>The second option is called row-wise sharding (also called <strong><em>row-linear</em></strong>): As the attentive reader might guess, row-linear means that we split the weight matrix into chunks of rows. However, this also requires us to split the inputs, which needs a <strong><em>scatter</em></strong> operation rather than a broadcast as used in column-linear sharding. The results on each worker are already in the right shape but need to be summed for the final result, thus requiring an all-reduce operation in this scenario.</p>
753
 
754
  <p>We see here our fourth distributed primitive: <strong><em>scatter</em></strong>!</p>
755
 
756
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
757
 
 
 
 
 
 
 
 
 
 
 
 
 
 
758
  <h3>Tensor Parallelism in a Transformer Block</h3>
759
 
760
 <p>To come up with a strategy to follow, let’s move from a toy example to a real model building block. A Transformer model is made of two main building blocks: Feedforward layers (MLP) and Multi-Head Attention (MHA). We can apply tensor parallelism to both.</p>
@@ -924,10 +967,6 @@
924
  </tr>
925
  </tbody>
926
  </table>
927
-
928
- <p>You can find an example of implementation of both column and row linear TP in picotron:
929
-
930
- <a href="https://github.com/huggingface/picotron/blob/main/picotron/tensor_parallel/tensor_parallel.py">https://github.com/huggingface/picotron/blob/main/picotron/tensor_parallel/tensor_parallel.py</a> </p>
931
 
932
  <p>By using sequence parallelism, we can achieve even greater activation memory savings, allowing us to push our batch size and sequence length further than what would be possible with tensor parallelism alone. Let's see what that means for our previous 70B model example:</p>
933
 
@@ -1102,8 +1141,17 @@
1102
 
1103
 <p>The above schedule is called the <strong><em>all-forward-all-backward (AFAB)</em></strong> schedule as we first do all the forward passes and then all the backward passes. The advantage is that forward and backward steps remain generally sequential, which preserves the general order of model training. This makes this option rather simple to implement.</p>
1104
 
1105
- <p>You can find the full implementation of the AFAB pipeline in picotron: https://github.com/huggingface/picotron/blob/0035cce0e04afd6192763b11efe50010d8ad0f71/picotron/pipeline_parallel/pipeline_parallel.py#L54-L83</p>
1106
 
 
 
 
 
 
 
 
 
 
1107
  <p>Let’s estimate the bubble in this example. The difference with our first example is that the ideal time to process <d-math>m</d-math> microbatches is now <d-math>t_{id} = m*(t_f+t_b)</d-math>:</p>
1108
 
1109
  <d-math block>
@@ -1132,8 +1180,17 @@
1132
 
1133
  <p>Here is the example training loop from the above gist:</p>
1134
 
1135
- <p>You can find the full implementation in picotron as well: https://github.com/huggingface/picotron/blob/0035cce0e04afd6192763b11efe50010d8ad0f71/picotron/pipeline_parallel/pipeline_parallel.py#L85-L145</p>
1136
 
 
 
 
 
 
 
 
 
 
1137
 <p>So reordering the computations a bit helped a lot in reducing the memory pressure from activations. Could we get even better performance with more intricate schedules? Yes!</p>
1138
 
1139
  <h3>Interleaving stages</h3>
 
474
 
475
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
476
 
 
 
477
 <aside>If you are not familiar with distributed communication patterns like broadcast, gather or all-reduce, we put together a small crash course in the Appendix [TODO Link].</aside>
478
 
479
+ <p>This involves our first “distributed communication” primitive: <em><strong>all-reduce</strong></em>, which handles the synchronization and communication between GPU instances and nodes.</p>
 
 
480
 
481
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
482
 
 
506
 
507
  <p><img alt="image.png" src="/assets/images/placeholder.png"/></p>
508
 
509
+ <p>Overlapping computation and communication reduces the time spent waiting for gradient synchronization across the entire model. Gradient synchronization can occur (at least partially) in parallel with the backward pass, significantly speeding up data parallelism. Here's a full implementation of naive DP with synchronization overlap:</p>
510
+
511
+ <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
512
+ <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
513
+ 👉 Naive DP implementation with overlap in Picotron (Click to expand)
514
+ </summary>
515
+ <div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
516
+ <script
517
+ src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F0035cce0e04afd6192763b11efe50010d8ad0f71%2Fpicotron%2Fdata_parallel%2Fdata_parallel.py%23L10-L60&style=github&type=code&showBorder=off&showLineNumbers=on&showFileMeta=on&showCopy=on&showFullPath=on">
518
+ </script>
519
+ </div>
520
+ </details>
521
 
522
 <p>This is our first example of “<em>overlapping computation and communication</em>”, which we will discuss several times in this blog post and which is an essential technique for maximizing scaling efficiency. Let's have a look at how we can further improve DP efficiency!</p>
523
 
 
526
 
527
 <p>We can even go further with optimizing DP. For a given number of parameters to synchronize, GPU operations like collective communications are often more efficient when performing a few calls on large tensors rather than many calls on smaller tensors. Therefore, instead of performing an independent all-reduce for each gradient, we can group gradients into buckets and launch a single all-reduce for all the gradients within the same bucket. Think of it like packing items into boxes before shipping: it's more efficient to send a few big boxes than many small ones. By performing a single all-reduce operation per bucket, we can significantly reduce communication overhead and speed up gradient synchronization.</p>
528
 
529
+ <p>Here's the code implementation with bucketing:</p>
530
+
531
+ <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
532
+ <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
533
+ 👉 Bucket DP implementation in Picotron (Click to expand)
534
+ </summary>
535
+ <div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
536
+ <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F0035cce0e04afd6192763b11efe50010d8ad0f71%2Fpicotron%2Fdata_parallel%2Fdata_parallel.py%23L62-L171&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on">
537
+ </script>
538
+ </div>
539
+ </details>
540
+
541
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
542
 
543
  <h4><strong>Third optimization: </strong>Interplay with gradient accumulation</h4>
 
768
 
769
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
770
 
771
+ <p>Here's the code implementation of column-wise tensor parallelism:</p>
772
+
773
+ <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
774
+ <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
775
+ 👉 Column parallel TP implementation in Picotron (Click to expand)
776
+ </summary>
777
+ <div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
778
+ <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F1004ae37b87887cde597c9060fb067faa060bafe%2Fpicotron%2Ftensor_parallel%2Ftensor_parallel.py%23L54-L123&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script>
779
+ </div>
780
+ </details>
781
+
782
  <p>The second option is called row-wise sharding (also called <strong><em>row-linear</em></strong>): As the attentive reader might guess, row-linear means that we split the weight matrix into chunks of rows. However, this also requires us to split the inputs, which needs a <strong><em>scatter</em></strong> operation rather than a broadcast as used in column-linear sharding. The results on each worker are already in the right shape but need to be summed for the final result, thus requiring an all-reduce operation in this scenario.</p>
783
 
784
  <p>We see here our fourth distributed primitive: <strong><em>scatter</em></strong>!</p>
785
 
786
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
787
 
788
+ <p>Here's the implementation for row-wise tensor parallelism:</p>
789
+
790
+ <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
791
+ <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
792
+ 👉 Row parallel TP implementation in Picotron (Click to expand)
793
+ </summary>
794
+ <div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
795
+ <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F1004ae37b87887cde597c9060fb067faa060bafe%2Fpicotron%2Ftensor_parallel%2Ftensor_parallel.py%23L125-L189&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script>
796
+ </div>
797
+ </details>
798
+
799
+ <p>Now that we have the basic building blocks of TP, let's have a look at how we can effectively combine them inside a transformer layer!</p>
800
+
801
  <h3>Tensor Parallelism in a Transformer Block</h3>
802
 
803
 <p>To come up with a strategy to follow, let’s move from a toy example to a real model building block. A Transformer model is made of two main building blocks: Feedforward layers (MLP) and Multi-Head Attention (MHA). We can apply tensor parallelism to both.</p>
 
967
  </tr>
968
  </tbody>
969
  </table>
 
 
 
 
970
 
971
  <p>By using sequence parallelism, we can achieve even greater activation memory savings, allowing us to push our batch size and sequence length further than what would be possible with tensor parallelism alone. Let's see what that means for our previous 70B model example:</p>
972
 
 
1141
 
1142
 <p>The above schedule is called the <strong><em>all-forward-all-backward (AFAB)</em></strong> schedule as we first do all the forward passes and then all the backward passes. The advantage is that forward and backward steps remain generally sequential, which preserves the general order of model training. This makes this option rather simple to implement.</p>
1143
 
1144
+ <p>You can find the full implementation of the AFAB pipeline in picotron:</p>
1145
 
1146
+ <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
1147
+ <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
1148
+ 👉 AFAB PP implementation in Picotron (Click to expand)
1149
+ </summary>
1150
+ <div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
1151
+ <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F0035cce0e04afd6192763b11efe50010d8ad0f71%2Fpicotron%2Fpipeline_parallel%2Fpipeline_parallel.py%23L54-L83&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script>
1152
+ </div>
1153
+ </details>
1154
+
1155
  <p>Let’s estimate the bubble in this example. The difference with our first example is that the ideal time to process <d-math>m</d-math> microbatches is now <d-math>t_{id} = m*(t_f+t_b)</d-math>:</p>
1156
 
1157
  <d-math block>
 
1180
 
1181
  <p>Here is the example training loop from the above gist:</p>
1182
 
1183
+ <p>You can find the full 1F1B implementation in picotron as well:</p>
1184
 
1185
+ <details style="background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 6px; margin: 1em 0;">
1186
+ <summary style="padding: 12px; cursor: pointer; user-select: none; background: #f3f4f6; border-bottom: 1px solid #d0d7de;">
1187
+ 👉 1F1B PP implementation in Picotron (Click to expand)
1188
+ </summary>
1189
+ <div class="code-embed-container" style="margin: 0; border-radius: 0; overflow-x: scroll; width: max-content; min-width: 100%; font-size: 8px;"></div>
1190
+ <script src="https://emgithub.com/embed-v2.js?target=https%3A%2F%2Fgithub.com%2Fhuggingface%2Fpicotron%2Fblob%2F0035cce0e04afd6192763b11efe50010d8ad0f71%2Fpicotron%2Fpipeline_parallel%2Fpipeline_parallel.py%23L85-L145&style=github&type=code&showBorder=on&showLineNumbers=on&showFileMeta=on&showFullPath=on&showCopy=on"></script>
1191
+ </div>
1192
+ </details>
1193
+
1194
 <p>So reordering the computations a bit helped a lot in reducing the memory pressure from activations. Could we get even better performance with more intricate schedules? Yes!</p>
1195
 
1196
  <h3>Interleaving stages</h3>
src/style.css CHANGED
@@ -20,7 +20,6 @@
20
  margin-top: 0px;
21
  padding: 0px;
22
  }
23
-
24
  .plotly_caption {
25
  font-style: italic;
26
  margin-top: 10px;
 
20
  margin-top: 0px;
21
  padding: 0px;
22
  }
 
23
  .plotly_caption {
24
  font-style: italic;
25
  margin-top: 10px;