lvwerra (HF staff) committed
Commit 8d5f916 · 1 Parent(s): 2a9ca3d

remove old files

Files changed (3)
  1. blog-export-headrs.html +0 -192
  2. blog-export.html +0 -0
  3. blog-export.md +0 -0
blog-export-headrs.html DELETED
@@ -1,192 +0,0 @@
- <h2>The Ultra-Scale Playbook: Training LLMs on GPU Clusters</h2>
-
- <h2>TL;DR</h2>
-
- <h2>First Steps: Training on one GPU</h2>
-
- <h3>Memory usage in Transformers</h3>
-
- <h4>Memory profiling a training step</h4>
-
- <h4>Weights/grads/optimizer states memory</h4>
-
- <h4>Activations memory</h4>
-
- <h3><strong>Activation recomputation</strong></h3>
-
- <h3>Gradient accumulation</h3>
-
- <h2>Data Parallelism</h2>
-
- <h4><strong>First optimization:</strong> Overlap gradient synchronization with backward pass</h4>
-
- <h4><strong>Second optimization:</strong> Bucketing gradients</h4>
-
- <h4><strong>Third optimization: I</strong>nterplay with gradient accumulation</h4>
-
- <h3>Revisit global batch size</h3>
-
- <h3>Our journey up to now</h3>
-
- <h3>ZeRO (<strong>Ze</strong>ro <strong>R</strong>edundancy <strong>O</strong>ptimizer)</h3>
-
- <h4>Memory usage revisited</h4>
-
- <h4>ZeRO-1: Partitioning Optimizer States</h4>
-
- <h4>ZeRO-2: Adding <strong>Gradient Partitioning</strong></h4>
-
- <h4>ZeRO-3: Adding Parameter <strong>Partitioning</strong></h4>
-
- <h2>Tensor Parallelism</h2>
-
- <h3>Tensor Parallelism in a Transformer Block</h3>
-
- <h3>Sequence Parallelism</h3>
-
- <h2>Context Parallelism</h2>
-
- <h3>Introducing Context Parallelism</h3>
-
- <h3>Discovering Ring Attention</h3>
-
- <h3>Zig-Zag Ring Attention – A Balanced Compute Implementation</h3>
-
- <h2></h2>
-
- <h2>Pipeline Parallelism</h2>
-
- <h3>Splitting layers on various nodes - All forward, all backward</h3>
-
- <h3>One-forward-one-backward and LLama 3.1 schemes</h3>
-
- <h3>Interleaving stages</h3>
-
- <h3>Zero Bubble and DualPipe</h3>
-
- <h2>Expert parallelism</h2>
-
- <h2>5D parallelism in a nutshell</h2>
-
- <h2>How to Find the Best Training Configuration</h2>
-
- <h2>Diving in the GPUs – fusing, threading, mixing</h2>
-
- <h4>A primer on GPU</h4>
-
- <h3>How to improve performance with Kernels ?</h3>
-
- <h4>Memory Coalescing</h4>
-
- <h4>Tiling</h4>
-
- <h4>Thread Coarsening</h4>
-
- <h4>Minimizing Control Divergence</h4>
-
- <h3>Flash Attention 1-3</h3>
-
- <h3>Fused Kernels</h3>
-
- <h3>Mixed Precision Training</h3>
-
- <h4>FP16 and BF16 training</h4>
-
- <h4>FP8 pretraining</h4>
-
- <h2>Conclusion</h2>
-
- <h3>What you learned</h3>
-
- <h3>What we learned</h3>
-
- <h3>What’s next?</h3>
-
- <h2>References</h2>
-
- <h3>Landmark LLM Scaling Papers</h3>
-
- <h3>Training Frameworks</h3>
-
- <h3>Debugging</h3>
-
- <h3>Distribution Techniques</h3>
-
- <h3>CUDA Kernels</h3>
-
- <h3>Hardware</h3>
-
- <h3>Others</h3>
-
- <h2>Appendix</h2>
-
- <h3>A0: Parallel Programming Crash Course</h3>
-
- <h4>Broadcast</h4>
-
- <h4>Reduce &amp; AllReduce</h4>
-
- <h4><strong>A quick focus on Ring All-Reduce</strong></h4>
-
- <h4>Gather &amp; AllGather</h4>
-
- <h4>Scatter &amp; ReduceScatter</h4>
-
- <h4>Barrier</h4>
-
- <h4>NCCL: NVIDIA Collective Communications Library</h4>
-
- <h3>A1: Profiling</h3>
-
- <h4>Kernels</h4>
-
- <h2>Print a table of the profiling results, sorted by total CUDA time, limited to the top 10 entries</h2>
-
- <h2>include <torch/extension.h></h2>
-
- <h2>include <cuda.h></h2>
-
- <h2>include <cuda_runtime.h></h2>
-
- <h2>Load and compile the CUDA extension</h2>
-
- <h2>Define input tensors</h2>
-
- <h2>Run the CUDA kernel</h2>
-
- <h3>A2: TP Backward pass</h3>
-
- <h3>A3: ZeRO-R</h3>
-
- <h4>$P_a:$ Partitioned Activation Checkpointing</h4>
-
- <h4><strong>$C_B:$ Constant Size Buffers</strong></h4>
-
- <h4><strong>$M_D$: Memory Defragmentation</strong></h4>
-
- <h4>Communication Analysis of ZeRO-R</h4>
-
- <h3>A5. Memory profile</h3>
-
- <h2>Set up optimizer</h2>
-
- <h3>TP: Practical PyTorch Implementation</h3>
-
- <h2>This is the <code>f</code> function in the paper: https://arxiv.org/abs/1909.08053</h2>
-
- <h2>core logic of Column Parallel linear</h2>
-
- <h4>Gelu code</h4>
-
- <h4>Interconnect</h4>
-
- <h3>How to profile your code</h3>
-
- <h3>Formulas for compute / comms the balanhe balance</h3>
-
- <h3>Integrating Context Parallelism with TP/SP</h3>
-
- <h3>The nanotron FP8 recipe</h3>
-
- <h2>Overlapping computation and communication</h2>
-
 
blog-export.html DELETED
The diff for this file is too large to render. See raw diff
 
blog-export.md DELETED
The diff for this file is too large to render. See raw diff