remove old files
- blog-export-headrs.html +0 -192
- blog-export.html +0 -0
- blog-export.md +0 -0
blog-export-headrs.html
DELETED
@@ -1,192 +0,0 @@
-<h2>The Ultra-Scale Playbook: Training LLMs on GPU Clusters</h2>
-
-<h2>TL;DR</h2>
-
-<h2>First Steps: Training on one GPU</h2>
-
-<h3>Memory usage in Transformers</h3>
-
-<h4>Memory profiling a training step</h4>
-
-<h4>Weights/grads/optimizer states memory</h4>
-
-<h4>Activations memory</h4>
-
-<h3><strong>Activation recomputation</strong></h3>
-
-<h3>Gradient accumulation</h3>
-
-<h2>Data Parallelism</h2>
-
-<h4><strong>First optimization:</strong> Overlap gradient synchronization with backward pass</h4>
-
-<h4><strong>Second optimization:</strong> Bucketing gradients</h4>
-
-<h4><strong>Third optimization:</strong> Interplay with gradient accumulation</h4>
-
-<h3>Revisit global batch size</h3>
-
-<h3>Our journey up to now</h3>
-
-<h3>ZeRO (<strong>Ze</strong>ro <strong>R</strong>edundancy <strong>O</strong>ptimizer)</h3>
-
-<h4>Memory usage revisited</h4>
-
-<h4>ZeRO-1: Partitioning Optimizer States</h4>
-
-<h4>ZeRO-2: Adding <strong>Gradient Partitioning</strong></h4>
-
-<h4>ZeRO-3: Adding Parameter <strong>Partitioning</strong></h4>
-
-<h2>Tensor Parallelism</h2>
-
-<h3>Tensor Parallelism in a Transformer Block</h3>
-
-<h3>Sequence Parallelism</h3>
-
-<h2>Context Parallelism</h2>
-
-<h3>Introducing Context Parallelism</h3>
-
-<h3>Discovering Ring Attention</h3>
-
-<h3>Zig-Zag Ring Attention – A Balanced Compute Implementation</h3>
-
-<h2></h2>
-
-<h2>Pipeline Parallelism</h2>
-
-<h3>Splitting layers on various nodes - All forward, all backward</h3>
-
-<h3>One-forward-one-backward and Llama 3.1 schemes</h3>
-
-<h3>Interleaving stages</h3>
-
-<h3>Zero Bubble and DualPipe</h3>
-
-<h2>Expert parallelism</h2>
-
-<h2>5D parallelism in a nutshell</h2>
-
-<h2>How to Find the Best Training Configuration</h2>
-
-<h2>Diving in the GPUs – fusing, threading, mixing</h2>
-
-<h4>A primer on GPU</h4>
-
-<h3>How to improve performance with Kernels?</h3>
-
-<h4>Memory Coalescing</h4>
-
-<h4>Tiling</h4>
-
-<h4>Thread Coarsening</h4>
-
-<h4>Minimizing Control Divergence</h4>
-
-<h3>Flash Attention 1-3</h3>
-
-<h3>Fused Kernels</h3>
-
-<h3>Mixed Precision Training</h3>
-
-<h4>FP16 and BF16 training</h4>
-
-<h4>FP8 pretraining</h4>
-
-<h2>Conclusion</h2>
-
-<h3>What you learned</h3>
-
-<h3>What we learned</h3>
-
-<h3>What’s next?</h3>
-
-<h2>References</h2>
-
-<h3>Landmark LLM Scaling Papers</h3>
-
-<h3>Training Frameworks</h3>
-
-<h3>Debugging</h3>
-
-<h3>Distribution Techniques</h3>
-
-<h3>CUDA Kernels</h3>
-
-<h3>Hardware</h3>
-
-<h3>Others</h3>
-
-<h2>Appendix</h2>
-
-<h3>A0: Parallel Programming Crash Course</h3>
-
-<h4>Broadcast</h4>
-
-<h4>Reduce & AllReduce</h4>
-
-<h4><strong>A quick focus on Ring All-Reduce</strong></h4>
-
-<h4>Gather & AllGather</h4>
-
-<h4>Scatter & ReduceScatter</h4>
-
-<h4>Barrier</h4>
-
-<h4>NCCL: NVIDIA Collective Communications Library</h4>
-
-<h3>A1: Profiling</h3>
-
-<h4>Kernels</h4>
-
-<h2>Print a table of the profiling results, sorted by total CUDA time, limited to the top 10 entries</h2>
-
-<h2>include <torch/extension.h></h2>
-
-<h2>include <cuda.h></h2>
-
-<h2>include <cuda_runtime.h></h2>
-
-<h2>Load and compile the CUDA extension</h2>
-
-<h2>Define input tensors</h2>
-
-<h2>Run the CUDA kernel</h2>
-
-<h3>A2: TP Backward pass</h3>
-
-<h3>A3: ZeRO-R</h3>
-
-<h4>$P_a:$ Partitioned Activation Checkpointing</h4>
-
-<h4><strong>$C_B:$ Constant Size Buffers</strong></h4>
-
-<h4><strong>$M_D$: Memory Defragmentation</strong></h4>
-
-<h4>Communication Analysis of ZeRO-R</h4>
-
-<h3>A5: Memory profile</h3>
-
-<h2>Set up optimizer</h2>
-
-<h3>TP: Practical PyTorch Implementation</h3>
-
-<h2>This is the <code>f</code> function in the paper: https://arxiv.org/abs/1909.08053</h2>
-
-<h2>core logic of Column Parallel linear</h2>
-
-<h4>Gelu code</h4>
-
-<h4>Interconnect</h4>
-
-<h3>How to profile your code</h3>
-
-<h3>Formulas for compute / comms balance</h3>
-
-<h3>Integrating Context Parallelism with TP/SP</h3>
-
-<h3>The nanotron FP8 recipe</h3>
-
-<h2>Overlapping computation and communication</h2>
-
blog-export.html
DELETED
The diff for this file is too large to render.
See raw diff
blog-export.md
DELETED
The diff for this file is too large to render.
See raw diff