xrsrke/link_nanotron_fp8_appexdix

#21
by neuralink - opened
dist/bibliography.bib CHANGED
@@ -510,4 +510,10 @@ url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
   archivePrefix={arXiv},
   primaryClass={cs.LG},
   url={https://arxiv.org/abs/2309.14322},
+ }
+ @software{nanotronfp8,
+ title = {nanotron's FP8 implementation},
+ author = {nanotron},
+ url = {https://github.com/huggingface/nanotron/pull/70},
+ year = {2024}
  }
dist/index.html CHANGED
@@ -2215,7 +2215,7 @@
  </tbody>
  </table>
 
- <p>Overall, FP8 is still an experimental technique and methods are evolving, but will likely become the standard soon replacing bf16 mixed-precision. To follow public implementations of this, please head to the nanotron’s implementation in [TODO: link to appendix]. </p>
+ <p>Overall, FP8 is still an experimental technique and methods are evolving, but will likely become the standard soon replacing bf16 mixed-precision. To follow public implementations of this, please head to the nanotron’s implementation<d-cite bibtex-key="nanotronfp8"></d-cite>. </p>
 
  <p>In the future, Blackwell, the next generation of NVIDIA chips, <a href="https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/">have been announced </a> to support FP4 training, further speeding up training but without a doubt also introducing a new training stability challenge.</p>
 
@@ -2382,6 +2382,16 @@
  <p>Training language models across compute clusters with DiLoCo.</p>
  </div>
 
+ <div>
+ <a href="https://github.com/kakaobrain/torchgpipe"><strong>torchgpipe</strong></a>
+ <p>A GPipe implementation in PyTorch.</p>
+ </div>
+
+ <div>
+ <a href="https://github.com/EleutherAI/oslo"><strong>OSLO</strong></a>
+ <p>OSLO: Open Source for Large-scale Optimization.</p>
+ </div>
+
  <h3>Debugging</h3>
 
  <div>
@@ -2499,6 +2509,11 @@
  <a href="https://www.harmdevries.com/post/context-length/"><strong>Harm's blog for long context</strong></a>
  <p>Investigation into long context training in terms of data and training cost.</p>
  </div>
+
+ <div>
+ <a href="https://github.com/tunib-ai/large-scale-lm-tutorials"><strong>TunibAI's 3D parallelism tutorial</strong></a>
+ <p>Large-scale language modeling tutorials with PyTorch.</p>
+ </div>
 
  <h2>Appendix</h2>
 
src/bibliography.bib CHANGED
@@ -510,4 +510,10 @@ url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
   archivePrefix={arXiv},
   primaryClass={cs.LG},
   url={https://arxiv.org/abs/2309.14322},
+ }
+ @software{nanotronfp8,
+ title = {nanotron's FP8 implementation},
+ author = {nanotron},
+ url = {https://github.com/huggingface/nanotron/pull/70},
+ year = {2024}
  }
src/index.html CHANGED
@@ -2215,7 +2215,7 @@
  </tbody>
  </table>
 
- <p>Overall, FP8 is still an experimental technique and methods are evolving, but will likely become the standard soon replacing bf16 mixed-precision. To follow public implementations of this, please head to the nanotron’s implementation in [TODO: link to appendix]. </p>
+ <p>Overall, FP8 is still an experimental technique and methods are evolving, but will likely become the standard soon replacing bf16 mixed-precision. To follow public implementations of this, please head to the nanotron’s implementation<d-cite bibtex-key="nanotronfp8"></d-cite>. </p>
 
  <p>In the future, Blackwell, the next generation of NVIDIA chips, <a href="https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/">have been announced </a> to support FP4 training, further speeding up training but without a doubt also introducing a new training stability challenge.</p>
 
@@ -2381,6 +2381,16 @@
  <a href="https://github.com/PrimeIntellect-ai/OpenDiLoCo"><strong>DiLoco</strong></a>
  <p>Training language models across compute clusters with DiLoCo.</p>
  </div>
+
+ <div>
+ <a href="https://github.com/kakaobrain/torchgpipe"><strong>torchgpipe</strong></a>
+ <p>torchgpipe: On-the-fly Pipeline Parallelism for Training Giant Models.</p>
+ </div>
+
+ <div>
+ <a href="https://github.com/EleutherAI/oslo"><strong>OSLO</strong></a>
+ <p>OSLO: Open Source for Large-scale Optimization.</p>
+ </div>
 
  <h3>Debugging</h3>
 
@@ -2499,6 +2509,11 @@
  <a href="https://www.harmdevries.com/post/context-length/"><strong>Harm's blog for long context</strong></a>
  <p>Investigation into long context training in terms of data and training cost.</p>
  </div>
+
+ <div>
+ <a href="https://github.com/tunib-ai/large-scale-lm-tutorials"><strong>TunibAI's 3D parallelism tutorial</strong></a>
+ <p>Large-scale language modeling tutorials with PyTorch.</p>
+ </div>
 
  <h2>Appendix</h2>
 