xrsrke/link_nanotron_fp8_appexdix

#21
by neuralink - opened
dist/bibliography.bib CHANGED
@@ -510,4 +510,10 @@ url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
   archivePrefix={arXiv},
   primaryClass={cs.LG},
   url={https://arxiv.org/abs/2309.14322},
+ }
+ @software{nanotronfp8,
+ title = {nanotron's FP8 implementation},
+ author = {nanotron},
+ url = {https://github.com/huggingface/nanotron/pull/70},
+ year = {2024}
  }
dist/index.html CHANGED
@@ -2215,7 +2215,7 @@
  </tbody>
  </table>
 
- <p>Overall, FP8 is still an experimental technique and methods are evolving, but will likely become the standard soon replacing bf16 mixed-precision. To follow public implementations of this, please head to the nanotron’s implementation in [TODO: link to appendix]. </p>
+ <p>Overall, FP8 is still an experimental technique and methods are evolving, but will likely become the standard soon replacing bf16 mixed-precision. To follow public implementations of this, please head to the nanotron’s implementation<d-cite bibtex-key="nanotronfp8"></d-cite>. </p>
 
  <p>In the future, Blackwell, the next generation of NVIDIA chips, <a href="https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/">have been announced </a> to support FP4 training, further speeding up training but without a doubt also introducing a new training stability challenge.</p>
 
@@ -2382,6 +2382,16 @@
  <p>Training language models across compute clusters with DiLoCo.</p>
  </div>
 
+ <div>
+ <a href="https://github.com/kakaobrain/torchgpipe"><strong>torchgpipe</strong></a>
+ <p>A GPipe implementation in PyTorch.</p>
+ </div>
+
+ <div>
+ <a href="https://github.com/EleutherAI/oslo"><strong>OSLO</strong></a>
+ <p>OSLO: Open Source for Large-scale Optimization.</p>
+ </div>
+
  <h3>Debugging</h3>
 
  <div>
@@ -2499,6 +2509,11 @@
  <a href="https://www.harmdevries.com/post/context-length/"><strong>Harm's blog for long context</strong></a>
  <p>Investigation into long context training in terms of data and training cost.</p>
  </div>
+
+ <div>
+ <a href="https://github.com/tunib-ai/large-scale-lm-tutorials"><strong>TunibAI's 3D parallelism tutorial</strong></a>
+ <p>Large-scale language modeling tutorials with PyTorch.</p>
+ </div>
 
  <h2>Appendix</h2>
 
src/bibliography.bib CHANGED
@@ -510,4 +510,10 @@ url = {https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md}
   archivePrefix={arXiv},
   primaryClass={cs.LG},
   url={https://arxiv.org/abs/2309.14322},
+ }
+ @software{nanotronfp8,
+ title = {nanotron's FP8 implementation},
+ author = {nanotron},
+ url = {https://github.com/huggingface/nanotron/pull/70},
+ year = {2024}
  }
src/index.html CHANGED
@@ -2215,7 +2215,7 @@
  </tbody>
  </table>
 
- <p>Overall, FP8 is still an experimental technique and methods are evolving, but will likely become the standard soon replacing bf16 mixed-precision. To follow public implementations of this, please head to the nanotron’s implementation in [TODO: link to appendix]. </p>
+ <p>Overall, FP8 is still an experimental technique and methods are evolving, but will likely become the standard soon replacing bf16 mixed-precision. To follow public implementations of this, please head to the nanotron’s implementation<d-cite bibtex-key="nanotronfp8"></d-cite>. </p>
 
  <p>In the future, Blackwell, the next generation of NVIDIA chips, <a href="https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/">have been announced </a> to support FP4 training, further speeding up training but without a doubt also introducing a new training stability challenge.</p>
 
@@ -2381,6 +2381,16 @@
  <a href="https://github.com/PrimeIntellect-ai/OpenDiLoCo"><strong>DiLoco</strong></a>
  <p>Training language models across compute clusters with DiLoCo.</p>
  </div>
+
+ <div>
+ <a href="https://github.com/kakaobrain/torchgpipe"><strong>torchgpipe</strong></a>
+ <p>torchgpipe: On-the-fly Pipeline Parallelism for Training Giant Models.</p>
+ </div>
+
+ <div>
+ <a href="https://github.com/EleutherAI/oslo"><strong>OSLO</strong></a>
+ <p>OSLO: Open Source for Large-scale Optimization.</p>
+ </div>
 
  <h3>Debugging</h3>
 
@@ -2499,6 +2509,11 @@
  <a href="https://www.harmdevries.com/post/context-length/"><strong>Harm's blog for long context</strong></a>
  <p>Investigation into long context training in terms of data and training cost.</p>
  </div>
+
+ <div>
+ <a href="https://github.com/tunib-ai/large-scale-lm-tutorials"><strong>TunibAI's 3D parallelism tutorial</strong></a>
+ <p>Large-scale language modeling tutorials with PyTorch.</p>
+ </div>
 
  <h2>Appendix</h2>
 