LiyuanLucasLiu committed • Commit f8b8aeb • Parent(s): 70f36dd
uploaded tech report and revised readme
Browse files:
- .gitattributes +1 -0
- GRIN_MoE.pdf +3 -0
- README.md +4 -6
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+*.pdf filter=lfs diff=lfs merge=lfs -text
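The new `*.pdf` rule routes PDFs through the Git LFS clean/smudge filter instead of storing them in the repository directly; it is exactly the line that `git lfs track "*.pdf"` appends. As a minimal illustrative sketch (not part of the repo), here is how such patterns can be read back and matched against a path in Python, using `fnmatch` as a rough stand-in for gitattributes glob semantics:

```python
from fnmatch import fnmatch

def lfs_patterns(gitattributes_text: str) -> list[str]:
    """Collect the patterns whose attributes route files through Git LFS."""
    patterns = []
    for line in gitattributes_text.splitlines():
        parts = line.split()
        if parts and "filter=lfs" in parts[1:]:
            patterns.append(parts[0])
    return patterns

attrs = """\
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
*.pdf filter=lfs diff=lfs merge=lfs -text
"""

# fnmatch only approximates gitattributes glob rules, but it is close
# enough for simple "*.ext" patterns like the ones above.
for name in ("GRIN_MoE.pdf", "README.md"):
    tracked = any(fnmatch(name, p) for p in lfs_patterns(attrs))
    print(name, "-> LFS" if tracked else "-> regular git")
```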
GRIN_MoE.pdf ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:39e878f28a2bdd362f0bbe0bc0fa2ef9b827551d74e9a617a18e2b3923abb322
+size 1971199
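Note that the commit stores a Git LFS pointer, not the ~1.9 MB PDF itself: three `key value` lines recording the pointer-spec version, the SHA-256 of the actual content, and its size in bytes. A minimal sketch of verifying a downloaded blob against such a pointer (the file names below are hypothetical):

```python
import hashlib
from pathlib import Path

def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file."""
    return dict(line.split(" ", 1) for line in text.strip().splitlines())

def verify(pointer_path: str, blob_path: str) -> bool:
    """Check a local file against the oid/size recorded in an LFS pointer."""
    meta = parse_lfs_pointer(Path(pointer_path).read_text())
    blob = Path(blob_path).read_bytes()
    oid = meta["oid"].removeprefix("sha256:")
    return (hashlib.sha256(blob).hexdigest() == oid
            and len(blob) == int(meta["size"]))

# e.g. verify("GRIN_MoE.pdf.pointer", "GRIN_MoE.pdf") -> True if intact
```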
README.md CHANGED
@@ -17,16 +17,14 @@ library_name: transformers
 <h1 align="center"> 😁 MoE</h1>
 <h4 align="center">GRIN: <em>GR</em>adient-<em>IN</em>formed MoE</h4>
 <p align="center">
-    <a href="https://huggingface.co/microsoft/GRIN-MoE">Hugging Face</a> | <a href="https://
+    <a href="https://huggingface.co/microsoft/GRIN-MoE">Hugging Face</a> | <a href="https://huggingface.co/microsoft/GRIN-MoE/blob/main/GRIN_MoE.pdf">Tech Report</a> | <a href="https://huggingface.co/microsoft/GRIN-MoE/blob/main/LICENSE">License</a> | <a href="https://github.com/microsoft/GRIN-MoE">Github</a> | <a href="https://huggingface.co/microsoft/GRIN-MoE#usage">Get Started</a>
 <br>
 
-GRIN MoE
-It achieves exceptionally good performance across a diverse set of tasks, particularly in coding and mathematics tasks.
-Comparing to conventional MoE training, GRIN MoE differs in mostly two ways:
+- With **only 6.6B** active parameters, GRIN MoE achieves **exceptionally good** performance across a diverse set of tasks, particularly in coding and mathematics tasks.
 
-- GRIN uses SparseMixer-v2 to estimate the gradient related to expert routing, while the conventional MoE training treats expert gating as a proxy for the gradient estimation.
+- GRIN uses **SparseMixer-v2** to estimate the gradient related to expert routing, while conventional MoE training treats expert gating as a proxy for gradient estimation.
 
-- GRIN scales MoE training with neither expert parallelism nor token dropping
+- GRIN scales MoE training with **neither expert parallelism nor token dropping**, while conventional MoE training employs expert parallelism and token dropping.
 
 ## Intended Uses
 
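The revised README links to a "Get Started" section on the model card. For orientation, a minimal usage sketch under the standard transformers auto-class API follows; the model card's own usage section is authoritative, and `trust_remote_code=True` plus bf16 weights are assumptions here, not confirmed by this diff:

```python
# Minimal sketch, assuming the standard transformers auto-class API;
# the model card's usage section is the authoritative reference.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/GRIN-MoE"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,   # assumption: bf16 weights fit the hardware
    device_map="auto",
    trust_remote_code=True,       # assumption: custom MoE code ships with the repo
)

prompt = "Question: what is 7 * 6? Answer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```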