Update README.md
README.md CHANGED
@@ -9,6 +9,11 @@ metrics:
  - mauve
---

+# Self-Distillation Through Time (SDTT)
+SDTT is a distillation method for diffusion language models. Recent diffusion language models such as [SEDD](https://huggingface.co/louaaron/sedd-small) or [MDLM](https://huggingface.co/kuleshov-group/mdlm-owt) achieve great results.
+However, because they cannot use KV-caching (non-causal architecture), sampling from them is slow. Therefore, we devise a novel distillation method to reduce the inference latency of discrete diffusion models.
+After distillation, we can sample up to 8x faster than GPT-2 (which uses KV-caching). Find more details below and on [our GitHub repo](https://github.com/jdeschena/sdtt).
+
## Using SDTT
We released 3 groups of models:
1. The **baseline students** distilled with the `kld`, `mse`, and `tvd` objectives from a model trained for 1M steps.