Commit 4918b60 (verified) · BerenMillidge committed · 1 Parent(s): 24947ee

Update README.md

Files changed (1):
  1. README.md +4 -12
README.md CHANGED
@@ -62,9 +62,7 @@ print(tokenizer.decode(outputs[0]))
 
 Zamba2-1.2B utilizes and extends our original Zamba hybrid SSM-attention architecture. The core Zamba architecture consists of a backbone of Mamba layers interleaved with one or more shared attention layers (one shared attention in Zamba1, two in Zamba2). This attention has shared weights to minimize the parameter cost of the model. We find that concatenating the original model embeddings to the input to this attention block improves performance, likely due to better maintenance of information across depth. The Zamba2 architecture also applies LoRA projection matrices to the shared transformer blocks to gain some additional expressivity in each block and allow each shared block to specialize slightly to its own unique position while keeping the additional parameter overhead small.
 
- <center>
- <img src="https://cdn-uploads.huggingface.co/production/uploads/65c05e75c084467acab2f84a/XrEIEBxd0fqIgh3LyArAV.png" width="300" alt="Zamba architecture">
- </center>
+ TODO
 
 
 ## Performance
@@ -73,22 +71,16 @@ Zamba2-1.2B achieves leading and state-of-the-art performance among models of <3
 
 Zamba2-1.2B's high performance and small inference compute and memory footprint renders it an ideal generalist model for on-device applications.
 
- <center>
- <img src="https://cdn-uploads.huggingface.co/production/uploads/65c05e75c084467acab2f84a/U7VD9PYLj3XcEjgV08sP5.png" width="700" alt="Zamba performance">
- </center>
+ TODO
 
- <center>
- <img src="https://cdn-uploads.huggingface.co/production/uploads/65bc13717c6ad1994b6619e9/3u8k7tcRi-oC_ltGhdHAk.png" width="800" alt="Zamba performance">
- </center>
+ TODO
 
 Time to First Token (TTFT) | Output Generation
 :-------------------------:|:-------------------------:
 ![](https://cdn-uploads.huggingface.co/production/uploads/65bc13717c6ad1994b6619e9/BmE8X6tDNVw5OJcbZt8sZ.png) | ![](https://cdn-uploads.huggingface.co/production/uploads/65bc13717c6ad1994b6619e9/wECc9cItK1FW1MOMGSLrp.png)
 
 
- <center>
- <img src="https://cdn-uploads.huggingface.co/production/uploads/65bc13717c6ad1994b6619e9/nhoss41xlzfEBZzcQXI6z.png" width="700" alt="Zamba inference and memory cost">
- </center>
+ TODO
 
 ## Notice
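
The architecture paragraph carried through as diff context above describes the key mechanism: a backbone of Mamba layers, a transformer block whose attention weights are shared across every depth at which it is inserted, the original token embeddings concatenated onto that block's input, and per-position LoRA projections so each insertion point can specialize cheaply. Below is a minimal PyTorch sketch of that mechanism; the class and argument names (`SharedAttentionBlock`, `n_positions`, `rank`) are illustrative assumptions and not the actual `transformers_zamba2` implementation.

```python
# Illustrative sketch only: shared attention weights reused at several depths,
# original embeddings concatenated onto the block input, and a per-position
# LoRA adapter on the fused qkv projection. All names are hypothetical and
# causal masking is omitted for brevity.
import torch
import torch.nn as nn


class SharedAttentionBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int, n_positions: int, rank: int = 8):
        super().__init__()
        # Base projection shared by every position at which the block is reused.
        self.qkv = nn.Linear(2 * d_model, 3 * d_model, bias=False)
        # Small low-rank adapters, one pair per reuse position, so each
        # insertion point can specialize at little parameter cost.
        self.lora_a = nn.ModuleList(nn.Linear(2 * d_model, rank, bias=False) for _ in range(n_positions))
        self.lora_b = nn.ModuleList(nn.Linear(rank, 3 * d_model, bias=False) for _ in range(n_positions))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, hidden: torch.Tensor, embeddings: torch.Tensor, position: int) -> torch.Tensor:
        # Concatenate the original token embeddings to the current hidden state,
        # which helps the shared block retain information across depth.
        x = torch.cat([hidden, embeddings], dim=-1)
        qkv = self.qkv(x) + self.lora_b[position](self.lora_a[position](x))
        q, k, v = qkv.chunk(3, dim=-1)
        attn_out, _ = self.attn(q, k, v, need_weights=False)
        return hidden + self.out(attn_out)


# Usage sketch: a backbone of stand-in layers with one shared block invoked
# every few layers, each time under a different `position` index.
if __name__ == "__main__":
    d_model, n_heads, n_layers, every = 64, 4, 12, 4
    shared = SharedAttentionBlock(d_model, n_heads, n_positions=n_layers // every)
    backbone = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_layers))  # stand-in for Mamba layers

    tokens = torch.randn(2, 16, d_model)    # pretend embeddings: (batch, seq, d_model)
    hidden = tokens
    for i, layer in enumerate(backbone):
        hidden = torch.relu(layer(hidden))  # stand-in for a Mamba mixer
        if i % every == 0:
            hidden = shared(hidden, tokens, position=i // every)
    print(hidden.shape)                     # torch.Size([2, 16, 64])
```

In this sketch, each additional reuse of the shared block costs only a few rank-by-`d_model` adapter matrices rather than a full transformer block's worth of weights, which is the small parameter overhead the paragraph refers to.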