Upload model
README.md CHANGED
@@ -1,3 +1,47 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ language:
+ - en
+ pipeline_tag: text-generation
+ tags:
+ - nvidia
+ - Megatron-LM
+ - Mamba
+ - Mamba-2
+ - SSM
+ - 8B
+ library_name: Megatron-LM
+ ---
+
+ # An Empirical Study of Mamba-based Language Models
+
+ [Documentation](https://github.com/NVIDIA/Megatron-LM/tree/ssm/examples/mamba)   [Paper](https://arxiv.org/abs/2406.07887)   [Models](https://huggingface.co/collections/nvidia/ssms-666a362c5c3bb7e4a6bcfb9c)
+
+ ## Overview
+ We release the 8B-parameter [Mamba-2](https://arxiv.org/abs/2405.21060) and Mamba-2-Hybrid models (the hybrid is built from Mamba-2, attention, and MLP layers) trained for the paper [An Empirical Study of Mamba-based Language Models](https://arxiv.org/abs/2406.07887). These models were trained on 3.5T tokens with a sequence length of 4K and can be compared to the released 8B-parameter Transformer trained on the same data with the same hyperparameters. We also release the 32K and 128K long-context extensions of Mamba-2-Hybrid.
+
+ ### Model Version(s)
+
+ `mamba2-8b-3t-4k`: Pure 8B-parameter base Mamba-2 model trained on 3.5T tokens with 4K sequence length.
+
+ ### Toolkit
+ [Megatron-LM Framework](https://github.com/NVIDIA/Megatron-LM/tree/ssm/examples/mamba)
+
+ # Citations
+
+ See more details in our paper:
+
+ [An Empirical Study of Mamba-based Language Models](https://arxiv.org/abs/2406.07887)
+
+ _Roger Waleffe, Wonmin Byeon, Duncan Riach, Brandon Norick, Vijay Korthikanti, Tri Dao, Albert Gu, Ali Hatamizadeh, Sudhakar Singh, Deepak Narayanan, Garvit Kulshreshtha, Vartika Singh, Jared Casper, Jan Kautz, Mohammad Shoeybi, Bryan Catanzaro._ (2024)
+
+ Please cite the paper as follows if you use the models from this repo:
+
+ ```bibtex
+ @article{waleffe2024anempirical,
+   title   = {An Empirical Study of Mamba-based Language Models},
+   author  = {Roger Waleffe and Wonmin Byeon and Duncan Riach and Brandon Norick and Vijay Korthikanti and Tri Dao and Albert Gu and Ali Hatamizadeh and Sudhakar Singh and Deepak Narayanan and Garvit Kulshreshtha and Vartika Singh and Jared Casper and Jan Kautz and Mohammad Shoeybi and Bryan Catanzaro},
+   year    = {2024},
+   journal = {arXiv preprint arXiv:2406.07887}
+ }
+ ```
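Since this checkpoint ships in Megatron-LM format (see `library_name` above) rather than as a `transformers`-loadable model, the usual first step is to pull the repository contents locally and point the Megatron-LM Mamba example scripts linked under Toolkit at them. A minimal download sketch, assuming the repository id is `nvidia/mamba2-8b-3t-4k` and that `huggingface_hub` is installed:

```python
# Minimal sketch: fetch the files added in this commit to a local directory.
# The repo id and local directory name are assumptions for illustration.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="nvidia/mamba2-8b-3t-4k",  # assumed repo id for this model card
    local_dir="mamba2-8b-3t-4k",       # where the tokenizer + checkpoint land
)
print("Checkpoint downloaded to:", local_dir)
```

From there, the example scripts in the linked `examples/mamba` directory of the `ssm` branch consume the downloaded checkpoint directory.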
latest_checkpointed_iteration.txt ADDED
@@ -0,0 +1 @@
+ release
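This one-line tracker file is how Megatron-LM-style checkpoint loaders find the weights: its contents name the checkpoint subdirectory to load, and `release` here matches the `release/mp_rank_00/model_optim_rng.pt` shard added further down. A small sketch of that lookup under an assumed local download directory (not Megatron-LM's actual loader code):

```python
# Sketch: resolve the checkpoint shard path from the tracker file.
# Mirrors the file layout of this commit; the local directory is assumed.
from pathlib import Path

checkpoint_root = Path("mamba2-8b-3t-4k")  # assumed local download directory
tag = (checkpoint_root / "latest_checkpointed_iteration.txt").read_text().strip()

# tag is "release" here, so the weights live under release/mp_rank_00/.
weights = checkpoint_root / tag / "mp_rank_00" / "model_optim_rng.pt"
print("Resolved checkpoint shard:", weights)
```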
mt_nlg_plus_multilingual_ja_zh_the_stack_frac_015_256k.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5862e2f71caf762bc9845662be5fec2867deb58d874568235a02a36c5111cd09
+ size 4573028
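The filename and the Megatron-LM Mamba example setup suggest this `.model` file is a SentencePiece tokenizer model, with the `256k` suffix hinting at a 256K-entry vocabulary; the card itself does not say so, so treat that as an assumption. Under that assumption it can be inspected directly:

```python
# Sketch: inspect the tokenizer model, assuming it is a SentencePiece model
# (inferred from the .model extension; not stated in the model card).
import sentencepiece as spm

sp = spm.SentencePieceProcessor(
    model_file="mamba2-8b-3t-4k/mt_nlg_plus_multilingual_ja_zh_the_stack_frac_015_256k.model"
)
print("vocab size:", sp.vocab_size())
ids = sp.encode("An empirical study of Mamba-based language models.", out_type=int)
print("token ids:", ids)
print("round trip:", sp.decode(ids))
```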
release/mp_rank_00/model_optim_rng.pt ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:47c2766f6aad89d73beafbeaecb334aab902d7370906d081764a90bb7a8bbbcb
+ size 16474189490
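Both binary files are stored through Git LFS, so the pointer files above record only the SHA-256 digest and byte size of the real payloads. A short sketch that checks a local copy against those recorded values (local paths are illustrative):

```python
# Sketch: verify downloaded LFS payloads against the oid/size recorded in the
# pointer files above. The hashes and sizes come from this commit; paths are assumed.
import hashlib
import os

EXPECTED = {
    "mamba2-8b-3t-4k/release/mp_rank_00/model_optim_rng.pt": (
        "47c2766f6aad89d73beafbeaecb334aab902d7370906d081764a90bb7a8bbbcb",
        16474189490,
    ),
    "mamba2-8b-3t-4k/mt_nlg_plus_multilingual_ja_zh_the_stack_frac_015_256k.model": (
        "5862e2f71caf762bc9845662be5fec2867deb58d874568235a02a36c5111cd09",
        4573028,
    ),
}

for path, (oid, size) in EXPECTED.items():
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    ok = digest.hexdigest() == oid and os.path.getsize(path) == size
    print(f"{path}: {'OK' if ok else 'MISMATCH'}")
```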