---
license: apache-2.0
---
# tangled-llama-j-128k-v0.1
## Train Tokenizer
```bash
python -B train_tokenizer.py
```
Tokenizer training log:
```
Resolving data files: 100%|████████████████████████████████████████████████████████████████| 132/132 [00:00<00:00, 266.56it/s]
Loading dataset shards: 100%|█████████████████████████████████████████████████████████████████| 18/18 [00:05<00:00, 3.24it/s]
Resolving data files: 100%|█████████████████████████████████████████████████████████| 133/133 [00:00<00:00, 306844.02it/s]
[00:21:52] Pre-processing sequences ████████████████████████████████████████████████ 0 / 0
[00:00:48] Tokenize words ████████████████████████████████████████████████ 25635525 / 25635525
[00:01:17] Count pairs ████████████████████████████████████████████████ 25635525 / 25635525
[00:06:07] Compute merges ████████████████████████████████████████████████ 32066 / 32066
```
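The "Compute merges 32066 / 32066" line suggests a vocabulary of roughly 32k tokens. As a rough illustration only, a byte-level BPE training script of this shape might look like the sketch below; the dataset, vocabulary size, and special token are assumptions, not the actual configuration of `train_tokenizer.py`.

```python
# Hedged sketch of a train_tokenizer.py-style script using the Hugging Face
# `tokenizers` library. Dataset, vocab size, and special token are placeholders.
from datasets import load_dataset
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Placeholder corpus; the dataset actually used for this model is not shown here.
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32_768,                 # assumed; the log shows ~32k merges
    special_tokens=["<|endoftext|>"],  # assumed special token
    show_progress=True,
)

def text_iterator(batch_size=1_000):
    # Yield batches of raw text so the whole corpus never sits in memory at once.
    for i in range(0, len(dataset), batch_size):
        yield dataset[i : i + batch_size]["text"]

tokenizer.train_from_iterator(text_iterator(), trainer=trainer)
tokenizer.save("tokenizer.json")
```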
## Pretrain
```bash
python -B prepare_pretrain_dataset.py
```
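`prepare_pretrain_dataset.py` itself is not shown here. As a hedged sketch of one common approach, the corpus can be tokenized with the tokenizer trained above and the flat token stream written to disk; the dataset name, special token, and output filename below are placeholders, and the real script may instead emit litgpt's streaming dataset format.

```python
# Hedged sketch: tokenize a corpus and pack the token ids into a single binary file.
import numpy as np
from datasets import load_dataset
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
dataset = load_dataset("wikitext", "wikitext-103-raw-v1", split="train")  # placeholder corpus

eot_id = tokenizer.token_to_id("<|endoftext|>")  # assumed special token

ids = []
for record in dataset:
    encoded = tokenizer.encode(record["text"])
    ids.extend(encoded.ids)
    ids.append(eot_id)  # separate documents with the end-of-text token

# uint16 is enough for a ~32k vocabulary; adjust if the vocab exceeds 65,535.
arr = np.array(ids, dtype=np.uint16)
arr.tofile("pretrain_tokens.bin")
print(f"wrote {arr.size:,} tokens")
```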
```bash
CUDA_VISIBLE_DEVICES=0 CUDA_LAUNCH_BLOCKING=0 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True litgpt pretrain --config pretrain-model.yaml
```
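Once pretraining finishes, the resulting checkpoint can be sanity-checked with litgpt's Python API. The checkpoint directory below is hypothetical; it depends on the `out_dir` set in `pretrain-model.yaml`.

```python
# Hedged sketch: load the pretrained checkpoint and sample a few tokens.
from litgpt import LLM

llm = LLM.load("out/pretrain/final")  # hypothetical output directory
print(llm.generate("Once upon a time", max_new_tokens=64))
```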