|
---
|
|
license: apache-2.0
|
|
---
|
|
|
|
# tangled-llama-j-128k-v0.1
|
|
|
|
## Train Tokenizer
|
|
|
|
```bash
|
|
python -B train_tokenizer.py
|
|
```
|
|
|
|
Tokenizer training log:
|
|
```
Resolving data files: 100%|████████████████████████████████████████████████████████████████| 132/132 [00:00<00:00, 266.56it/s]
Loading dataset shards: 100%|█████████████████████████████████████████████████████████████████| 18/18 [00:05<00:00, 3.24it/s]
Resolving data files: 100%|█████████████████████████████████████████████████████████| 133/133 [00:00<00:00, 306844.02it/s]
[00:21:52] Pre-processing sequences ████████████████████████████████████████████████ 0 / 0
[00:00:48] Tokenize words           ████████████████████████████████████████████████ 25635525 / 25635525
[00:01:17] Count pairs              ████████████████████████████████████████████████ 25635525 / 25635525
[00:06:07] Compute merges           ████████████████████████████████████████████████ 32066 / 32066
```
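
The progress lines above ("Tokenize words", "Count pairs", "Compute merges") match the Hugging Face `tokenizers` BPE trainer, and the merge count suggests a vocabulary around 32k. A minimal sketch of what `train_tokenizer.py` along those lines can look like (the corpus iterator, vocabulary size, and special tokens below are assumptions, not the actual script):

```python
from tokenizers import Tokenizer, decoders, models, pre_tokenizers, trainers

# Byte-level BPE, similar to Llama-style tokenizers.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
tokenizer.decoder = decoders.ByteLevel()

trainer = trainers.BpeTrainer(
    vocab_size=32768,                  # assumption; the log shows ~32k merges
    special_tokens=["<|endoftext|>"],  # assumption
)

def text_iterator():
    # Placeholder: yield raw text strings from the actual pretraining corpus.
    yield from ("example document one", "example document two")

tokenizer.train_from_iterator(text_iterator(), trainer=trainer)
tokenizer.save("tokenizer.json")
```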
|
|
|
|
## Pretrain
|
|
|
|
```bash
|
|
python -B prepare_pretrain_dataset.py
|
|
```
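
`prepare_pretrain_dataset.py` itself is not reproduced here. As a rough sketch of how such a script is commonly written for litgpt, using `litdata.optimize` to tokenize raw text into the streaming format the pretraining run reads (the file layout, paths, and chunk size are assumptions):

```python
from functools import partial
from pathlib import Path

from litdata import optimize
from litgpt.tokenizer import Tokenizer


def tokenize(filename: str, tokenizer: Tokenizer):
    # Yield one token tensor per document; eos marks document boundaries.
    with open(filename, encoding="utf-8") as f:
        for line in f:
            yield tokenizer.encode(line, eos=True)


if __name__ == "__main__":
    files = sorted(str(p) for p in Path("data/raw").glob("*.txt"))  # assumed layout
    optimize(
        fn=partial(tokenize, tokenizer=Tokenizer(Path("checkpoints/tokenizer"))),
        inputs=files,
        output_dir="data/pretrain",   # assumed output location
        chunk_bytes="64MB",
        num_workers=4,
    )
```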
|
|
|
|
```bash
|
|
CUDA_VISIBLE_DEVICES=0 CUDA_LAUNCH_BLOCKING=0 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True litgpt pretrain --config pretrain-model.yaml
|
|
```
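
The referenced `pretrain-model.yaml` is not included in this card. For orientation, litgpt pretraining configs follow a common schema; the excerpt below is purely illustrative, and every value in it (model definition, data module, token budget, optimizer settings) is an assumption rather than the actual configuration:

```yaml
# Illustrative sketch only; not the actual pretrain-model.yaml.
model_config:
  name: tangled-llama-j-128k-v0.1
  block_size: 131072              # 128k context, inferred from the model name
out_dir: out/pretrain
tokenizer_dir: checkpoints/tokenizer
data:
  class_path: litgpt.data.LitData # assumes the litdata-prepared dataset above
  init_args:
    data_path: data/pretrain
train:
  micro_batch_size: 1             # illustrative values
  global_batch_size: 512
  max_tokens: 3000000000
optimizer:
  class_path: torch.optim.AdamW
  init_args:
    lr: 4.0e-4
    weight_decay: 0.1
```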
|
|
|
|
## Chat with the Pretrained Model
|
|
|
|
```bash
|
|
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True CUDA_VISIBLE_DEVICES="0" litgpt chat out/pretrain/final/
|
|
```
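
Alternatively, recent litgpt releases also expose a small Python API. Assuming such a version is installed, the same checkpoint can be loaded programmatically:

```python
from litgpt import LLM

# Load the pretrained checkpoint produced by the run above.
llm = LLM.load("out/pretrain/final/")
print(llm.generate("Once upon a time", max_new_tokens=64))
```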
|
|
|