Andrew DalPino committed on
Commit ea296b2 · 1 Parent(s): 9c2b560

Export small model at 850 epochs

README.md CHANGED
@@ -14,11 +14,11 @@ tags:
14
  ---
15
  # LightGPT
16
 
17
- LightGPT is a lightweight generative pretrained Transformer (GPT) language model for the people! Built using PyTorch and trained on the Fineweb and SmolTalk datasets, LightGPT can answer questions, follow instructions, summarize documents, chat, and more. Best of all, the model weights *and* code are fully open-source for you to customize, improve upon, and share with the world.
18
 
19
  ## Features
20
 
21
- - **No positional embeddings**: LightGPT aims to be a more parsimonious model by completely removing positional embeddings from the architecture. This allows for a variable context length without complex model surgery. Despite having no positional embeddings (NoPE), LightGPT performs better at context length generalization than the best relative embeddings (ALiBi, RoPE, T5) offering good performance even at 2X of the trained context length.
22
 
23
  - **Low Memory Utilization**: LightGPT lets you progressively employ training-time memory optimizations such as fully-sharded data-parallel (FSDP), activation checkpointing, mixed precision, and low-memory optimizer updates that allow you to train larger models on smaller hardware.
24
 
@@ -26,7 +26,7 @@ LightGPT is a lightweight generative pretrained Transformer (GPT) language model
26
 
27
  ## Suggested Pretraining Configurations
28
 
29
- Below is a table of some suggested pretraining configurations but feel free to experiment with settings on your own. See the `model_sizing.ipynb` notebook to estimate the memory and compute requirements for your model configuration.
30
 
31
  | Name | Vocab. Size | Embedding Dim. | Attn. Heads | Layers | Parameters | Training Tokens |
32
  |---|---|---|---|---|---|---|
@@ -39,7 +39,7 @@ Below is a table of some suggested pretraining configurations but feel free to e
39
 
40
  We typically recommend a training `block size` (also referred to as context length) of between 1024 and 4096 for standard models and 4096 or higher for long-context applications such as conversational chatbots, retrieval-augmented generation, and chain-of-thought prompting.
41
 
42
- **Note**: LightGPT can be trained using variable block sizes since the architecture does not depend on any discrete positional embeddings. This flexibility allows you to gradually extend the context length.
43
 
44
  ## Install Project Dependencies
45
 
@@ -91,19 +91,19 @@ torchrun --standalone --nnodes=1 --nproc-per-node=8 pretrain.py --batch_size=16
91
  | --token_encoding | "r50k_base" | str | The Tiktoken encoding scheme to use when tokenizing the dataset. Options include `r50k_base`, `p50k_base`, `cl100k_base`, and `o200k_base`. |
92
  | --dataset_path | "./datasets" | str | The path to the preprocessed dataset files on disk. |
93
  | --num_dataset_processes | 8 | int | The number of processes (CPUs) to use to process the dataset. |
94
- | --batch_size | 1 | int | The number of samples to pass through the network at a time. |
95
- | --gradient_accumulation_steps | 128 | int | The number of batches to pass through the network before updating the weights. |
96
- | --tokens_per_sample | 1024 | int | The number of tokens to pack into a single training sequence. This is sometimes called the context length or block size. |
97
  | --samples_per_epoch | 4096 | int | The number of training samples to pass through the network every epoch. |
98
  | --num_epochs | 1686 | int | The number of epochs to train for. |
99
  | --learning_rate | 1e-2 | float | The learning rate of the Adafactor optimizer. |
100
  | --rms_decay | -0.8 | float | The decay rate of the RMS coefficient of the Adafactor optimizer. |
101
  | --low_memory_optimizer | False | bool | Should the optimizer reduce its memory consumption in exchange for a slightly slower runtime? |
102
- | --max_gradient_norm | 1.0 | float | Clip gradients above this threshold before stepping. |
103
  | --eval_interval | 10 | int | Evaluate the model after this many epochs on the testing set. |
104
  | --embedding_dimensions | 1024 | int | The dimensionality of the token embeddings. |
105
- | --num_attention_heads | 16 | int | The number of attention heads within every block. |
106
- | --num_hidden_layers | 24 | int | The number of attention/MLP blocks within the hidden layer of the network. |
107
  | --feed_forward_ratio | 4 | (1, 2, 4) | The ratio of hidden neurons to embedding dimensions in the MLP layers of the network. |
108
  | --dropout | 0.1 | float | The proportion of signals to send to zero during training as regularization. |
109
  | --activation_checkpointing | False | bool | Should we use activation checkpointing? This will drastically reduce memory utilization during training at the cost of recomputing the forward pass. |
@@ -117,7 +117,7 @@ torchrun --standalone --nnodes=1 --nproc-per-node=8 pretrain.py --batch_size=16
117
 
118
  ### Training Dashboard
119
 
120
- We use TensorBoard to capture and display pretraining events such as loss and gradient norm updates. To launch the dashboard server run the following command from the terminal.
121
 
122
  ```
123
  tensorboard --logdir=./runs
@@ -132,7 +132,7 @@ Then navigate to the dashboard using your favorite web browser.
132
  | Argument | Default | Type | Description |
133
  |---|---|---|---|
134
  | --base_model_path | "./checkpoints/checkpoint.pt" | string | The path to the base checkpoint on disk. |
135
- | --max_tokens_per_sample | 2048 | int | The maximum number of tokens to pack into a single training sequence. |
136
  | --mask_input | False | bool | Should we mask the input part of the training sequences i.e. only train on the supervised output? |
137
  | --batch_size | 1 | int | The number of samples to pass through the network at a time. |
138
  | --gradient_accumulation_steps | 64 | int | The number of batches to pass through the network before updating the weights. |
@@ -174,7 +174,7 @@ python generate.py --top_k=500 --top_p=0.9
174
  | --device | "cuda" | string | The device to run the computation on. |
175
  | --seed | None | int | The seed for the random number generator. |
176
 
177
- We also provide a script that samples entire sequences rather than single tokens independently which we call `beam_search.py`. Beam Search maintains a list of the top `beam_width` sequence candidates and outputs the top `num_candidates` completed sequences with the highest overall priority. It is a form of greedy search that works well for some things like text summarization and translation but often results in less natural responses as natural language follows a more stochastic process.
178
 
179
  ```
180
  python beam_search.py --beam_width=16 --num_candidates=3
@@ -194,7 +194,8 @@ python beam_search.py --beam_width=16 --num_candidates=3
194
  | --seed | None | int | The seed for the random number generator. |
195
 
196
  ## References:
197
- >- G. Penedo, et al. The FineWeb Datasts: Decanting the Web for the Finest Text Data at Scale, 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks.
 
198
  >- A. Radford, et al. Language Models are Unsupervised Multitask Learners, OpenAI, 2019.
199
  >- T. Brown, et al. Language Models are Few-Shot Learners. OpenAI, 2020.
200
  >- A. Kazemnejad, et al. The Impact of Positional Encoding on Length Generalization in Transformers, 37th Conference on Neural Information Processing Systems (NeurIPS 2023).
 
14
  ---
15
  # LightGPT
16
 
17
+ LightGPT is a lightweight generative pretrained Transformer (GPT) language model for the people! Built using [PyTorch](https://pytorch.org/) and trained on HuggingFace's [Fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb) and [SmolTalk](https://huggingface.co/datasets/HuggingFaceTB/smoltalk) datasets, LightGPT can answer questions, follow instructions, summarize documents, chat, and more. Best of all, the model weights *and* code are fully open-source for you to customize, improve upon, and share with the world.
18
 
19
  ## Features
20
 
21
+ - **No positional embeddings**: LightGPT aims to be a more parsimonious model by completely removing positional embeddings from the architecture. This allows for a variable context length without complex model surgery. Despite having no positional embeddings (NoPE), LightGPT performs better at context length generalization than the best relative embeddings (ALiBi, RoPE, T5), offering good performance even when operating within 2X of the trained context window.
22
 
23
  - **Low Memory Utilization**: LightGPT lets you progressively employ training-time memory optimizations such as fully-sharded data-parallel (FSDP), activation checkpointing, mixed precision, and low-memory optimizer updates that allow you to train larger models on smaller hardware.
24
 
 
26
 
27
  ## Suggested Pretraining Configurations
28
 
29
+ Below is a table of some suggested model pretraining configurations, but feel free to experiment with settings on your own. See the `model_sizing.ipynb` notebook to estimate the memory and compute requirements for your model configuration.
30
 
31
  | Name | Vocab. Size | Embedding Dim. | Attn. Heads | Layers | Parameters | Training Tokens |
32
  |---|---|---|---|---|---|---|
 
39
 
40
  We typically recommend a training `block size` (also referred to as context length) of between 1024 and 4096 for standard models and 4096 or higher for long-context applications such as conversational chatbots, retrieval-augmented generation, and chain-of-thought prompting.
41
 
42
+ **Note**: LightGPT can be trained using variable block sizes since the architecture does not depend on any discrete positional embeddings. This flexibility allows you to progressively extend the context window during training.
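For intuition, here is a minimal sketch (not LightGPT's actual module; the class, names, and dimensions are illustrative) of why attention without positional embeddings places no constraint on sequence length: the only length-dependent piece is the causal mask, which is built on the fly for whatever length arrives.

```
import torch
from torch import nn
from torch.nn import functional as F


class CausalSelfAttention(nn.Module):
    """Causal self-attention with no positional embeddings (NoPE).

    A sketch for illustration only, not LightGPT's implementation.
    """

    def __init__(self, embedding_dimensions: int, num_heads: int):
        super().__init__()

        self.qkv = nn.Linear(embedding_dimensions, 3 * embedding_dimensions)
        self.out = nn.Linear(embedding_dimensions, embedding_dimensions)
        self.num_heads = num_heads

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape  # batch, tokens, embedding dimensions

        q, k, v = self.qkv(x).chunk(3, dim=-1)

        # Split into heads: (b, t, d) -> (b, heads, t, head dim).
        q, k, v = (
            z.view(b, t, self.num_heads, d // self.num_heads).transpose(1, 2)
            for z in (q, k, v)
        )

        # `is_causal` builds the mask on the fly for whatever length t is.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)

        return self.out(y.transpose(1, 2).reshape(b, t, d))


attention = CausalSelfAttention(embedding_dimensions=64, num_heads=4)

# The same weights handle short and long contexts without any model surgery.
print(attention(torch.randn(1, 128, 64)).shape)   # torch.Size([1, 128, 64])
print(attention(torch.randn(1, 2048, 64)).shape)  # torch.Size([1, 2048, 64])
```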
43
 
44
  ## Install Project Dependencies
45
 
 
91
  | --token_encoding | "r50k_base" | str | The Tiktoken encoding scheme to use when tokenizing the dataset. Options include `r50k_base`, `p50k_base`, `cl100k_base`, and `o200k_base`. |
92
  | --dataset_path | "./datasets" | str | The path to the preprocessed dataset files on disk. |
93
  | --num_dataset_processes | 8 | int | The number of processes (CPUs) to use to process the dataset. |
94
+ | --batch_size | 1 | int | The number of samples of size `tokens_per_sample` to pass through the network at a time. |
95
+ | --gradient_accumulation_steps | 128 | int | The number of batches to pass through the network before updating the model weights. |
96
+ | --tokens_per_sample | 1024 | int | The number of tokens to pack into a single training sequence. This is sometimes called the block size or context length. |
97
  | --samples_per_epoch | 4096 | int | The number of training samples to pass through the network every epoch. |
98
  | --num_epochs | 1686 | int | The number of epochs to train for. |
99
  | --learning_rate | 1e-2 | float | The learning rate of the Adafactor optimizer. |
100
  | --rms_decay | -0.8 | float | The decay rate of the RMS coefficient of the Adafactor optimizer. |
101
  | --low_memory_optimizer | False | bool | Should the optimizer reduce its memory consumption in exchange for a slightly slower runtime? |
102
+ | --max_gradient_norm | 1.0 | float | Clip gradients above this threshold norm before stepping. |
103
  | --eval_interval | 10 | int | Evaluate the model after this many epochs on the testing set. |
104
  | --embedding_dimensions | 1024 | int | The dimensionality of the token embeddings. |
105
+ | --num_attention_heads | 16 | int | The number of attention heads within every attention layer. |
106
+ | --num_hidden_layers | 24 | int | The number of attention/MLP blocks within the body of the network. |
107
  | --feed_forward_ratio | 4 | (1, 2, 4) | The ratio of hidden neurons to embedding dimensions in the MLP layers of the network. |
108
  | --dropout | 0.1 | float | The proportion of signals to send to zero during training as regularization. |
109
  | --activation_checkpointing | False | bool | Should we use activation checkpointing? This will drastically reduce memory utilization during training at the cost of recomputing the forward pass. |
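As a quick sanity check on the defaults above, the number of tokens contributing to each optimizer update is simply `batch_size * gradient_accumulation_steps * tokens_per_sample`. The back-of-the-envelope sketch below is for intuition only and is not output from `pretrain.py`.

```
# Back-of-the-envelope math for the default pretraining arguments above.
batch_size = 1
gradient_accumulation_steps = 128
tokens_per_sample = 1024
samples_per_epoch = 4096

sequences_per_step = batch_size * gradient_accumulation_steps  # 128
tokens_per_step = sequences_per_step * tokens_per_sample       # 131,072
steps_per_epoch = samples_per_epoch // sequences_per_step      # 32
tokens_per_epoch = samples_per_epoch * tokens_per_sample       # 4,194,304

print(f"{tokens_per_step=:,} {steps_per_epoch=} {tokens_per_epoch=:,}")
```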
 
117
 
118
  ### Training Dashboard
119
 
120
+ We use [TensorBoard](https://www.tensorflow.org/tensorboard) to capture and display pretraining events such as loss and gradient norm updates. To launch the dashboard server, run the following command from the terminal.
121
 
122
  ```
123
  tensorboard --logdir=./runs
 
132
  | Argument | Default | Type | Description |
133
  |---|---|---|---|
134
  | --base_model_path | "./checkpoints/checkpoint.pt" | string | The path to the base checkpoint on disk. |
135
+ | --max_tokens_per_sample | 1024 | int | The maximum number of tokens to pack into a single training sequence. |
136
  | --mask_input | False | bool | Should we mask the input part of the training sequences i.e. only train on the supervised output? |
137
  | --batch_size | 1 | int | The number of samples to pass through the network at a time. |
138
  | --gradient_accumulation_steps | 64 | int | The number of batches to pass through the network before updating the weights. |
 
174
  | --device | "cuda" | string | The device to run the computation on. |
175
  | --seed | None | int | The seed for the random number generator. |
176
 
177
+ We also provide a script, `beam_search.py`, that samples entire sequences rather than sampling one token at a time. Beam search maintains a list of the top `beam_width` candidate sequences and outputs the `num_candidates` completed sequences with the highest overall score. It is a form of greedy search that works well for tasks such as text summarization and translation but often results in less natural-sounding responses and may even repeat certain sequences.
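For intuition, the procedure looks roughly like the sketch below. It is a simplified illustration, not the project's `beam_search.py`; the `next_log_probs` callable stands in for the model's forward pass.

```
from typing import Callable


def beam_search(
    prompt: list[int],
    next_log_probs: Callable[[list[int]], dict[int, float]],
    eos_token: int,
    beam_width: int = 16,
    num_candidates: int = 3,
    max_new_tokens: int = 32,
) -> list[tuple[list[int], float]]:
    # Each beam is a (sequence, cumulative log-probability) pair.
    beams = [(prompt, 0.0)]
    completed = []

    for _ in range(max_new_tokens):
        expansions = []

        for sequence, score in beams:
            for token, log_prob in next_log_probs(sequence).items():
                candidate = (sequence + [token], score + log_prob)

                if token == eos_token:
                    completed.append(candidate)
                else:
                    expansions.append(candidate)

        # Keep only the `beam_width` highest-scoring partial sequences.
        beams = sorted(expansions, key=lambda beam: beam[1], reverse=True)[:beam_width]

        if not beams:
            break

    # Unfinished beams still count as candidates if nothing reached EOS.
    completed.extend(beams)

    return sorted(completed, key=lambda beam: beam[1], reverse=True)[:num_candidates]
```

The actual script is invoked like so.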
178
 
179
  ```
180
  python beam_search.py --beam_width=16 --num_candidates=3
 
194
  | --seed | None | int | The seed for the random number generator. |
195
 
196
  ## References:
197
+ >- G. Penedo, et al. The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale, 38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks.
198
+ >- L. B. Allal, et al. SmolLM2 - with great data, comes great performance, 2024.
199
  >- A. Radford, et al. Language Models are Unsupervised Multitask Learners, OpenAI, 2019.
200
  >- T. Brown, et al. Language Models are Few-Shot Learners. OpenAI, 2020.
201
  >- A. Kazemnejad, et al. The Impact of Positional Encoding on Length Generalization in Transformers, 37th Conference on Neural Information Processing Systems (NeurIPS 2023).
exports/lightgpt-small.onnx CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:da678557a217a65910e837d53bb432cdf2e0178253ad4bd8feff99dcd9c1a238
3
  size 1414537243
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:63d92b1899c7c88154552d3d26f8d8c50ff3668af3e95d25a080b6463b37c581
3
  size 1414537243
exports/lightgpt-small.safetensors CHANGED
@@ -1,3 +1,3 @@
1
  version https://git-lfs.github.com/spec/v1
2
- oid sha256:548812323a768536533086c5f6a7b39ea658500b4a89a4b8d3612e4fb8c58421
3
  size 1414029160
 
1
  version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2e59c20457013b18bc3851b8de70cc0cae2e995ad9254c39eaeded25ab583832
3
  size 1414029160
generate.py CHANGED
@@ -22,7 +22,7 @@ def main():
22
  "--checkpoint_path", default="./checkpoints/checkpoint.pt", type=str
23
  )
24
  parser.add_argument("--lora_path", default=None, type=str)
25
- parser.add_argument("--max_tokens", default=2000, type=int)
26
  parser.add_argument("--context_length", default=1024, type=int)
27
  parser.add_argument("--temperature", default=1.0, type=float)
28
  parser.add_argument("--top_k", default=500, type=int)
 
22
  "--checkpoint_path", default="./checkpoints/checkpoint.pt", type=str
23
  )
24
  parser.add_argument("--lora_path", default=None, type=str)
25
+ parser.add_argument("--max_tokens", default=1000, type=int)
26
  parser.add_argument("--context_length", default=1024, type=int)
27
  parser.add_argument("--temperature", default=1.0, type=float)
28
  parser.add_argument("--top_k", default=500, type=int)