Andrew DalPino committed on
Commit 160e81f · 1 Parent(s): 8c4359f

Cleanup checkpointing history

Files changed (3):
  1. README.md +3 -3
  2. model.py +1 -1
  3. pre-train.py +1 -9
README.md CHANGED
@@ -45,10 +45,10 @@ python pre-train.py
 
 > Note that it will take a while to download and pre-process the dataset the first time that the training script is run.
 
-To customize the default "lightgpt-small" architecture you can adjust the `block_size`, `embedding_dimensions`, and `num_layers` arguments of the pre-training script. Refer to the `model_sizing.ipynb` notebook for an estimation of the memory and compute requirements for a particular architecture.
+To customize the default "lightgpt-small" architecture you can adjust the `block_size`, `embedding_dimensions`, `num_hidden_layers`, and `num_attention_heads` arguments of the pre-training script. Refer to the `model_sizing.ipynb` notebook for an estimation of the memory and compute requirements for your chosen architecture.
 
 ```
-python pre-train.py --block_size=2048 --embedding_dimensions=4096 --num_hidden_layers=64
+python pre-train.py --block_size=2048 --embedding_dimensions=4096 --num_hidden_layers=64 --num_attention_heads=64
 ```
 
 You can also adjust the `batch_size`, `learning_rate`, and `gradient_accumulation_steps` to suite your training setup.
@@ -57,7 +57,7 @@ You can also adjust the `batch_size`, `learning_rate`, and `gradient_accumulatio
 python pre-train.py --batch_size=32 --learning_rate=0.01 --gradient_accumulation_steps=128
 ```
 
-For distributed training use PyTorch's [torchrun](https://pytorch.org/docs/stable/elastic/run.html) extension to launch a distributed data parallel session. The example below is for executing the training script on a single node with individual 8 GPUs.
+For distributed training, use PyTorch's [torchrun](https://pytorch.org/docs/stable/elastic/run.html) extension to launch a distributed data parallel session. The example below is for executing the training script on a single node with individual 8 GPUs.
 
 ```
 torchrun --standalone --nnodes=1 --nproc-per-node=8 pre-train.py --batch_size=16 --gradient_accumulation_steps=128
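
For orientation, here is a minimal sketch of how the flags documented above could be wired from `argparse` into the model constructor. The flag names follow the README and the defaults mirror the `model.py` signature below, but the mapping of `--num_attention_heads` onto the model's `num_heads` parameter, the defaults, and the import path are assumptions for illustration, not the repository's actual wiring.

```
# Hypothetical wiring of the README's architecture flags into the GPT constructor.
# The --num_attention_heads -> num_heads mapping, the defaults, and the import
# path are assumptions; only the flag and parameter names come from this commit.
from argparse import ArgumentParser

from model import GPT

parser = ArgumentParser()

parser.add_argument("--block_size", default=1024, type=int)
parser.add_argument("--embedding_dimensions", default=1024, type=int)
parser.add_argument("--num_hidden_layers", default=32, type=int)
parser.add_argument("--num_attention_heads", default=16, type=int)

args = parser.parse_args()

model = GPT(
    block_size=args.block_size,
    embedding_dimensions=args.embedding_dimensions,
    num_heads=args.num_attention_heads,  # assumed mapping onto model.py's num_heads
    num_layers=args.num_hidden_layers,
)
```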
model.py CHANGED
@@ -34,7 +34,7 @@ class GPT(Module):
         block_size: int = 1024,
         embedding_dimensions: int = 1024,
         num_heads: int = 16,
-        num_layers: int = 24,
+        num_layers: int = 32,
         dropout: float = 0.1,
         activation_checkpointing: bool = False,
         vocabulary_size: int = 50257,
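
As a quick sanity check of the new default depth, a hedged sketch that constructs the model with the signature shown above and counts its parameters; the keyword names and defaults come from the diff, while the import path and the parameter count snippet are assumptions.

```
# Sketch only: build GPT with the defaults visible in the diff above.
# The `from model import GPT` path is assumed; keyword names match the signature.
from model import GPT

model = GPT(
    block_size=1024,
    embedding_dimensions=1024,
    num_heads=16,
    num_layers=32,  # new default depth introduced in this commit
    dropout=0.1,
    activation_checkpointing=False,
    vocabulary_size=50257,  # GPT-2 BPE vocabulary
)

# Rough parameter count, useful alongside the model_sizing.ipynb estimates.
num_params = sum(p.numel() for p in model.parameters())

print(f"Parameters: {num_params:,}")
```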
pre-train.py CHANGED
@@ -54,7 +54,6 @@ def main():
     parser.add_argument("--eval_interval", default=10, type=int)
     parser.add_argument("--checkpoint_interval", default=20, type=int)
     parser.add_argument("--checkpoint_path", default="./out/checkpoint.pt", type=str)
-    parser.add_argument("--checkpoint_history", action="store_true")
     parser.add_argument("--resume", action="store_true")
     parser.add_argument("--dataset_path", default="./dataset", type=str)
     parser.add_argument("--num_dataset_processes", default=8, type=int)
@@ -290,14 +289,7 @@ def main():
                 "optimizer": optimizer.state_dict(),
             }
 
-            if args.checkpoint_history:
-                root, ext = path.splitext(args.checkpoint_path)
-
-                checkpoint_path = f"{root}-{epoch}{ext}"
-            else:
-                checkpoint_path = args.checkpoint_path
-
-            torch.save(checkpoint, checkpoint_path)
+            torch.save(checkpoint, args.checkpoint_path)
 
             print("Checkpoint saved")
 
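
To make the simplified behavior concrete, here is a minimal sketch of saving and resuming a single rolling checkpoint at a fixed path, in the spirit of the change above. Only `checkpoint_path`, the `"optimizer"` state dict entry, and the "Checkpoint saved" message are taken from the diff; the `"model"` entry, the function names, and the resume logic are assumptions.

```
# Minimal sketch: one rolling checkpoint file, overwritten at every checkpoint
# interval, replacing the removed per-epoch --checkpoint_history behavior.
# Function names and the "model" key are assumptions for illustration.
import torch


def save_checkpoint(model, optimizer, checkpoint_path="./out/checkpoint.pt"):
    checkpoint = {
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }

    torch.save(checkpoint, checkpoint_path)

    print("Checkpoint saved")


def resume_from_checkpoint(model, optimizer, checkpoint_path="./out/checkpoint.pt"):
    checkpoint = torch.load(checkpoint_path, map_location="cpu")

    model.load_state_dict(checkpoint["model"])
    optimizer.load_state_dict(checkpoint["optimizer"])
```

Overwriting a single file keeps disk usage bounded; the trade-off is losing the ability to roll back to an earlier epoch, which is exactly what the removed `--checkpoint_history` flag used to provide.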