Andrew DalPino
committed on
Commit · 3325763
1 Parent(s): 0cc4ecd
Use Fineweb instead of Openwebtext
Browse files:
- README.md +27 -12
- data.py +48 -32
- model_sizing.ipynb +27 -22
- pre-train.py +23 -9
README.md
CHANGED
@@ -14,19 +14,32 @@ tags:
|
|
14 |
---
|
15 |
# LightGPT
|
16 |
|
17 |
-
LightGPT is a lightweight generative pre-trained Transformer (GPT) model for the people! Built using pure PyTorch, LightGPT can
|
18 |
|
19 |
## Features
|
20 |
|
21 |
- **Parameter-efficiency**: LightGPT aims to be a more parsimonious model by only training parameters that are absolutely necessary. As such, biases and positional embeddings have been completely removed from the architecture. In addition, the token embeddings and output layer share weight matrices resulting in a buy-one-get-one-free deal on trainable parameters.
|
22 |
|
23 |
-
- **Low Memory Utilization**: LightGPT employs a number of training-time optimizations that conserve precious
|
24 |
|
25 |
- **Fully Open-source**: Unlike closed-source LLMs, LightGPT provides both the model weights *and* the source code to train, fine-tune, and generate text from the model using your own hardware. With the help of the open-source software community, we aim to democratize AI and continually improve the models.
|
26 |
|
27 |
## Install Project Dependencies
|
28 |
|
29 |
-
Project dependencies are specified in the `requirements.txt` file. You can install them with [pip](https://pip.pypa.io/en/stable/) using the following command from the project root.
|
30 |
|
31 |
```
|
32 |
python -m venv ./.venv
|
@@ -38,7 +51,7 @@ pip install -r requirements.txt
|
|
38 |
|
39 |
## Pre-training
|
40 |
|
41 |
-
For the pre-training corpus we use the
|
42 |
|
43 |
```
|
44 |
python pre-train.py
|
@@ -88,26 +101,28 @@ Soon ...
|
|
88 |
|
89 |
| Argument | Default | Type | Description |
|
90 |
|---|---|---|---|
|
|
91 |
| --batch_size | 1 | int | The number of samples to pass through the network at a time. |
|
92 |
| --gradient_accumulation_steps | 128 | int | The number of batches to pass through the network before updating the weights. |
|
93 |
| --samples_per_epoch | 4096 | int | The number of training samples to pass through the network every epoch. |
|
94 |
| --learning_rate | 5e-4 | float | The global step size taken after every gradient accumulation step. |
|
95 |
| --max_gradient_norm | 1.0 | float | Clip gradients above this threshold before stepping. |
|
96 |
-
| --num_epochs |
|
97 |
| --eval_interval | 10 | int | Evaluate the model after this many epochs on the testing set. |
|
98 |
| --block_size | 1024 | int | The number of tokens within the context window for every sample. |
|
99 |
| --embedding_dimensions | 1024 | int | The dimensionality of the token embeddings. |
|
100 |
| --num_attention_heads | 16 | int | The number of attention heads within every block. |
|
101 |
-
| --num_hidden_layers |
|
102 |
| --dropout | 0.1 | float | The proportion of signals to send to zero during training as regularization. |
|
103 |
-
| --activation_checkpointing | False | bool | Should we use activation checkpointing? |
|
104 |
-
| --ddp_sharding_level | 2 |
|
105 |
| --checkpoint_interval | 20 | int | Save the model parameters to disk every this many epochs. |
|
106 |
-
| --checkpoint_path | "./out/checkpoint.pt" |
|
107 |
-
| --dataset_path | "./dataset" | string | The path to the dataset files on disk. |
|
108 |
-
| --num_dataset_processes | 8 | int | The number of processes (CPUs) to use to process the dataset. |
|
109 |
| --resume | False | bool | Should we resume training from the last checkpoint? |
|
110 |
-
| --device | "cuda" |
|
111 |
| --seed | None | int | The seed for the random number generator. |
|
112 |
|
113 |
### Instruction-tuning Arguments
|
|
|
14 |
---
|
15 |
# LightGPT
|
16 |
|
17 |
+
LightGPT is a lightweight generative pre-trained Transformer (GPT) model for the people! Built using pure PyTorch, LightGPT can answer questions, summarize documents, chat, and more. A unique feature of LightGPT is that you can train larger models on smaller hardware by progressively enabling memory-saving features at train time, such as activation checkpointing, mixed precision, and ZeRO-redundancy distributed pre-training using fully-sharded data parallel (FSDP).
|
18 |
|
19 |
## Features
|
20 |
|
21 |
- **Parameter-efficiency**: LightGPT aims to be a more parsimonious model by only training parameters that are absolutely necessary. As such, biases and positional embeddings have been completely removed from the architecture. In addition, the token embeddings and output layer share weight matrices resulting in a buy-one-get-one-free deal on trainable parameters.
|
22 |
|
23 |
+
- **Low Memory Utilization**: LightGPT employs a number of training-time optimizations that conserve precious GPU memory. With zero-redundancy distributed pre-training using fully-sharded data parallel (FSDP), activation checkpointing, and automatic mixed precision, you'll be able to train larger models while accepting only a relatively small amount of runtime overhead (see the sketch after this list).
|
24 |
|
25 |
- **Fully Open-source**: Unlike closed-source LLMs, LightGPT provides both the model weights *and* the source code to train, fine-tune, and generate text from the model using your own hardware. With the help of the open-source software community, we aim to democratize AI and continually improve the models.
|
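As a rough illustration of how these memory-saving features fit together in stock PyTorch, here is a minimal sketch using a stand-in module rather than LightGPT's own `GPT` class; the real wiring lives in `pre-train.py` and is controlled by its flags:

```
# Minimal sketch only: a toy module demonstrating activation checkpointing and
# automatic mixed precision; the model, sizes, and dtypes here are stand-ins.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class TinyBlock(nn.Module):
    """Stand-in for a transformer block."""

    def __init__(self, dims: int):
        super().__init__()

        self.mlp = nn.Sequential(
            nn.Linear(dims, 4 * dims), nn.GELU(), nn.Linear(4 * dims, dims)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Activation checkpointing: recompute activations during the backward
        # pass instead of storing them, trading extra compute for less memory.
        return x + checkpoint(self.mlp, x, use_reentrant=False)


model = nn.Sequential(*(TinyBlock(64) for _ in range(4)))

x = torch.randn(8, 64)

# Automatic mixed precision: run the forward pass in a lower-precision dtype.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    loss = model(x).square().mean()

loss.backward()

# For ZeRO-redundancy sharding, the model would additionally be wrapped in
# torch.distributed.fsdp.FullyShardedDataParallel after init_process_group(),
# which is what the --ddp_sharding_level flag of pre-train.py selects.
```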
26 |
|
27 |
+
## Default Configurations
|
28 |
+
|
29 |
+
Below is a table of recommended default model training configurations, but feel free to experiment with settings on your own. See the `model_sizing.ipynb` notebook to estimate the memory and compute requirements for your model configuration.
|
30 |
+
|
31 |
+
| Name | Vocab. Size | Block Size | Embedding Dim. | Attn. Heads | Layers | Params | Train Tokens |
|
32 |
+
|---|---|---|---|---|---|---|---|
|
33 |
+
| Small | 50,257 | 1024 | 1024 | 16 | 32 | 454M | 10B |
|
34 |
+
| Medium | 50,257 | 1024 | 2048 | 32 | 32 | 1.7B | 20B |
|
35 |
+
| Large | 100,275 | 2048 | 4096 | 64 | 32 | 6.8B | 100B |
|
36 |
+
| X-large | 100,275 | 2048 | 4096 | 64 | 64 | 13B | 350B |
|
37 |
+
| XX-large | 200,017 | 4096 | 8192 | 128 | 64 | 53B | 1T |
|
38 |
+
| XXX-large | 200,017 | 4096 | 8192 | 128 | 128 | 105B | 3T |
|
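As a sanity check on the parameter counts above, a back-of-the-envelope estimate reproduces them closely. The formula below assumes tied input/output embeddings and no biases (as described under Features), plus a 4x MLP expansion with normalization weights ignored; those last two are assumptions rather than facts taken from `model.py`:

```
# Rough parameter-count estimate for the configuration table above.
# Assumptions: tied embeddings, no biases, 4x MLP expansion, norms ignored.
def approx_params(vocab_size: int, embed_dim: int, num_layers: int) -> int:
    embedding = vocab_size * embed_dim                 # shared with the output layer
    per_layer = 4 * embed_dim**2 + 8 * embed_dim**2    # attention (q, k, v, out) + MLP
    return embedding + num_layers * per_layer


print(f"{approx_params(50257, 1024, 32) / 1e6:.0f}M")  # ~454M (Small)
print(f"{approx_params(50257, 2048, 32) / 1e9:.1f}B")  # ~1.7B (Medium)
```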
39 |
+
|
40 |
## Install Project Dependencies
|
41 |
|
42 |
+
Project dependencies are specified in the `requirements.txt` file. You can install them with [pip](https://pip.pypa.io/en/stable/) using the following commands from the project root. We recommend using a virtual environment such as `venv` to keep package dependencies on your system tidy.
|
43 |
|
44 |
```
|
45 |
python -m venv ./.venv
|
|
|
51 |
|
52 |
## Pre-training
|
53 |
|
54 |
+
For the pre-training corpus we use the Fineweb dataset, which consists of about 15T high-quality tokens gathered from the web. The dataset has been split into three subsets (10BT, 100BT, and 350BT token versions) for training smaller models. If you'd like to start training right away, the default settings should work on most single-GPU systems with 12GB of VRAM or more.
|
55 |
|
56 |
```
|
57 |
python pre-train.py
|
|
|
101 |
|
102 |
| Argument | Default | Type | Description |
|
103 |
|---|---|---|---|
|
104 |
+
| --dataset_subset | "sample-10BT" | str | The subset of the Fineweb dataset to train on. Options are `sample-10BT`, `sample-100BT`, and `sample-350BT`. Set to `None` to train on the full 15T token dataset. |
|
105 |
+
| --token_encoding | "r50k_base" | str | The encoding scheme to use when tokenizing the dataset. Options include `r50k_base`, `cl100k_base`, and `o200k_base`. |
|
106 |
+
| --dataset_path | "./dataset" | str | The path to the preprocessed dataset files on disk. |
|
107 |
+
| --num_dataset_processes | 8 | int | The number of processes (CPUs) to use to process the dataset. |
|
108 |
| --batch_size | 1 | int | The number of samples to pass through the network at a time. |
|
109 |
| --gradient_accumulation_steps | 128 | int | The number of batches to pass through the network before updating the weights. |
|
110 |
| --samples_per_epoch | 4096 | int | The number of training samples to pass through the network every epoch. |
|
111 |
| --learning_rate | 5e-4 | float | The global step size taken after every gradient accumulation step. |
|
112 |
| --max_gradient_norm | 1.0 | float | Clip gradients above this threshold before stepping. |
|
113 |
+
| --num_epochs | 2384 | int | The number of epochs to train for. |
|
114 |
| --eval_interval | 10 | int | Evaluate the model after this many epochs on the testing set. |
|
115 |
| --block_size | 1024 | int | The number of tokens within the context window for every sample. |
|
116 |
| --embedding_dimensions | 1024 | int | The dimensionality of the token embeddings. |
|
117 |
| --num_attention_heads | 16 | int | The number of attention heads within every block. |
|
118 |
+
| --num_hidden_layers | 32 | int | The number of attention/MLP blocks within the hidden layer of the network. |
|
119 |
| --dropout | 0.1 | float | The proportion of signals to send to zero during training as regularization. |
|
120 |
+
| --activation_checkpointing | False | bool | Should we use activation checkpointing? This will drastically reduce memory utilization at the cost of about 30% more runtime per epoch. |
|
121 |
+
| --ddp_sharding_level | 2 | int | The level of sharding to use for DDP training. Options are 2 or 3 for partial and full sharding respectively, or 0 for no sharding. |
|
122 |
| --checkpoint_interval | 20 | int | Save the model parameters to disk every this many epochs. |
|
123 |
+
| --checkpoint_path | "./out/checkpoint.pt" | str | The path to the checkpoint file on disk. |
|
|
|
|
|
124 |
| --resume | False | bool | Should we resume training from the last checkpoint? |
|
125 |
+
| --device | "cuda" | str | The device to run the computation on. |
|
126 |
| --seed | None | int | The seed for the random number generator. |
|
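For example, a single-GPU run over the 10BT Fineweb sample with the GPT-2 tokenizer and activation checkpointing enabled might look like the following; the flag values are illustrative rather than prescriptive:

```
python pre-train.py --dataset_subset="sample-10BT" \
    --token_encoding="r50k_base" \
    --batch_size=1 \
    --gradient_accumulation_steps=128 \
    --activation_checkpointing
```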
127 |
|
128 |
### Instruction-tuning Arguments
|
data.py
CHANGED
@@ -18,56 +18,68 @@ from torch.nn.utils.rnn import pad_sequence
|
|
18 |
from tqdm import tqdm
|
19 |
|
20 |
|
21 |
-
class
|
22 |
-
DATASET_NAME = "
|
23 |
-
|
24 |
-
FILE_PREFIX = DATASET_NAME
|
25 |
-
|
26 |
-
TRAIN_FILENAME = f"{FILE_PREFIX}-train.bin"
|
27 |
-
TEST_FILENAME = f"{FILE_PREFIX}-test.bin"
|
28 |
|
29 |
TEST_SPLIT_PROPORTION = 0.005
|
30 |
NUM_SHARDS = 1024
|
31 |
|
32 |
-
ENCODING = "r50k_base"
|
33 |
-
|
34 |
PADDING_INDEX = -100
|
35 |
|
36 |
def __init__(
|
37 |
self,
|
38 |
-
root_path: str,
|
39 |
-
|
|
|
40 |
tokens_per_sample: int = 1024,
|
41 |
samples_per_epoch: int = 4096,
|
|
|
42 |
num_processes: int = 8,
|
43 |
):
|
44 |
super().__init__()
|
45 |
|
|
46 |
if tokens_per_sample < 1:
|
47 |
raise ValueError(f"Tokens per sample must be greater than 0.")
|
48 |
|
49 |
if samples_per_epoch < 1:
|
50 |
raise ValueError(f"Samples per epoch must be greater than 0.")
|
51 |
|
52 |
-
|
53 |
-
|
54 |
|
55 |
-
self.tokenizer = tiktoken.get_encoding(
|
56 |
|
57 |
if not path.exists(train_path) or not path.exists(test_path):
|
58 |
-
|
59 |
-
|
60 |
-
|
61 |
-
|
62 |
-
|
63 |
-
|
64 |
-
|
65 |
-
|
66 |
-
|
67 |
)
|
68 |
|
69 |
for split, dataset in tokenized_splits.items():
|
70 |
-
bin_path =
|
71 |
|
72 |
total_length = np.sum(dataset["length"], dtype=np.uint64)
|
73 |
|
@@ -92,9 +104,7 @@ class Openwebtext(IterableDataset):
|
|
92 |
|
93 |
bin_out.flush()
|
94 |
|
95 |
-
bin_file_path =
|
96 |
-
root_path, self.TRAIN_FILENAME if train else self.TEST_FILENAME
|
97 |
-
)
|
98 |
|
99 |
memmap = np.memmap(bin_file_path, dtype=np.uint16, mode="r")
|
100 |
|
@@ -140,8 +150,6 @@ class Openwebtext(IterableDataset):
|
|
140 |
class Alpaca(Dataset):
|
141 |
DATASET_NAME = "tatsu-lab/alpaca"
|
142 |
|
143 |
-
ENCODING = "r50k_base"
|
144 |
-
|
145 |
PADDING_INDEX = -100
|
146 |
|
147 |
PROMPT_TEMPLATE = (
|
@@ -162,7 +170,12 @@ class Alpaca(Dataset):
|
|
162 |
|
163 |
RESPONSE_TEMPLATE = "{output}"
|
164 |
|
165 |
-
def __init__(
|
|
166 |
super().__init__()
|
167 |
|
168 |
if max_tokens_per_sample < 1:
|
@@ -170,9 +183,12 @@ class Alpaca(Dataset):
|
|
170 |
f"Max tokens per sample must be greater than 0, {max_tokens_per_sample} given."
|
171 |
)
|
172 |
|
173 |
-
|
|
|
174 |
|
175 |
-
self.tokenizer = tiktoken.get_encoding(
|
|
|
|
|
176 |
|
177 |
self.max_tokens_per_sample = max_tokens_per_sample
|
178 |
self.mask_input = mask_input
|
|
|
18 |
from tqdm import tqdm
|
19 |
|
20 |
|
21 |
+
class Fineweb(IterableDataset):
|
22 |
+
DATASET_NAME = "HuggingFaceFW/fineweb"
|
|
23 |
|
24 |
TEST_SPLIT_PROPORTION = 0.005
|
25 |
NUM_SHARDS = 1024
|
26 |
|
|
|
|
|
27 |
PADDING_INDEX = -100
|
28 |
|
29 |
def __init__(
|
30 |
self,
|
31 |
+
root_path: str = "./dataset",
|
32 |
+
subset: str | None = "sample-10BT",
|
33 |
+
split: str = "train",
|
34 |
tokens_per_sample: int = 1024,
|
35 |
samples_per_epoch: int = 4096,
|
36 |
+
token_encoding: str = "r50k_base",
|
37 |
num_processes: int = 8,
|
38 |
):
|
39 |
super().__init__()
|
40 |
|
41 |
+
if subset != None:
|
42 |
+
if subset not in ("sample-10BT", "sample-100BT", "sample-350BT"):
|
43 |
+
raise ValueError(f"Invalid subset, {subset} given.")
|
44 |
+
|
45 |
+
if split not in ("train", "test"):
|
46 |
+
raise ValueError(f"Split must be either train or test, {split} given.")
|
47 |
+
|
48 |
if tokens_per_sample < 1:
|
49 |
raise ValueError(f"Tokens per sample must be greater than 0.")
|
50 |
|
51 |
if samples_per_epoch < 1:
|
52 |
raise ValueError(f"Samples per epoch must be greater than 0.")
|
53 |
|
54 |
+
if token_encoding not in ("r50k_base", "cl100k_base", "o200k_base"):
|
55 |
+
raise ValueError(f"Invalid token encoding, {token_encoding} given.")
|
56 |
|
57 |
+
self.tokenizer = tiktoken.get_encoding(token_encoding)
|
58 |
+
|
59 |
+
dataset_name = f"fineweb-{subset}" if subset != None else "fineweb"
|
60 |
+
|
61 |
+
train_path = path.join(root_path, f"{dataset_name}-train-{token_encoding}.bin")
|
62 |
+
test_path = path.join(root_path, f"{dataset_name}-test-{token_encoding}.bin")
|
63 |
|
64 |
if not path.exists(train_path) or not path.exists(test_path):
|
65 |
+
dataset = load_dataset(
|
66 |
+
self.DATASET_NAME,
|
67 |
+
name=subset,
|
68 |
+
num_proc=num_processes,
|
69 |
+
split="train",
|
70 |
+
).map(
|
71 |
+
self.tokenize,
|
72 |
+
desc="Tokenizing",
|
73 |
+
remove_columns=["text"],
|
74 |
+
num_proc=num_processes,
|
75 |
+
)
|
76 |
+
|
77 |
+
tokenized_splits = dataset.train_test_split(
|
78 |
+
test_size=self.TEST_SPLIT_PROPORTION
|
79 |
)
|
80 |
|
81 |
for split, dataset in tokenized_splits.items():
|
82 |
+
bin_path = train_path if split == "train" else test_path
|
83 |
|
84 |
total_length = np.sum(dataset["length"], dtype=np.uint64)
|
85 |
|
|
|
104 |
|
105 |
bin_out.flush()
|
106 |
|
107 |
+
bin_file_path = train_path if split == "train" else test_path
|
|
|
|
|
108 |
|
109 |
memmap = np.memmap(bin_file_path, dtype=np.uint16, mode="r")
|
110 |
|
|
|
150 |
class Alpaca(Dataset):
|
151 |
DATASET_NAME = "tatsu-lab/alpaca"
|
152 |
|
|
|
|
|
153 |
PADDING_INDEX = -100
|
154 |
|
155 |
PROMPT_TEMPLATE = (
|
|
|
170 |
|
171 |
RESPONSE_TEMPLATE = "{output}"
|
172 |
|
173 |
+
def __init__(
|
174 |
+
self,
|
175 |
+
max_tokens_per_sample: int = 1024,
|
176 |
+
token_encoding: str = "r50k_base",
|
177 |
+
mask_input: bool = True,
|
178 |
+
):
|
179 |
super().__init__()
|
180 |
|
181 |
if max_tokens_per_sample < 1:
|
|
|
183 |
f"Max tokens per sample must be greater than 0, {max_tokens_per_sample} given."
|
184 |
)
|
185 |
|
186 |
+
if token_encoding not in ("r50k_base", "cl100k_base", "o200k_base"):
|
187 |
+
raise ValueError(f"Invalid token encoding, {token_encoding} given.")
|
188 |
|
189 |
+
self.tokenizer = tiktoken.get_encoding(token_encoding)
|
190 |
+
|
191 |
+
self.dataset = load_dataset(self.DATASET_NAME, split="train")
|
192 |
|
193 |
self.max_tokens_per_sample = max_tokens_per_sample
|
194 |
self.mask_input = mask_input
|
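A quick usage sketch of the new `Fineweb` dataset class follows; the constructor arguments match those introduced in this diff, while the `DataLoader` settings are illustrative:

```
# Illustrative only: build the Fineweb IterableDataset added in this commit and
# wrap it in a DataLoader, roughly mirroring what pre-train.py does.
from torch.utils.data import DataLoader

from data import Fineweb

training = Fineweb(
    root_path="./dataset",
    subset="sample-10BT",  # "sample-100BT", "sample-350BT", or None for the full corpus
    split="train",
    tokens_per_sample=1024,
    samples_per_epoch=4096,
    token_encoding="r50k_base",
    num_processes=8,
)  # the first construction tokenizes the corpus and caches .bin files under root_path

loader = DataLoader(training, batch_size=1)  # IterableDataset, so no shuffle argument

batch = next(iter(loader))
```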
model_sizing.ipynb
CHANGED
@@ -9,7 +9,7 @@
|
|
9 |
},
|
10 |
{
|
11 |
"cell_type": "code",
|
12 |
-
"execution_count":
|
13 |
"metadata": {},
|
14 |
"outputs": [],
|
15 |
"source": [
|
@@ -17,7 +17,9 @@
|
|
17 |
"vocabulary_size = 50257\n",
|
18 |
"embedding_dimensions = 1024\n",
|
19 |
"num_attention_heads = 16\n",
|
20 |
-
"num_hidden_layers = 32"
|
|
21 |
]
|
22 |
},
|
23 |
{
|
@@ -29,7 +31,7 @@
|
|
29 |
},
|
30 |
{
|
31 |
"cell_type": "code",
|
32 |
-
"execution_count":
|
33 |
"metadata": {},
|
34 |
"outputs": [
|
35 |
{
|
@@ -95,14 +97,14 @@
|
|
95 |
},
|
96 |
{
|
97 |
"cell_type": "code",
|
98 |
-
"execution_count":
|
99 |
"metadata": {},
|
100 |
"outputs": [
|
101 |
{
|
102 |
"name": "stdout",
|
103 |
"output_type": "stream",
|
104 |
"text": [
|
105 |
-
"Total gigabytes: 1.
|
106 |
]
|
107 |
}
|
108 |
],
|
@@ -113,7 +115,7 @@
|
|
113 |
"\n",
|
114 |
"total_gigabytes = total_bytes / 1e9\n",
|
115 |
"\n",
|
116 |
-
"print(f\"Total gigabytes: {total_gigabytes:,.2f}\")"
|
117 |
]
|
118 |
},
|
119 |
{
|
@@ -125,7 +127,7 @@
|
|
125 |
},
|
126 |
{
|
127 |
"cell_type": "code",
|
128 |
-
"execution_count":
|
129 |
"metadata": {},
|
130 |
"outputs": [
|
131 |
{
|
@@ -220,7 +222,7 @@
|
|
220 |
},
|
221 |
{
|
222 |
"cell_type": "code",
|
223 |
-
"execution_count":
|
224 |
"metadata": {},
|
225 |
"outputs": [
|
226 |
{
|
@@ -246,7 +248,7 @@
|
|
246 |
},
|
247 |
{
|
248 |
"cell_type": "code",
|
249 |
-
"execution_count":
|
250 |
"metadata": {},
|
251 |
"outputs": [
|
252 |
{
|
@@ -272,7 +274,7 @@
|
|
272 |
},
|
273 |
{
|
274 |
"cell_type": "code",
|
275 |
-
"execution_count":
|
276 |
"metadata": {},
|
277 |
"outputs": [
|
278 |
{
|
@@ -300,7 +302,7 @@
|
|
300 |
},
|
301 |
{
|
302 |
"cell_type": "code",
|
303 |
-
"execution_count":
|
304 |
"metadata": {},
|
305 |
"outputs": [
|
306 |
{
|
@@ -308,8 +310,10 @@
|
|
308 |
"output_type": "stream",
|
309 |
"text": [
|
310 |
"RTX A2000 MFU: 17.29%\n",
|
|
|
311 |
"RTX 3090 MFU: 22.99%\n",
|
312 |
-
"A100 SXM MFU: 37.16%\n"
|
|
|
313 |
]
|
314 |
}
|
315 |
],
|
@@ -332,8 +336,10 @@
|
|
332 |
"\n",
|
333 |
"devices = [\n",
|
334 |
" Device(\"RTX A2000\", 63.9e12, 3.45 * total_roundtrip_flops),\n",
|
|
|
335 |
" Device(\"RTX 3090\", 285.5e12, 20.5 * total_roundtrip_flops),\n",
|
336 |
" Device(\"A100 SXM\", 624.0e12, 72.4 * total_roundtrip_flops),\n",
|
|
|
337 |
"]\n",
|
338 |
"\n",
|
339 |
"for device in devices:\n",
|
@@ -344,31 +350,30 @@
|
|
344 |
"cell_type": "markdown",
|
345 |
"metadata": {},
|
346 |
"source": [
|
347 |
-
"
|
348 |
]
|
349 |
},
|
350 |
{
|
351 |
"cell_type": "code",
|
352 |
-
"execution_count":
|
353 |
"metadata": {},
|
354 |
"outputs": [
|
355 |
{
|
356 |
"name": "stdout",
|
357 |
"output_type": "stream",
|
358 |
"text": [
|
359 |
-
"Total tokens:
|
360 |
-
"Epochs required: 2,
|
361 |
"\n",
|
362 |
-
"RTX A2000: 1187.25 seconds/epoch,
|
363 |
-
"RTX
|
364 |
-
"
|
|
|
|
|
365 |
]
|
366 |
}
|
367 |
],
|
368 |
"source": [
|
369 |
-
"num_training_tokens = 8994885755\n",
|
370 |
-
"samples_per_epoch = 4096\n",
|
371 |
-
"\n",
|
372 |
"num_epochs_required = round(num_training_tokens / (samples_per_epoch * block_size))\n",
|
373 |
"\n",
|
374 |
"print(f\"Total tokens: {num_training_tokens:,}\")\n",
|
|
|
9 |
},
|
10 |
{
|
11 |
"cell_type": "code",
|
12 |
+
"execution_count": 252,
|
13 |
"metadata": {},
|
14 |
"outputs": [],
|
15 |
"source": [
|
|
|
17 |
"vocabulary_size = 50257\n",
|
18 |
"embedding_dimensions = 1024\n",
|
19 |
"num_attention_heads = 16\n",
|
20 |
+
"num_hidden_layers = 32\n",
|
21 |
+
"num_training_tokens = 10e9\n",
|
22 |
+
"samples_per_epoch = 4096"
|
23 |
]
|
24 |
},
|
25 |
{
|
|
|
31 |
},
|
32 |
{
|
33 |
"cell_type": "code",
|
34 |
+
"execution_count": 253,
|
35 |
"metadata": {},
|
36 |
"outputs": [
|
37 |
{
|
|
|
97 |
},
|
98 |
{
|
99 |
"cell_type": "code",
|
100 |
+
"execution_count": 254,
|
101 |
"metadata": {},
|
102 |
"outputs": [
|
103 |
{
|
104 |
"name": "stdout",
|
105 |
"output_type": "stream",
|
106 |
"text": [
|
107 |
+
"Total gigabytes: 1.82G\n"
|
108 |
]
|
109 |
}
|
110 |
],
|
|
|
115 |
"\n",
|
116 |
"total_gigabytes = total_bytes / 1e9\n",
|
117 |
"\n",
|
118 |
+
"print(f\"Total gigabytes: {total_gigabytes:,.2f}G\")"
|
119 |
]
|
120 |
},
|
121 |
{
|
|
|
127 |
},
|
128 |
{
|
129 |
"cell_type": "code",
|
130 |
+
"execution_count": 255,
|
131 |
"metadata": {},
|
132 |
"outputs": [
|
133 |
{
|
|
|
222 |
},
|
223 |
{
|
224 |
"cell_type": "code",
|
225 |
+
"execution_count": 256,
|
226 |
"metadata": {},
|
227 |
"outputs": [
|
228 |
{
|
|
|
248 |
},
|
249 |
{
|
250 |
"cell_type": "code",
|
251 |
+
"execution_count": 257,
|
252 |
"metadata": {},
|
253 |
"outputs": [
|
254 |
{
|
|
|
274 |
},
|
275 |
{
|
276 |
"cell_type": "code",
|
277 |
+
"execution_count": 258,
|
278 |
"metadata": {},
|
279 |
"outputs": [
|
280 |
{
|
|
|
302 |
},
|
303 |
{
|
304 |
"cell_type": "code",
|
305 |
+
"execution_count": 259,
|
306 |
"metadata": {},
|
307 |
"outputs": [
|
308 |
{
|
|
|
310 |
"output_type": "stream",
|
311 |
"text": [
|
312 |
"RTX A2000 MFU: 17.29%\n",
|
313 |
+
"RTX A4000 MFU: 19.00%\n",
|
314 |
"RTX 3090 MFU: 22.99%\n",
|
315 |
+
"A100 SXM MFU: 37.16%\n",
|
316 |
+
"HGX A100 MFU: 37.16%\n"
|
317 |
]
|
318 |
}
|
319 |
],
|
|
|
336 |
"\n",
|
337 |
"devices = [\n",
|
338 |
" Device(\"RTX A2000\", 63.9e12, 3.45 * total_roundtrip_flops),\n",
|
339 |
+
" Device(\"RTX A4000\", 153.4e12, 9.1 * total_roundtrip_flops),\n",
|
340 |
" Device(\"RTX 3090\", 285.5e12, 20.5 * total_roundtrip_flops),\n",
|
341 |
" Device(\"A100 SXM\", 624.0e12, 72.4 * total_roundtrip_flops),\n",
|
342 |
+
" Device(\"HGX A100\", 4992e12, 579.2 * total_roundtrip_flops),\n",
|
343 |
"]\n",
|
344 |
"\n",
|
345 |
"for device in devices:\n",
|
|
|
350 |
"cell_type": "markdown",
|
351 |
"metadata": {},
|
352 |
"source": [
|
353 |
+
"Finally, let's estimate how long it would take to train over every sample in the Openwebtext training set at least once in expectation. Note that these results shown here are a theoretical scenario and do not factor in additional overhead such as activation checkpointing or network latency."
|
354 |
]
|
355 |
},
|
356 |
{
|
357 |
"cell_type": "code",
|
358 |
+
"execution_count": 260,
|
359 |
"metadata": {},
|
360 |
"outputs": [
|
361 |
{
|
362 |
"name": "stdout",
|
363 |
"output_type": "stream",
|
364 |
"text": [
|
365 |
+
"Total tokens: 10,000,000,000.0\n",
|
366 |
+
"Epochs required: 2,384\n",
|
367 |
"\n",
|
368 |
+
"RTX A2000: 1187.25 seconds/epoch, 32.76 days required\n",
|
369 |
+
"RTX A4000: 450.11 seconds/epoch, 12.42 days required\n",
|
370 |
+
"RTX 3090: 199.80 seconds/epoch, 5.51 days required\n",
|
371 |
+
"A100 SXM: 56.57 seconds/epoch, 1.56 days required\n",
|
372 |
+
"HGX A100: 7.07 seconds/epoch, 0.20 days required\n"
|
373 |
]
|
374 |
}
|
375 |
],
|
376 |
"source": [
|
|
|
|
|
|
|
377 |
"num_epochs_required = round(num_training_tokens / (samples_per_epoch * block_size))\n",
|
378 |
"\n",
|
379 |
"print(f\"Total tokens: {num_training_tokens:,}\")\n",
|
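The new `--num_epochs` default of 2384 in `pre-train.py` follows directly from the arithmetic in this notebook; a standalone check:

```
# Standalone check of the epochs-required estimate from the notebook above.
num_training_tokens = 10e9  # Fineweb sample-10BT
samples_per_epoch = 4096
block_size = 1024           # tokens per sample

tokens_per_epoch = samples_per_epoch * block_size  # 4,194,304

print(round(num_training_tokens / tokens_per_epoch))  # 2384
```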
pre-train.py
CHANGED
@@ -20,7 +20,7 @@ from torch.distributed.fsdp import FullyShardedDataParallel, ShardingStrategy
|
|
20 |
from torchmetrics.text import Perplexity
|
21 |
|
22 |
from model import GPT
|
23 |
-
from data import
|
24 |
|
25 |
from tqdm import tqdm
|
26 |
|
@@ -38,25 +38,35 @@ DDP_BACKEND = "nccl"
|
|
38 |
def main():
|
39 |
parser = ArgumentParser(description="Pre-train the GPT.")
|
40 |
|
41 |
parser.add_argument("--batch_size", default=1, type=int)
|
42 |
parser.add_argument("--gradient_accumulation_steps", default=128, type=int)
|
43 |
parser.add_argument("--samples_per_epoch", default=4096, type=int)
|
44 |
parser.add_argument("--learning_rate", default=1e-2, type=float)
|
45 |
parser.add_argument("--max_gradient_norm", default=1.0, type=float)
|
46 |
parser.add_argument("--dropout", default=0.1, type=float)
|
47 |
-
parser.add_argument("--num_epochs", default=
|
48 |
parser.add_argument("--block_size", default=1024, type=int)
|
49 |
parser.add_argument("--embedding_dimensions", default=1024, type=int)
|
50 |
parser.add_argument("--num_attention_heads", default=16, type=int)
|
51 |
parser.add_argument("--num_hidden_layers", default=32, type=int)
|
52 |
parser.add_argument("--activation_checkpointing", action="store_true")
|
53 |
-
parser.add_argument("--ddp_sharding_level", default=2, choices=
|
54 |
parser.add_argument("--eval_interval", default=10, type=int)
|
55 |
parser.add_argument("--checkpoint_interval", default=20, type=int)
|
56 |
parser.add_argument("--checkpoint_path", default="./out/checkpoint.pt", type=str)
|
57 |
parser.add_argument("--resume", action="store_true")
|
58 |
-
parser.add_argument("--dataset_path", default="./dataset", type=str)
|
59 |
-
parser.add_argument("--num_dataset_processes", default=8, type=int)
|
60 |
parser.add_argument("--device", default="cuda", type=str)
|
61 |
parser.add_argument("--seed", default=None, type=int)
|
62 |
|
@@ -139,18 +149,22 @@ def main():
|
|
139 |
torch.manual_seed(args.seed)
|
140 |
random.seed(args.seed)
|
141 |
|
142 |
-
training =
|
143 |
root_path=args.dataset_path,
|
144 |
-
|
|
|
145 |
tokens_per_sample=args.block_size,
|
146 |
samples_per_epoch=args.samples_per_epoch,
|
|
|
147 |
num_processes=args.num_dataset_processes,
|
148 |
)
|
149 |
-
testing =
|
150 |
root_path=args.dataset_path,
|
151 |
-
|
|
|
152 |
tokens_per_sample=args.block_size,
|
153 |
samples_per_epoch=args.samples_per_epoch,
|
|
|
154 |
num_processes=args.num_dataset_processes,
|
155 |
)
|
156 |
|
|
|
20 |
from torchmetrics.text import Perplexity
|
21 |
|
22 |
from model import GPT
|
23 |
+
from data import Fineweb
|
24 |
|
25 |
from tqdm import tqdm
|
26 |
|
|
|
38 |
def main():
|
39 |
parser = ArgumentParser(description="Pre-train the GPT.")
|
40 |
|
41 |
+
parser.add_argument(
|
42 |
+
"--dataset_subset",
|
43 |
+
default="sample-10BT",
|
44 |
+
choices=("sample-10BT", "sample-100BT", "sample-350BT", None),
|
45 |
+
)
|
46 |
+
parser.add_argument(
|
47 |
+
"--token_encoding",
|
48 |
+
default="r50k_base",
|
49 |
+
choices=("r50k_base", "cl100k_base", "o200k_base"),
|
50 |
+
)
|
51 |
+
parser.add_argument("--dataset_path", default="./dataset", type=str)
|
52 |
+
parser.add_argument("--num_dataset_processes", default=8, type=int)
|
53 |
parser.add_argument("--batch_size", default=1, type=int)
|
54 |
parser.add_argument("--gradient_accumulation_steps", default=128, type=int)
|
55 |
parser.add_argument("--samples_per_epoch", default=4096, type=int)
|
56 |
parser.add_argument("--learning_rate", default=1e-2, type=float)
|
57 |
parser.add_argument("--max_gradient_norm", default=1.0, type=float)
|
58 |
parser.add_argument("--dropout", default=0.1, type=float)
|
59 |
+
parser.add_argument("--num_epochs", default=2384, type=int)
|
60 |
parser.add_argument("--block_size", default=1024, type=int)
|
61 |
parser.add_argument("--embedding_dimensions", default=1024, type=int)
|
62 |
parser.add_argument("--num_attention_heads", default=16, type=int)
|
63 |
parser.add_argument("--num_hidden_layers", default=32, type=int)
|
64 |
parser.add_argument("--activation_checkpointing", action="store_true")
|
65 |
+
parser.add_argument("--ddp_sharding_level", default=2, choices=(0, 2, 3))
|
66 |
parser.add_argument("--eval_interval", default=10, type=int)
|
67 |
parser.add_argument("--checkpoint_interval", default=20, type=int)
|
68 |
parser.add_argument("--checkpoint_path", default="./out/checkpoint.pt", type=str)
|
69 |
parser.add_argument("--resume", action="store_true")
|
|
70 |
parser.add_argument("--device", default="cuda", type=str)
|
71 |
parser.add_argument("--seed", default=None, type=int)
|
72 |
|
|
|
149 |
torch.manual_seed(args.seed)
|
150 |
random.seed(args.seed)
|
151 |
|
152 |
+
training = Fineweb(
|
153 |
root_path=args.dataset_path,
|
154 |
+
subset=args.dataset_subset,
|
155 |
+
split="train",
|
156 |
tokens_per_sample=args.block_size,
|
157 |
samples_per_epoch=args.samples_per_epoch,
|
158 |
+
token_encoding=args.token_encoding,
|
159 |
num_processes=args.num_dataset_processes,
|
160 |
)
|
161 |
+
testing = Fineweb(
|
162 |
root_path=args.dataset_path,
|
163 |
+
subset=args.dataset_subset,
|
164 |
+
split="test",
|
165 |
tokens_per_sample=args.block_size,
|
166 |
samples_per_epoch=args.samples_per_epoch,
|
167 |
+
token_encoding=args.token_encoding,
|
168 |
num_processes=args.num_dataset_processes,
|
169 |
)
|
170 |
|
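The body of `main()` that consumes `--ddp_sharding_level` is not shown in this diff. A plausible mapping onto the `ShardingStrategy` enum already imported at the top of `pre-train.py` might look like the following hypothetical sketch, which is not necessarily the repository's actual implementation:

```
# Hypothetical mapping of --ddp_sharding_level onto FSDP sharding strategies.
from torch.distributed.fsdp import FullyShardedDataParallel, ShardingStrategy


def sharding_strategy(level: int) -> ShardingStrategy:
    if level == 3:
        return ShardingStrategy.FULL_SHARD     # shard params, gradients, and optimizer state
    if level == 2:
        return ShardingStrategy.SHARD_GRAD_OP  # shard gradients and optimizer state only
    return ShardingStrategy.NO_SHARD           # level 0: no sharding

# e.g. model = FullyShardedDataParallel(model, sharding_strategy=sharding_strategy(args.ddp_sharding_level))
```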