---
license: apache-2.0
datasets:
- JeanKaddour/minipile
language:
- en
pipeline_tag: text-generation
---

# megalodon-200m: minipile

Small pretraining experiment:

- 8192-token context, approx. 1 epoch
- codebase: https://github.com/pszemraj/megalodon/tree/dataload-fixes
- [training logs](https://huggingface.co./pszemraj/megalodon-200m-minipile/raw/main/train.log)

### Model Configuration

- **Number of Layers:** 12
- **Model Dimension:** 1024
- **Z Dimension:** 256
- **Value Dimension:** 2048
- **Number of Heads:** 1
- **FFN Hidden Dimension:** 2560
- **CEMA NDIM:** 16
- **Chunk Size:** 2048
- **Efficient Attention:** None
- **Initialization Mode:** He
- **Vocabulary Size:** 20480
- **Output Size:** 20480
- **Normalization Groups:** 32
- **Normalization Affine:** True
- **Normalization Epsilon:** 1e-05
- **ROPE Base:** None
- **Dropout:** 0.0
- **Hidden Dropout:** 0.0
- **Attention Dropout:** 0.0
- **SWIGLU:** False
- **Rescale NFFN:** False
- **Scale Embedding:** False
- **Share Embedding:** False
- **Layerwise Checkpointing:** False
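The configuration above can be sketched as a plain Python dict for reference. Note the field names below are hypothetical mirrors of the list above, not necessarily the argument names used by the megalodon codebase:

```python
# Hypothetical config dict mirroring the values listed above;
# the actual codebase may use different parameter names.
megalodon_200m_config = {
    "num_layers": 12,
    "model_dim": 1024,
    "z_dim": 256,
    "value_dim": 2048,
    "num_heads": 1,
    "ffn_hidden_dim": 2560,
    "cema_ndim": 16,
    "chunk_size": 2048,
    "efficient_attn": None,
    "init_mode": "he",
    "vocab_size": 20480,
    "output_size": 20480,
    "norm_groups": 32,
    "norm_affine": True,
    "norm_eps": 1e-05,
    "rope_base": None,
    "dropout": 0.0,
    "hidden_dropout": 0.0,
    "attention_dropout": 0.0,
    "swiglu": False,
    "rescale_nffn": False,
    "scale_embedding": False,
    "share_embedding": False,
    "layerwise_checkpointing": False,
}

# Quick sanity check: the embedding table alone contributes
# vocab_size * model_dim parameters.
emb_params = megalodon_200m_config["vocab_size"] * megalodon_200m_config["model_dim"]
print(emb_params)  # 20971520
```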