---
datasets:
- EleutherAI/the_pile_deduplicated
language:
- en
---
# broken because of updates to the transformers library; I'll reimplement and retrain

GLORT2 (GLORT2 Low Rank Transformer Transformer) is a transformer model in which every linear layer is itself another, smaller transformer. The Q, K, and V projections are fused into a single operation, so each attention block needs one replacement transformer instead of three, which saves parameters. I also experimented with putting a transformer on the embeddings, but it wasn't great. The outer model is 768-dimensional with 10 layers, and each replaced linear layer (everything except the embeddings and the LM head) uses a 384-dimensional, 1-layer transformer; a rough sketch of the idea follows below.
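A minimal PyTorch sketch of the idea (a simplified illustration only, not the actual GLORT2 module code: the class names, the low-rank down/up projections, and the choice to attend over the sequence dimension are all assumptions made for this sketch):

```python
import torch
import torch.nn as nn


class TransformerLinear(nn.Module):
    """Stand-in for nn.Linear: a low-rank down-projection into a 384-dim space,
    a single 384-dim transformer encoder layer, then an up-projection to the
    requested output size. The real GLORT2 wiring may differ."""

    def __init__(self, in_features: int, out_features: int,
                 inner_dim: int = 384, n_heads: int = 6):
        super().__init__()
        self.down = nn.Linear(in_features, inner_dim)
        self.inner = nn.TransformerEncoderLayer(
            d_model=inner_dim, nhead=n_heads,
            dim_feedforward=4 * inner_dim, batch_first=True)
        self.up = nn.Linear(inner_dim, out_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, in_features)
        h = self.down(x)   # (batch, seq, inner_dim)
        h = self.inner(h)  # the "linear layer" is itself a tiny transformer
        return self.up(h)  # (batch, seq, out_features)


class FusedQKV(nn.Module):
    """One TransformerLinear produces Q, K, and V in a single shot,
    so one small transformer stands in for three separate projections."""

    def __init__(self, d_model: int = 768):
        super().__init__()
        self.qkv = TransformerLinear(d_model, 3 * d_model)

    def forward(self, x: torch.Tensor):
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        return q, k, v
```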
Also, a heads-up: the model code still carries some leftover "expanded lm head size" logic from one of my own projects that I copied from. Just ignore it if you're reading the config or the code; this isn't a serious project, so I haven't bothered to remove it.

| model | 512-token strided perplexity (Pile test set) | training tokens |
| --- | --- | --- |
| Cerebras-GPT 111M | 21.550655364990234 | 2.2B |
| Cerebras-GPT 256M | 15.203496932983398 | 5.1B |
| Cerebras-GPT 590M | 12.098200798034668 | 11.something B |
| Pythia 70M deduped (95.6M params) | 22.393400192260742 | 300B |
| Pythia 160M deduped (213M params) | 13.933751106262207 | 300B |
| Pythia 410M deduped (506M params) | 9.61842155456543 | 300B |
| LLaMA-style model with the same settings as Cerebras 111M (119M params) | 13.882301330566406 | 2.2B |
| LLaMA-style model with the same settings as Cerebras 111M, plus Llama 70B embeddings (369M params) | 13.565109252929688 | 2.2B |
| **GLORT2 (205M params)** | 13.051741600036621 | 2.2B |
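
The perplexity column follows the usual sliding-window recipe. Here's a rough, generic sketch of how such numbers can be computed: the 512-token stride comes from the column header, while the 2048-token window, the checkpoint name, and the test-file path are placeholders/assumptions, not the exact evaluation script used here.

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder checkpoint; swap in whichever model you want to score.
model_id = "EleutherAI/pythia-70m-deduped"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()


@torch.no_grad()
def strided_perplexity(text: str, window: int = 2048, stride: int = 512) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    seq_len = ids.size(1)
    nll_sum, n_scored, prev_end = 0.0, 0, 0
    for begin in range(0, seq_len, stride):
        end = min(begin + window, seq_len)
        trg_len = end - prev_end          # tokens scored for the first time
        chunk = ids[:, begin:end]
        labels = chunk.clone()
        labels[:, :-trg_len] = -100       # mask the already-scored left context
        loss = model(chunk, labels=labels).loss
        nll_sum += loss.item() * trg_len  # approximate: ignores the 1-token label shift
        n_scored += trg_len
        prev_end = end
        if end == seq_len:
            break
    return math.exp(nll_sum / n_scored)


# e.g. strided_perplexity(open("pile_test.txt").read())  # hypothetical test file
```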
Few-shot benchmark results (lm-evaluation-harness format):

|Tasks|Version|Filter|n-shot|Metric|Value| |Stderr|
|--------------|------:|----------|-----:|-----------|-----:|---|-----:|
|arc_challenge| 1|none | 25|acc |0.1706|± |0.0110|
| | |none | 25|acc_norm|0.2099|± |0.0119|
|truthfulqa_mc2| 2|none | 0|acc |0.4599|± |0.0154|
|winogrande| 1|none | 5|acc |0.5083|± |0.0141|
|hellaswag| 1|none | 10|acc |0.2728|± |0.0044|
| | |none | 10|acc_norm|0.2815|± |0.0045|
|gsm8k| 2|get-answer| 5|exact_match|0|± |0|
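
For reference, here's roughly how comparable numbers could be re-run through lm-evaluation-harness's Python API; this is a sketch assuming the v0.4+ `lm_eval` package, with a placeholder checkpoint path, and the few-shot counts taken from the table above.

```python
# Sketch only: assumes EleutherAI's lm-evaluation-harness (`lm_eval`, v0.4+).
import lm_eval

FEWSHOT = {
    "arc_challenge": 25,
    "truthfulqa_mc2": 0,
    "winogrande": 5,
    "hellaswag": 10,
    "gsm8k": 5,
}

for task, n_shot in FEWSHOT.items():
    out = lm_eval.simple_evaluate(
        model="hf",
        model_args="pretrained=path/to/glort2",  # placeholder checkpoint path
        tasks=[task],
        num_fewshot=n_shot,
    )
    print(task, out["results"][task])
```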
### mmlu

Mean accuracy across the 57 tasks is 0.2639 (as far as I can tell, this is the unweighted average of the per-task accuracies below; a sketch for recomputing it follows the table).

|Tasks|Version|Filter|n-shot|Metric|Value| |Stderr|
|-----------------------------------|------:|------|-----:|------|-----:|---|-----:|
|world_religions | 0|none | 5|acc |0.1988|± |0.0306|
|virology | 0|none | 5|acc |0.1928|± |0.0307|
|us_foreign_policy | 0|none | 5|acc |0.2600|± |0.0441|
|sociology | 0|none | 5|acc |0.2438|± |0.0304|
|security_studies | 0|none | 5|acc |0.4000|± |0.0314|
|public_relations | 0|none | 5|acc |0.2273|± |0.0401|
|professional_psychology | 0|none | 5|acc |0.2484|± |0.0175|
|professional_medicine | 0|none | 5|acc |0.4485|± |0.0302|
|professional_law | 0|none | 5|acc |0.2445|± |0.0110|
|professional_accounting | 0|none | 5|acc |0.2482|± |0.0258|
|prehistory | 0|none | 5|acc |0.2562|± |0.0243|
|philosophy | 0|none | 5|acc |0.2186|± |0.0235|
|nutrition | 0|none | 5|acc |0.2941|± |0.0261|
|moral_scenarios | 0|none | 5|acc |0.2503|± |0.0145|
|moral_disputes | 0|none | 5|acc |0.1965|± |0.0214|
|miscellaneous | 0|none | 5|acc |0.2554|± |0.0156|
|medical_genetics | 0|none | 5|acc |0.3000|± |0.0461|
|marketing | 0|none | 5|acc |0.1966|± |0.0260|
|management | 0|none | 5|acc |0.1942|± |0.0392|
|machine_learning | 0|none | 5|acc |0.2321|± |0.0401|
|logical_fallacies | 0|none | 5|acc |0.2331|± |0.0332|
|jurisprudence | 0|none | 5|acc |0.2407|± |0.0413|
|international_law | 0|none | 5|acc |0.3719|± |0.0441|
|human_sexuality | 0|none | 5|acc |0.2137|± |0.0360|
|human_aging | 0|none | 5|acc |0.2646|± |0.0296|
|high_school_world_history | 0|none | 5|acc |0.2489|± |0.0281|
|high_school_us_history | 0|none | 5|acc |0.2304|± |0.0296|
|high_school_statistics | 0|none | 5|acc |0.4722|± |0.0340|
|high_school_psychology | 0|none | 5|acc |0.3083|± |0.0198|
|high_school_physics | 0|none | 5|acc |0.3046|± |0.0376|
|high_school_microeconomics | 0|none | 5|acc |0.3361|± |0.0307|
|high_school_mathematics | 0|none | 5|acc |0.2630|± |0.0268|
|high_school_macroeconomics | 0|none | 5|acc |0.3231|± |0.0237|
|high_school_government_and_politics| 0|none | 5|acc |0.3523|± |0.0345|
|high_school_geography | 0|none | 5|acc |0.3384|± |0.0337|
|high_school_european_history | 0|none | 5|acc |0.2909|± |0.0355|
|high_school_computer_science | 0|none | 5|acc |0.2600|± |0.0441|
|high_school_chemistry | 0|none | 5|acc |0.2709|± |0.0313|
|high_school_biology | 0|none | 5|acc |0.3161|± |0.0265|
|global_facts | 0|none | 5|acc |0.1800|± |0.0386|
|formal_logic | 0|none | 5|acc |0.1667|± |0.0333|
|elementary_mathematics | 0|none | 5|acc |0.2540|± |0.0224|
|electrical_engineering | 0|none | 5|acc |0.3103|± |0.0386|
|econometrics | 0|none | 5|acc |0.2895|± |0.0427|
|conceptual_physics | 0|none | 5|acc |0.2553|± |0.0285|
|computer_security | 0|none | 5|acc |0.1900|± |0.0394|
|college_physics | 0|none | 5|acc |0.3431|± |0.0472|
|college_medicine | 0|none | 5|acc |0.2312|± |0.0321|
|college_mathematics | 0|none | 5|acc |0.1800|± |0.0386|
|college_computer_science | 0|none | 5|acc |0.3000|± |0.0461|
|college_chemistry | 0|none | 5|acc |0.2900|± |0.0456|
|college_biology | 0|none | 5|acc |0.2083|± |0.0340|
|clinical_knowledge | 0|none | 5|acc |0.2038|± |0.0248|
|business_ethics | 0|none | 5|acc |0.2100|± |0.0409|
|astronomy | 0|none | 5|acc |0.1908|± |0.0320|
|anatomy | 0|none | 5|acc |0.2963|± |0.0394|
|abstract_algebra | 0|none | 5|acc |0.2000|± |0.0402|
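
As mentioned above, the mean appears to be the plain unweighted average of the 57 per-task accuracies (not weighted by the number of questions per task). A quick sketch for recomputing it from the table text; only two rows are inlined here to keep the snippet short:

```python
import re

# Paste the per-task rows of the table above (the lines starting with "|")
# into TABLE_ROWS; the remaining 55 rows are elided in this sketch.
TABLE_ROWS = """
|world_religions | 0|none | 5|acc |0.1988|± |0.0306|
|virology | 0|none | 5|acc |0.1928|± |0.0307|
"""

accs = []
for line in TABLE_ROWS.strip().splitlines():
    cells = [c.strip() for c in line.strip().strip("|").split("|")]
    # cells: [task, version, filter, n-shot, metric, value, ±, stderr]
    if len(cells) >= 6 and re.fullmatch(r"0\.\d+", cells[5]):
        accs.append(float(cells[5]))

print(sum(accs) / len(accs))  # ~0.2639 when run over all 57 rows
```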