How did you estimate model parameter count before training and use correct hyperparams?
As the title says.
It's pretty simple if you look at config.json in this repo: using the information inside it, you can calculate the parameter count yourself.
Starting with the embedding layers, both the token and position embeddings are sized at 192 dimensions and applied to a vocabulary of 32,000 tokens. The calculation for each is straightforward: 32,000 * 192 = 6,144,000 parameters. Since both tables are the same size in this model, together they total about 12.29 million parameters.
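If you want to sanity-check that arithmetic in code, here's a minimal sketch in plain Python (the numbers come straight from the config; the doubling follows the assumption above that both tables have the same shape):

```python
vocab_size = 32_000
hidden_size = 192

# One embedding table: one 192-dimensional vector per vocabulary entry
single_table = vocab_size * hidden_size      # 6,144,000

# Two tables of the same shape, per the description above
embedding_params = 2 * single_table          # 12,288,000 (~12.29M)
print(embedding_params)
```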
The attention mechanism in this model uses 2 heads, each with 96 dimensions for queries, keys, and values, split out of the 192-dimensional embedding space. The parameters for the attention mechanism are calculated as 2 heads * 3 matrices (Q, K, V) * 192 embedding dimensions * 96 dimensions per head = 110,592 parameters for those matrices. The output projection also has its own parameters, mapping the 192-dimensional concatenated heads back to 192 dimensions: 192 * 192 = 36,864 parameters, giving a total of 147,456 parameters for the entire multi-headed attention mechanism.
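The same attention arithmetic as a quick sketch (head_dim = hidden_size / num_attention_heads = 96 is taken from the per-head split described above):

```python
hidden_size = 192
num_heads = 2
head_dim = hidden_size // num_heads          # 96

# Q, K, V projections: 3 matrices spanning all heads, each 192 x (2 * 96)
qkv_params = 3 * hidden_size * num_heads * head_dim   # 110,592

# Output projection mapping the concatenated heads back to 192
out_proj_params = hidden_size * hidden_size           # 36,864

attention_params = qkv_params + out_proj_params       # 147,456
print(attention_params)
```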
For the feed-forward network, which uses a width of 1024 for the intermediate layer, each linear transformation contributes significantly. The first transformation upscales from 192 to 1024, calculated as 192 * 1,024 = 196,608, and the second downscales back to 192 with the same number of parameters, 1,024 * 192 = 196,608. Together, these add up to 393,216 parameters for the layer.
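And the feed-forward block, assuming the simple two-matrix (up/down) layout described above with no biases:

```python
hidden_size = 192
intermediate_size = 1024

up_proj = hidden_size * intermediate_size    # 196,608
down_proj = intermediate_size * hidden_size  # 196,608

ffn_params = up_proj + down_proj             # 393,216
print(ffn_params)
```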
Layer normalization, despite being a smaller contributor to the total, adds 2 * 192 = 384 parameters due to its scaling and centering factors. When all of these components are summed, the total parameter count comes out to approximately 12.83 million parameters (a bit above 10M, but within my margin of "error"). There's also a short script after the config snippet below that recomputes this total.
Important parts from config.json for reference:
{
. . .
"hidden_size": 192,
"intermediate_size": 1024,
"max_position_embeddings": 1024,
"num_attention_heads": 2,
"num_hidden_layers": 1,
"num_key_value_heads": 1,
. . .
"vocab_size": 32000
}
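Putting it all together, here's a rough script that re-derives the estimate directly from config.json. It follows the same bias-free breakdown as the walkthrough above, so treat it as a sketch rather than an exact count; details like tied embeddings, gated MLPs, or grouped-query attention (the num_key_value_heads field) will shift the real number a bit.

```python
import json

# Load the Hugging Face-style config shown above
with open("config.json") as f:
    cfg = json.load(f)

d      = cfg["hidden_size"]            # 192
vocab  = cfg["vocab_size"]             # 32,000
layers = cfg["num_hidden_layers"]      # 1
heads  = cfg["num_attention_heads"]    # 2
inter  = cfg["intermediate_size"]      # 1024
head_dim = d // heads                  # 96

embeddings = 2 * vocab * d                      # two embedding-sized tables
attention  = 3 * d * heads * head_dim + d * d   # Q/K/V + output projection
ffn        = 2 * d * inter                      # up- and down-projection
layer_norm = 2 * d                              # scale and shift

per_layer = attention + ffn + layer_norm
total = embeddings + layers * per_layer

print(f"~{total / 1e6:.2f}M parameters")        # ~12.83M with this config
```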
Hope this answers your question; you can apply this to all of the other OpenLAiNN models.
Thank you so much! I pretrained a couple of models myself, but never got the param count right.
Are you thinking about open-sourcing the training code? Thank you again.
I'm debating it. Since I'm working on multiple projects right now, the training code will probably be open-sourced once I clean it up and make it not awful to look at and read. That will likely happen around OpenLAiNN-2, as I'm working on a second version with a different arch that might be interesting. No promises on that one, though. :)
As of right now, I'm also working on creating some instruct-tuned versions of these models.