Question on the training epoch

#1
by Tomohide - opened

Thank you for releasing this great model.
I have one question.

The "Training" section says that "The model was trained on around 312.5B tokens from Japanese CC-100, Japanese C4, and Japanese Wikipedia.." .
I think the total number of tokens in these corpora is about 180B, and so this statement means the training epoch is 1.73 epochs (= 312.5 / 180)?

Thank you in advance.

@Tomohide You are welcome.
Since data processing, filtering, and resampling were applied to the training data, the exact token count may not match your assumption.
But I believe the final dataset is not far from 180B tokens, so the estimate of roughly 1.73 epochs should be close enough.
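For reference, a minimal sketch of that estimate, using the rounded token counts quoted above (both figures are approximations from this thread, not exact dataset sizes):

```python
# Rough epoch estimate from approximate token counts (assumed values, see discussion above)
training_tokens = 312.5e9   # tokens seen during training, per the model card
corpus_tokens = 180e9       # approximate size of Japanese CC-100 + C4 + Wikipedia

epochs = training_tokens / corpus_tokens
print(f"Approximate epochs: {epochs:.2f}")  # ~1.74
```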

@tianyuz Thank you for your response!

Tomohide changed discussion status to closed
