Upsampling experiment details

#8
by egor-pakhomov - opened

My question is about the details of the experiment on the upsampled version of T360. What was the exact order of steps that produced the 1.5T tokens from T360 that were later compared to FineWeb? In theory there are two ways to go about it:
(1) Take the 5T tokens of fully deduplicated T360 and upsample them according to the recipe, then take 1.5T from the resulting 15T tokens.
(2) Take 0.5T of the 5T fully deduplicated T360 tokens, then upsample those 0.5T with the recipe to arrive at the needed 1.5T.
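
To make the difference between the two orderings concrete, here is a minimal toy sketch. It is not the actual TxT360 pipeline: the corpus, the integer weights, and the function names `upsample`, `option_1`, and `option_2` are all hypothetical, and it counts documents rather than tokens. The weights are chosen only so that the ratios mirror the question (average weight 3, then a 10% sample).

```python
import random
from collections import Counter

def upsample(docs, weights):
    """Repeat each doc id according to its (hypothetical) integer weight."""
    return [d for d in docs for _ in range(weights[d])]

def option_1(docs, weights, fraction, rng):
    """Upsample the full deduplicated corpus, then sample a fraction of the copies."""
    copies = upsample(docs, weights)
    k = round(len(copies) * fraction)
    return rng.sample(copies, k)

def option_2(docs, weights, fraction, rng):
    """Sample a fraction of the deduplicated docs, then upsample only those."""
    k = round(len(docs) * fraction)
    subset = rng.sample(docs, k)
    return upsample(subset, weights)

# Toy stand-in for the deduplicated corpus: 1000 docs, weights 1..5 (mean 3),
# so the upsampled corpus is 3x larger, as in 5T -> 15T.
docs = list(range(1000))
weights = {d: 1 + (d % 5) for d in docs}

rng = random.Random(0)
sample_1 = Counter(option_1(docs, weights, 0.1, rng))
sample_2 = Counter(option_2(docs, weights, 0.1, rng))

# Option 1 fixes the number of copies; a doc may keep only some of its copies.
print(sum(sample_1.values()), "copies from", len(sample_1), "distinct docs")
# Option 2 fixes the number of distinct docs; each kept doc has all its copies.
print(sum(sample_2.values()), "copies from", len(sample_2), "distinct docs")
```

In this toy setup, option 1 always returns exactly 10% of the upsampled copies, but a given document can appear with fewer copies than its weight. Option 2 always returns exactly 10% of the distinct documents, each repeated its full weight, so the total size only matches the target in expectation. That is why the order matters for an exact reproduction.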

We are reproducing some of the ablations to ensure that our ablation mechanism is correct, and an exact match to your approach would be very beneficial.

Hi,

Thanks for noticing our work. I apologize that most of this blog post focuses on the processing details. We intend to write an updated, more complete version of the TxT360 experiment and release another technical report.

For your question, I believe we did (1). Just in case I am wrong, I am cc'ing @lynnhao and @maxma1987 to confirm.
