True, but it seems like there's nothing to evaluate right now. I assume the ultimate goal is to train a new reasoning model and then use the same evaluation metrics as o1 and DeepSeek-R1.
Well, there should at least be some sanity checks and validation to ensure the model was trained correctly.
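For illustration, a minimal smoke test might look like the sketch below: load the checkpoint, greedily generate on a couple of prompts with easily checked answers, and fail loudly on empty or degenerate output. The checkpoint path and prompts are placeholders, not anything from this thread; a real validation run would use the same benchmark suites reported for o1 / DeepSeek-R1.

```python
# Minimal post-training smoke test (hypothetical paths/prompts, not from this thread).
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "path/to/trained-checkpoint"  # placeholder; point at your own checkpoint

# Toy prompts with easily verifiable answers.
CASES = [
    ("What is 7 * 8? Answer with just the number.", "56"),
    ("Spell the word 'cat' backwards.", "tac"),
]

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)

for prompt, expected in CASES:
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    # Strip the prompt tokens so only the completion is inspected.
    completion = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    ).strip()
    print(f"Q: {prompt}\nA: {completion}\n")
    assert completion, "Model produced empty output -- training is likely broken"
    if expected not in completion:
        print(f"  warning: expected substring {expected!r} not found")
```

This obviously isn't a substitute for the full benchmarks, but it catches the cheap failure modes (broken tokenizer, corrupted weights, degenerate repetition) before spending compute on a proper eval.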