Question about comment-to-code ratio
Hi, the SantaCoder paper mentions that the final training run, which used near-deduplication plus the comment-to-code ratio filter, achieved the best result. However, the comment-to-code ratio part is missing in the StarCoder paper. Is the comment-to-code ratio strategy still used to preprocess the dataset for StarCoder?
Hi, as shown in the paper, the comment-to-code ratio filter gives a small boost in performance, but the gain is not significant compared to near-deduplication. We decided it wasn't worth adapting it to 80+ programming languages, so for StarCoder we only did aggressive near-deduplication.
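For anyone wondering what such a filter involves, here is a minimal sketch for Python files only. It is not the actual SantaCoder preprocessing code: the function names `comment_to_code_ratio` and `keep_file` and the default thresholds are illustrative assumptions, and only `#` comments are counted (docstrings are ignored). It measures the fraction of a file's characters that sit inside comments and keeps the file only if that fraction falls within a chosen range.

```python
import io
import tokenize


def comment_to_code_ratio(source: str) -> float:
    """Return the fraction of characters in `source` that are in `#` comments."""
    if not source:
        return 0.0
    comment_chars = 0
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.COMMENT:
                comment_chars += len(tok.string)
    except (tokenize.TokenError, SyntaxError):
        # Assumption: a file that cannot be tokenized is treated as comment-free.
        return 0.0
    return comment_chars / len(source)


def keep_file(source: str, min_ratio: float = 0.01, max_ratio: float = 0.8) -> bool:
    """Keep a file only if its comment ratio is inside [min_ratio, max_ratio].

    The default thresholds are placeholders chosen for illustration.
    """
    return min_ratio <= comment_to_code_ratio(source) <= max_ratio


if __name__ == "__main__":
    print(keep_file("x = 1  # set x\n"))   # has some comments
    print(keep_file("x = 1\n"))            # has no comments
```

Note that the comment detection relies on Python's own tokenizer. Every other language needs its own comment syntax handling, which is the per-language effort mentioned above.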
Thank you