Question about comment-to-code Ratio

#54

by feiyulv - opened Jun 17, 2023

Jun 17, 2023

Hi, the SantaCoder paper mentions that final training using near-dedup + comment-to-code ratio filtering achieves the best result. However, the comment-to-code ratio part is missing from the StarCoder paper. Is the comment-to-code ratio strategy still used to preprocess the dataset for StarCoder?

loubnabnl

BigCode org Jun 19, 2023

edited Jun 19, 2023

Hi, as shown in the paper, comment-to-code ratio filtering gives a small boost in performance, but the gain is not significant compared to near-deduplication. We therefore decided it wasn't worth adapting the filter to 80+ programming languages and only applied aggressive near-deduplication.
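For readers unfamiliar with the filter being discussed, here is a minimal sketch of what a comment-to-code ratio check could look like for Python files only. It is not the BigCode implementation: the function names, the character-level definition of the ratio, and the 1% / 80% thresholds are all assumptions chosen for illustration. It also hints at why extending the filter to many languages is costly, since each language needs its own comment tokenizer.

```python
import io
import tokenize


def comment_to_code_ratio(source: str) -> float:
    """Return the fraction of characters in `source` that are `#` comments.

    Hypothetical helper: docstrings are NOT counted, and files that fail
    to tokenize are given a ratio of 0.0.
    """
    if not source:
        return 0.0
    comment_chars = 0
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.COMMENT:
                comment_chars += len(tok.string)
    except (tokenize.TokenError, SyntaxError):
        # Unparseable file: treat it as having no usable comments.
        return 0.0
    return comment_chars / len(source)


def keep_file(source: str, min_ratio: float = 0.01, max_ratio: float = 0.8) -> bool:
    """Keep a file only if its comment ratio is inside [min_ratio, max_ratio].

    The default thresholds are placeholders, not values confirmed in this thread.
    """
    return min_ratio <= comment_to_code_ratio(source) <= max_ratio


if __name__ == "__main__":
    example = "# add two numbers\ndef add(a, b):\n    return a + b\n"
    print(comment_to_code_ratio(example), keep_file(example))
```

A file with no comments at all, or one that is almost entirely comments, would be dropped under these assumed thresholds.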

feiyulv

Jun 19, 2023

Thank you

feiyulv changed discussion status to closed Jun 19, 2023
