Sharpness Minimization Algorithms Do Not Only Minimize Sharpness To Achieve Better Generalization • Paper • 2307.11007 • Published Jul 20, 2023
RNNs are not Transformers (Yet): The Key Bottleneck on In-context Retrieval • Paper • 2402.18510 • Published Feb 28, 2024
Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models • Paper • 2501.11873 • Published 19 days ago • 63
Skywork-Math: Data Scaling Laws for Mathematical Reasoning in Large Language Models -- The Story Goes On • Paper • 2407.08348 • Published Jul 11, 2024 • 51