arxiv:2410.05838

Time Transfer: On Optimal Learning Rate and Batch Size In The Infinite Data Limit

Published on Oct 8, 2024

Authors:

Oleg Filatov ,

Abstract

One of the main challenges in optimal scaling of large language models (LLMs) is the prohibitive cost of hyperparameter tuning, particularly learning rate eta and batch size B. While techniques like muP (Yang et al., 2022) provide scaling rules for optimal eta transfer in the infinite model size limit, the optimal scaling behavior in the infinite data size limit remains unknown. We fill in this gap by observing for the first time an intricate dependence of optimal eta scaling on the pretraining token budget T, B and its relation to the critical batch size B_crit, which we measure to evolve as B_crit propto T. Furthermore, we show that the optimal batch size is positively correlated with B_crit: keeping it fixed becomes suboptimal over time even if learning rate is scaled optimally. Surprisingly, our results demonstrate that the observed optimal eta and B dynamics are preserved with muP model scaling, challenging the conventional view of B_crit dependence solely on loss value. Complementing optimality, we examine the sensitivity of loss to changes in learning rate, where we find the sensitivity to decrease with increase of T and to remain constant with muP model scaling. We hope our results make the first step towards a unified picture of the joint optimal data and model scaling.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2410.05838 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2410.05838 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2410.05838 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.