How Do LLMs Acquire New Knowledge? A Knowledge Circuits Perspective on Continual Pre-Training
Paper • 2502.11196 • Published • 21
Thanks for your work on energy efficiency. It really piqued my curiosity!
Why do SmolLM-135M and SmolLM-1.7B get nearly the same score despite a roughly tenfold difference in model size? Is that mostly caused by the identical context size?
Could you please enable encoder-decoder models? In theory they should be more efficient, because the input only has to be encoded once and the encoder output can be reused at every decoding step.
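To make the efficiency argument concrete, here is a minimal sketch, assuming the Hugging Face `transformers` library and `t5-small` as a stand-in encoder-decoder model: the encoder runs exactly once per input, and its hidden states are then reused for cross-attention at every decoding step.

```python
# Sketch only: t5-small stands in for any encoder-decoder model.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

inputs = tokenizer(
    "translate English to German: The house is small.", return_tensors="pt"
)

# Encode once: this cost is paid a single time per input.
encoder_outputs = model.get_encoder()(**inputs)

# Greedy decoding loop: only the decoder runs per step,
# reusing encoder_outputs for cross-attention.
decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
for _ in range(20):
    out = model(
        encoder_outputs=encoder_outputs,
        decoder_input_ids=decoder_input_ids,
        attention_mask=inputs["attention_mask"],
    )
    next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
    decoder_input_ids = torch.cat([decoder_input_ids, next_token], dim=-1)
    if next_token.item() == model.config.eos_token_id:
        break

print(tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True))
```

(A real benchmark would also cache decoder key/value states, but the point here is just that the encoder forward pass does not repeat.)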
Good write-up, though it doesn't cover the dominant attention sink found in current decoder-only models:
https://colab.research.google.com/drive/1Fcgug4a6rv9F-Wej0rNveiM_SMNZOtrr?usp=sharing
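For readers who don't want to open the notebook, here is a minimal sketch, assuming GPT-2 via `transformers`, of what an attention sink looks like: in many decoder-only models a large share of attention mass piles up on the very first token, regardless of its content.

```python
# Sketch only: gpt2 is used as a readily available decoder-only model.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_attentions=True)

# out.attentions: one (batch, heads, query, key) tensor per layer.
# Average over layers, batch, and heads, then read off how much
# attention each query position pays to key position 0 (the "sink").
attn = torch.stack(out.attentions).mean(dim=(0, 1, 2))
sink_share = attn[:, 0]
print("Mean attention paid to the first token, per query position:")
print(sink_share)
```

Later query positions typically still assign a disproportionate fraction of their attention to token 0, which is the behavior the notebook demonstrates in more detail.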