Liger: Linearizing Large Language Models to Gated Recurrent Structures Paper • 2503.01496 • Published 3 days ago • 14
Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs Paper • 2503.01307 • Published 4 days ago • 27
Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models Paper • 2502.15499 • Published 13 days ago • 13
Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam Paper • 2502.17055 • Published 11 days ago • 16
AlphaMaze: Enhancing Large Language Models' Spatial Intelligence via GRPO Paper • 2502.14669 • Published 14 days ago • 11
Revisiting the Test-Time Scaling of o1-like Models: Do they Truly Possess Test-Time Scaling Capabilities? Paper • 2502.12215 • Published 18 days ago • 16
You Do Not Fully Utilize Transformer's Representation Capacity Paper • 2502.09245 • Published 22 days ago • 34
Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity Paper • 2502.13063 • Published 16 days ago • 65
REALTALK: A 21-Day Real-World Dataset for Long-Term Conversation Paper • 2502.13270 • Published 16 days ago • 6
LongPO: Long Context Self-Evolution of Large Language Models through Short-to-Long Preference Optimization Paper • 2502.13922 • Published 15 days ago • 25
Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering Paper • 2502.13962 • Published 15 days ago • 28
MoM: Linear Sequence Modeling with Mixture-of-Memories Paper • 2502.13685 • Published 15 days ago • 33