MiniMax-01: Scaling Foundation Models with Lightning Attention Paper • 2501.08313 • Published Jan 14, 2025 • 273
Retrofitting (Large) Language Models with Dynamic Tokenization Paper • 2411.18553 • Published Nov 27, 2024 • 2
Byte Latent Transformer: Patches Scale Better Than Tokens Paper • 2412.09871 • Published Dec 13, 2024 • 91
ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning Paper • 2502.01100 • Published Feb 3, 2025 • 14
The Differences Between Direct Alignment Algorithms are a Blur Paper • 2502.01237 • Published Feb 3, 2025 • 108
Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning Paper • 2502.03275 • Published Feb 5, 2025 • 11
BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation Paper • 2502.03860 • Published Feb 6, 2025 • 14
Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps Paper • 2501.09732 • Published Jan 16, 2025 • 67
Pairwise RM: Perform Best-of-N Sampling with Knockout Tournament Paper • 2501.13007 • Published Jan 22, 2025 • 19
Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback Paper • 2501.12895 • Published Jan 22, 2025 • 55
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning Paper • 2501.12948 • Published Jan 22, 2025 • 305
Sigma: Differential Rescaling of Query, Key and Value for Efficient Language Models Paper • 2501.13629 • Published Jan 23, 2025 • 43
ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer Paper • 2501.15570 • Published Jan 26, 2025 • 23
SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training Paper • 2501.17161 • Published Jan 28, 2025 • 101