ONEBench to Test Them All: Sample-Level Benchmarking Over Open-Ended Capabilities Paper • 2412.06745 • Published 16 days ago • 6
Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks? Paper • 2411.05000 • Published Nov 7 • 21
On scalable oversight with weak LLMs judging strong LLMs Paper • 2407.04622 • Published Jul 5 • 11
InstructVideo: Instructing Video Diffusion Models with Human Feedback Paper • 2312.12490 • Published Dec 19, 2023 • 17
arXiVeri: Automatic table verification with GPT Paper • 2306.07968 • Published Jun 13, 2023 • 6
Crosslingual Generalization through Multitask Finetuning Paper • 2211.01786 • Published Nov 3, 2022 • 2