sugatoray's Collections
Papers-Benchmarks
CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery
Paper • 2406.08587 • Published • 15
Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning
Paper • 2406.09170 • Published • 26
AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents
Paper • 2407.18901 • Published • 33
Benchmarking Agentic Workflow Generation
Paper • 2410.07869 • Published • 25
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations
Paper • 2412.07626 • Published • 22
Viewer • Updated • 984 • 1.53k • 19
🥇 OmniEval
RUC-NLPIR/OmniEval-AutoGen-Dataset
Updated • 57 • 2
m-ric/agents_medium_benchmark_2
Viewer • Updated • 142 • 291 • 7
Viewer • Updated • 932 • 673 • 178
GAIA: a benchmark for General AI Assistants
Paper • 2311.12983 • Published • 188
m-ric/agents_small_benchmark
Viewer • Updated • 100 • 68 • 9