Papers-Benchmarks - a sugatoray Collection

updated 2 days ago

CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery

Paper • 2406.08587 • Published Jun 12, 2024 • 16
Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning

Paper • 2406.09170 • Published Jun 13, 2024 • 27
AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents

Paper • 2407.18901 • Published Jul 26, 2024 • 33
Benchmarking Agentic Workflow Generation

Paper • 2410.07869 • Published Oct 10, 2024 • 26
OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations

Paper • 2412.07626 • Published Dec 10, 2024 • 22
opendatalab/OmniDocBench

Viewer • Updated 18 days ago • 984 • 2.6k • 21
OmniEval 🥇

Space • Sleeping • 4
RUC-NLPIR/OmniEval-AutoGen-Dataset

Updated Dec 19, 2024 • 23 • 2
m-ric/agents_medium_benchmark_2

Viewer • Updated Dec 27, 2024 • 142 • 206 • 9
gaia-benchmark/GAIA

Updated 16 days ago • 9.34k • 218
GAIA: a benchmark for General AI Assistants

Paper • 2311.12983 • Published Nov 21, 2023 • 192
m-ric/agents_small_benchmark

Viewer • Updated Jan 19, 2024 • 100 • 132 • 10
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents

Paper • 2502.09560 • Published 15 days ago • 32
m-a-p/CodeCriticBench

Viewer • Updated 4 days ago • 4.3k • 69 • 2