arxiv:2412.13378

SummExecEdit: A Factual Consistency Benchmark in Summarization with Executable Edits

Published on Dec 17, 2024
Abstract

Detecting factual inconsistencies in summarization is critical, yet existing benchmarks lack the necessary challenge and interpretability for robust evaluation. In this paper, we introduce SummExecEdit, a novel benchmark leveraging executable edits to assess models on their ability to both detect factual errors and provide accurate explanations. The top-performing model, Claude3-Opus, achieves a joint detection and explanation score of only 0.49 in our benchmark, with individual scores of 0.67 for detection and 0.73 for explanation. Furthermore, we identify four primary types of explanation errors, with 45.4% of errors focusing on completely unrelated parts of the summary.
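
To make the reported metrics concrete, below is a minimal sketch of how detection, explanation, and joint scores could be computed over benchmark examples. The dataclass fields, the token-overlap explanation check, and the rule that joint credit requires both a correct detection and an acceptable explanation are illustrative assumptions, not the paper's actual scoring protocol.

```python
# Hypothetical scoring sketch for a SummExecEdit-style benchmark.
# Assumes each example carries a binary consistency label and, for
# inconsistent summaries, a reference explanation of the introduced error.

from dataclasses import dataclass

@dataclass
class Example:
    document: str           # source document
    summary: str            # summary produced via an executable edit
    is_consistent: bool     # gold label: factually consistent or not
    gold_explanation: str   # reference description of the error ("" if consistent)

@dataclass
class Prediction:
    predicted_consistent: bool
    explanation: str        # model's explanation of the detected error

def explanation_ok(pred: str, gold: str) -> bool:
    # Placeholder check; a real benchmark would use a stronger judge
    # (e.g. LLM-based or human evaluation) rather than token overlap.
    pred_tokens, gold_tokens = set(pred.lower().split()), set(gold.lower().split())
    return bool(gold_tokens) and len(pred_tokens & gold_tokens) / len(gold_tokens) >= 0.5

def score(examples: list[Example], predictions: list[Prediction]) -> dict[str, float]:
    det_hits, expl_hits, joint_hits, n_errors = 0, 0, 0, 0
    for ex, pred in zip(examples, predictions):
        detected = pred.predicted_consistent == ex.is_consistent
        det_hits += detected
        if not ex.is_consistent:            # explanations scored only on inconsistent summaries
            n_errors += 1
            ok = explanation_ok(pred.explanation, ex.gold_explanation)
            expl_hits += ok
            joint_hits += detected and ok   # joint credit needs both detection and explanation
    return {
        "detection": det_hits / len(examples),
        "explanation": expl_hits / max(n_errors, 1),
        "joint": joint_hits / max(n_errors, 1),
    }
```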
