', unsafe_allow_html=True)
st.markdown('Benchmark Details', unsafe_allow_html=True)
st.image(image, use_column_width=True)
st.markdown('### VERIFY: A Pipeline for Factuality Evaluation')
st.write(
    "Language models (LMs) serve a large and growing user base, "
    "which makes maintaining factual accuracy across a broad range of topics a pressing challenge. "
    "We present VERIFY (Verification and Evidence Retrieval for Factuality evaluation), "
    "a pipeline for evaluating LMs' factual accuracy in real-world user interactions."
)
st.markdown('### Content Categorization')
st.write(
    "VERIFY considers the verifiability of LM-generated content and categorizes each content unit as "
    "`supported`, `unsupported`, or `undecidable` based on the retrieved web evidence. "
    "Importantly, VERIFY's factuality judgments correlate better with human evaluations than those of existing methods."
)
st.markdown('### Hallucination Prompts & FactBench Dataset')
st.write(
    "Using VERIFY, we identify 'hallucination prompts' across diverse topics: prompts that elicit the highest rates of "
    "incorrect or unverifiable LM responses. These prompts form FactBench, a dataset of 985 prompts spanning 213 "
    "fine-grained topics. The dataset captures emerging factuality challenges in real-world LM interactions and is "
    "regularly updated with new prompts."
)
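# Illustrative sketch only (not part of the app's logic): the three-way verdict
# described in the text above, modeled as a small enum. The label names come
# from the VERIFY description; the helper function and its decision rule are
# hypothetical simplifications, not the pipeline's actual implementation.
from enum import Enum


class Verdict(Enum):
    SUPPORTED = "supported"
    UNSUPPORTED = "unsupported"
    UNDECIDABLE = "undecidable"


def categorize(evidence_found: bool, evidence_agrees: bool) -> Verdict:
    """Toy rule: no retrieved evidence -> undecidable; otherwise the
    verdict follows whether the evidence agrees with the content unit."""
    if not evidence_found:
        return Verdict.UNDECIDABLE
    return Verdict.SUPPORTED if evidence_agrees else Verdict.UNSUPPORTED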
st.markdown('', unsafe_allow_html=True)
# Tab 3: Links
with tab3:
    st.markdown('', unsafe_allow_html=True)
    st.markdown('Submit your model information on our GitHub',
                unsafe_allow_html=True)
    st.markdown(
        '[Test your model locally!](https://github.com/FarimaFatahi/FactEval)')
    st.markdown(
        '[Submit results or issues!](https://github.com/FarimaFatahi/FactEval/issues/new)')
    st.markdown('', unsafe_allow_html=True)