', unsafe_allow_html=True)

# Dropdown menu to filter tiers
tiers = ['All Tiers', 'Tier 1: Easy', 'Tier 2: Moderate', 'Tier 3: Hard']
selected_tier = st.selectbox('Select Tier:', tiers)

# Filter the data based on the selected tier
if selected_tier != 'All Tiers':
    filtered_df = df[df['Tier'] == selected_tier]
else:
    filtered_df = df

# Create HTML for the table, starting with the header row
html = '''
<table>
<thead>
<tr><th>Tier</th><th>Model</th><th>FactScore</th><th>SAFE</th><th>Factcheck-GPT</th><th>VERIFY</th></tr>
</thead>
<tbody>'''

# Generate the rows of the table
current_tier = None
for i, row in filtered_df.iterrows():
    if row['Tier'] != current_tier:
        if current_tier is not None:
            # Close the previous tier row
            html += '</tr>'
        current_tier = row['Tier']
        # First row of a tier: show the tier label in the first column
        html += f'<tr><td>{current_tier}</td>'
    else:
        # Same tier as the previous row: close that row and leave the tier cell empty
        html += '</tr><tr><td></td>'
    # Fill in model and scores
    html += f'''
<td>{row['Model']}</td>
<td>{row['FactScore']:.2f}</td>
<td>{row['SAFE']:.2f}</td>
<td>{row['Factcheck-GPT']:.2f}</td>
<td>{row['VERIFY']:.2f}</td>'''

# Close the last row and table tags
html += '''
</tr>
</tbody>
</table>
'''

# Display the table
st.markdown(html, unsafe_allow_html=True)
st.markdown('

', unsafe_allow_html=True)
st.markdown('Benchmark Details', unsafe_allow_html=True)
st.image(image, use_column_width=True)

st.markdown('### VERIFY: A Pipeline for Factuality Evaluation')
st.write(
    "Language models (LMs) are used by a growing number of people, "
    "which underscores the challenge of maintaining factual accuracy across a broad range of topics. "
    "We present VERIFY (Verification and Evidence Retrieval for Factuality evaluation), "
    "a pipeline to evaluate LMs' factual accuracy in real-world user interactions."
)

st.markdown('### Content Categorization')
st.write(
    "VERIFY considers the verifiability of LM-generated content and categorizes content units as "
    "`supported`, `unsupported`, or `undecidable` based on the retrieved web evidence. "
    "Importantly, VERIFY's factuality judgments correlate better with human evaluations than those of existing methods."
)

st.markdown('### Hallucination Prompts & FactBench Dataset')
st.write(
    "Using VERIFY, we identify 'hallucination prompts' across diverse topics, that is, the prompts that elicit "
    "the highest rates of incorrect or unverifiable LM responses. These prompts form FactBench, a dataset of "
    "985 prompts across 213 fine-grained topics. Our dataset captures emerging factuality challenges in "
    "real-world LM interactions and is regularly updated with new prompts."
)
st.markdown('

', unsafe_allow_html=True)
st.markdown('Submit your model information on our GitHub', unsafe_allow_html=True)
st.markdown('[Test your model locally!](https://github.com/FarimaFatahi/FactEval)')
st.markdown('[Submit results or issues!](https://github.com/FarimaFatahi/FactEval/issues/new)')
st.markdown('