import streamlit as st
from transformers import AutoModelForSequenceClassification, AutoTokenizer, AutoConfig
from ferret import Benchmark
from torch.nn.functional import softmax
from copy import deepcopy

DEFAULT_MODEL = "Hate-speech-CNERG/bert-base-uncased-hatexplain"
DEFAULT_SAMPLES = "3,5,8,13,15,17,18,25,27,28"


def get_model(model_name):
    return AutoModelForSequenceClassification.from_pretrained(model_name)


def get_config(model_name):
    return AutoConfig.from_pretrained(model_name)


def get_tokenizer(tokenizer_name):
    return AutoTokenizer.from_pretrained(tokenizer_name, use_fast=True)


def body():
    st.title("Evaluate explanations on dataset samples")

    st.markdown(
        """
        Let's test how our built-in explainers behave on state-of-the-art datasets for explainability.

        *ferret* exposes an extensible Dataset API. We currently implement [MovieReviews](https://huggingface.co./datasets/movie_rationales) and [HateXPlain](https://huggingface.co./datasets/hatexplain).

        In this demo, we let you experiment with HateXPlain.

        You just need to choose a prediction model and a set of samples to test.
        We will trigger *ferret* to:

        1. download the model;
        2. explain every sample you chose;
        3. average all faithfulness and plausibility metrics we support 📊
        """
    )

    col1, col2 = st.columns([3, 1])
    with col1:
        model_name = st.text_input("HF Model", DEFAULT_MODEL)
        config = AutoConfig.from_pretrained(model_name)

    with col2:
        class_labels = list(config.id2label.values())
        target = st.selectbox(
            "Target",
            options=class_labels,
            index=0,
            help="Class label you want to explain.",
        )

    samples_string = st.text_input(
        "List of samples",
        DEFAULT_SAMPLES,
        help="List of indices in the dataset, comma-separated.",
    )

    compute = st.button("Run")
    # Skip empty fields so an empty input or a trailing comma does not raise ValueError.
    samples = [int(s) for s in samples_string.replace(" ", "").split(",") if s]

    if compute and model_name:
        with st.spinner("Preparing the magic. Hang in there..."):
            model = get_model(model_name)
            tokenizer = get_tokenizer(model_name)
            bench = Benchmark(model, tokenizer)

        with st.spinner("Explaining sample (this might take a while)..."):

            def compute_table(samples):
                data = bench.load_dataset("hatexplain")
                sample_evaluations = bench.evaluate_samples(
                    data, samples, target=class_labels.index(target)
                )
                table = bench.show_samples_evaluation_table(sample_evaluations).format(
                    "{:.2f}"
                )
                return table

            table = compute_table(samples)

        st.markdown("### Averaged metrics")
        st.dataframe(table)
        st.caption("Darker colors mean better performance.")

        # scores = bench.score(text)
        # scores_str = ", ".join(
        #     [f"{config.id2label[l]}: {s:.2f}" for l, s in enumerate(scores)]
        # )
        # st.text(scores_str)

        # with st.spinner("Computing Explanations.."):
        #     explanations = bench.explain(text, target=class_labels.index(target))

        # st.markdown("### Explanations")
        # st.dataframe(bench.show_table(explanations))
        # st.caption("Darker red (blue) means higher (lower) contribution.")

        # with st.spinner("Evaluating Explanations..."):
        #     evaluations = bench.evaluate_explanations(
        #         explanations, target=class_labels.index(target), apply_style=False
        #     )

        # st.markdown("### Faithfulness Metrics")
        # st.dataframe(bench.show_evaluation_table(evaluations))
        # st.caption("Darker colors mean better performance.")

        st.markdown(
            """
            **Legend**

            **Faithfulness**

            - **AOPC Comprehensiveness** (aopc_compr) measures *comprehensiveness*, i.e., whether the explanation captures all the tokens needed to make the prediction. Higher is better.
            - **AOPC Sufficiency** (aopc_suff) measures *sufficiency*, i.e., whether the relevant tokens in the explanation are sufficient to make the prediction. Lower is better.
            - **Leave-One-Out TAU Correlation** (taucorr_loo) measures the Kendall rank correlation coefficient τ between the explanation and leave-one-out importances. Closer to 1 is better.

            **Plausibility**

            - **AUPRC plausibility** (auprc_plau) is the area under the precision-recall curve (AUPRC) of the explanation, with the human rationale as ground truth. Higher is better.
            - **Intersection-Over-Union (IOU)** (token_iou_plau) is the size of the overlap between the most relevant tokens of the explanation and the human rationale, divided by the size of their union. Higher is better.
            - **Token-level F1 score** (token_f1_plau) measures the F1 score between the most relevant tokens and the human rationale. Higher is better.

            See the paper for details.
            """
        )

        st.markdown(
            """
            **In code, it would be as simple as**
            """
        )
        st.code(
            f"""
            from transformers import AutoModelForSequenceClassification, AutoTokenizer
            from ferret import Benchmark

            model = AutoModelForSequenceClassification.from_pretrained("{model_name}")
            tokenizer = AutoTokenizer.from_pretrained("{model_name}")

            bench = Benchmark(model, tokenizer)
            data = bench.load_dataset("hatexplain")
            evaluations = bench.evaluate_samples(data, {samples})
            bench.show_samples_evaluation_table(evaluations)
            """,
            language="python",
        )
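

# ---------------------------------------------------------------------------
# Illustrative sketch only: not part of the original app and not ferret's API.
# The legend in body() describes two set-based plausibility metrics, token-level
# IOU and token-level F1. The helpers below show one way such scores could be
# computed from per-token importance scores and a human rationale. The names
# top_k_tokens, token_iou and token_f1, and the choice of a fixed k, are
# assumptions made for this example; ferret's own implementation may select
# the "most relevant tokens" differently.
# ---------------------------------------------------------------------------


def top_k_tokens(scores, k):
    """Return the set of indices of the k highest-scoring tokens."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def token_iou(predicted, rationale):
    """Size of the overlap divided by the size of the union of two index sets."""
    union = predicted | rationale
    if not union:
        return 0.0
    return len(predicted & rationale) / len(union)


def token_f1(predicted, rationale):
    """Harmonic mean of precision and recall between two index sets."""
    overlap = len(predicted & rationale)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(rationale)
    return 2 * precision * recall / (precision + recall)


# Example usage (hypothetical scores and rationale):
#   predicted = top_k_tokens([0.1, 0.9, 0.4, 0.8], k=2)
#   token_iou(predicted, {1, 2})
#   token_f1(predicted, {1, 2})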