title: vendiscore
datasets:
- null
tags:
- evaluate
- metric
description: >-
The Vendi Score is a metric for evaluating diversity in machine learning. See
the project's README at https://github.com/vertaix/Vendi-Score for more
information.
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
Metric Card for VendiScore
The Vendi Score (VS) is a metric for evaluating diversity in machine learning. The input to the metric is a collection of samples and a pairwise similarity function, and the output is a number that can be interpreted as the effective number of unique elements in the sample. See the project's README at https://github.com/vertaix/Vendi-Score for more information.
Metric Description
The Vendi Score (VS) is a metric for evaluating diversity in machine learning.
The input to the metric is a collection of samples and a pairwise similarity function, and the output is a number that can be interpreted as the effective number of unique elements in the sample.
Specifically, given an n x n positive semi-definite matrix K of similarity scores, the score is defined as:
VS(K) = exp(-tr(K/n @ log(K/n))) = exp(-sum_i lambda_i log lambda_i),
where lambda_i are the eigenvalues of K/n and 0 log 0 = 0.
That is, the Vendi Score is equal to the exponential of the von Neumann entropy of K/n, or equivalently of the Shannon entropy of its eigenvalues, which is also known as the effective rank.
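The definition above can be sketched directly in NumPy. This is a minimal illustration of the eigenvalue form exp(-sum_i lambda_i log lambda_i), not the package's own implementation; the function name `vendi_score_from_kernel` and the tolerance `tol` are assumptions made for this example.

```python
import numpy as np

def vendi_score_from_kernel(K, tol=1e-12):
    """Exponential of the Shannon entropy of the eigenvalues of K/n.

    Hypothetical helper for illustration; K is assumed to be a symmetric
    positive semi-definite n x n similarity matrix with ones on the diagonal.
    """
    K = np.asarray(K, dtype=float)
    n = K.shape[0]
    # eigvalsh is appropriate because K/n is symmetric.
    eigvals = np.linalg.eigvalsh(K / n)
    # Drop (numerically) zero eigenvalues, using the convention 0 log 0 = 0.
    eigvals = eigvals[eigvals > tol]
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))

K = np.array([[1.0, 0.9, 0.0],
              [0.9, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
vs = vendi_score_from_kernel(K)
```

Because the diagonal of K is all ones, the eigenvalues of K/n sum to 1 and can be read as a probability distribution, which is why their Shannon entropy is well defined.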
For more information, please see our paper, The Vendi Score: A Diversity Evaluation Metric for Machine Learning.
How to Use
The Vendi Score is available as a Python package or in HuggingFace evaluate.
To use the Python package, see the instructions at https://github.com/vertaix/Vendi-Score.
The evaluate module supports text, numbers, and precomputed similarity scores or feature embeddings.
For images and other data types, use the Python package, which provides more support.
To use the evaluate module, first install the requirements:
pip install evaluate
pip install vendi_score[all]
To calculate the score, pass a list of samples and a similarity function or a string identifying a predefined class of similarity functions (see below).
>>> import evaluate
>>> vendiscore = evaluate.load("Vertaix/vendiscore", "text")
>>> sents = ["Look, Jane.", "See Spot.", "See Spot run.", "Run, Spot, run.", "Jane sees Spot run."]
>>> results = vendiscore.compute(samples=sents, k="ngram_overlap", ns=[1, 2])
>>> print(results)
{'VS': 3.90657...}
Inputs
- samples: an iterable containing n samples to score, an n x n similarity matrix K, or an n x d feature matrix X.
- k: a pairwise similarity function, or a string identifying a predefined similarity function. If k is a pairwise similarity function, it should be symmetric and satisfy k(x, x) = 1. Predefined options: ngram_overlap, text_embeddings.
- score_K: if true, samples is an n x n similarity matrix K.
- score_X: if true, samples is an n x d feature matrix X.
- score_dual: if true, samples is an n x d feature matrix X, and the diversity score is computed from the d x d covariance matrix X.T @ X.
- normalize: if true, normalize the similarity scores.
- model (optional): if k is "text_embeddings", a model mapping sentences to embeddings (output should be an object with an attribute called pooler_output or last_hidden_state).
- tokenizer (optional): if k is "text_embeddings" or "ngram_overlap", a tokenizer mapping strings to lists.
- model_path (optional): if k is "text_embeddings", the name of a model on the HuggingFace hub.
- ns (optional): if k is "ngram_overlap", the values of n to calculate.
- batch_size (optional): batch size to use if k is "text_embeddings".
- device (optional): a string (e.g. "cuda", "cpu") or torch.device identifying the device to use if k is "text_embeddings".
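To make the requirements on k concrete, the following sketch builds the n x n matrix K from a list of samples and a pairwise similarity function, then checks the two properties listed above (symmetry and k(x, x) = 1). The helper `similarity_matrix` is hypothetical and only illustrates the idea; the module may construct K differently.

```python
import numpy as np

def similarity_matrix(samples, k):
    """Evaluate k on every pair of samples (hypothetical helper)."""
    n = len(samples)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(i, n):
            # k is assumed symmetric, so each pair is evaluated once.
            K[i, j] = K[j, i] = k(samples[i], samples[j])
    return K

samples = [0, 0, 10, 10, 20, 20]
k = lambda a, b: np.exp(-abs(a - b))  # equals 1 when a == b, decays with distance
K = similarity_matrix(samples, k)

is_symmetric = bool(np.allclose(K, K.T))
has_unit_diagonal = bool(np.allclose(np.diag(K), 1.0))
```

A unit diagonal matters because it makes the eigenvalues of K/n sum to 1, which the entropy in the definition of the score relies on.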
Output Values
The output is a dictionary with one key, "VS". Given n samples, the value of the Vendi Score ranges between 1 and n, with higher numbers indicating that the sample is more diverse.
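The two ends of this range can be checked by hand. If all n samples are identical, K is the all-ones matrix and K/n has eigenvalues (1, 0, ..., 0); if all samples are mutually dissimilar, K is the identity and every eigenvalue of K/n is 1/n. The sketch below applies the entropy formula to those eigenvalues directly; `effective_number` is a name chosen for this illustration.

```python
import numpy as np

def effective_number(eigvals):
    """exp of the Shannon entropy of a set of eigenvalues (illustrative helper)."""
    p = np.asarray(eigvals, dtype=float)
    p = p[p > 0]  # convention: 0 log 0 = 0
    return float(np.exp(-np.sum(p * np.log(p))))

n = 5
# All samples identical: eigenvalues of K/n are (1, 0, ..., 0).
vs_identical = effective_number([1.0] + [0.0] * (n - 1))
# All samples mutually dissimilar: eigenvalues of K/n are all 1/n.
vs_distinct = effective_number([1.0 / n] * n)
```

Here `vs_identical` comes out as 1 and `vs_distinct` as n, the lower and upper bounds stated above.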
Examples
>>> import evaluate
>>> import numpy as np
>>> vendiscore = evaluate.load("Vertaix/vendiscore", "int")
>>> samples = [0, 0, 10, 10, 20, 20]
>>> k = lambda a, b: np.exp(-np.abs(a - b))
>>> vendiscore.compute(samples=samples, k=k)
{'VS': 2.9999...}
If you have already precomputed a similarity matrix:
>>> vendiscore = evaluate.load("Vertaix/vendiscore", "K")
>>> K = np.array([[1.0, 0.9, 0.0],
...               [0.9, 1.0, 0.0],
...               [0.0, 0.0, 1.0]])
>>> vendiscore.compute(samples=K, score_K=True)
{'VS': 2.1573...}
If your similarity function is a dot product between n normalized d-dimensional embeddings X, and d < n, it is faster to compute the Vendi Score using the d x d covariance matrix, X.T @ X.
(If the rows of X are not normalized, set normalize=True.)
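The reason the smaller matrix suffices is that X @ X.T (n x n) and X.T @ X (d x d) share the same nonzero eigenvalues, so both give the same entropy. The sketch below checks this numerically with plain NumPy; it is an illustration under that assumption, not the module's code, and `entropy_exp` is a hypothetical helper.

```python
import numpy as np

def entropy_exp(eigvals, tol=1e-12):
    """exp of the Shannon entropy of the eigenvalues above tol (illustrative)."""
    eigvals = eigvals[eigvals > tol]
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))

X = np.array([[100, 0], [99, 1], [1, 99], [0, 100]], dtype=float)
# Normalize each row to unit length so that dot products are cosine similarities.
X = X / np.linalg.norm(X, axis=1, keepdims=True)
n, d = X.shape

# Primal form: eigenvalues of the n x n similarity matrix divided by n.
vs_primal = entropy_exp(np.linalg.eigvalsh(X @ X.T / n))
# Dual form: eigenvalues of the d x d covariance matrix divided by n.
vs_dual = entropy_exp(np.linalg.eigvalsh(X.T @ X / n))
```

Both values agree up to floating-point error, while the dual form only needs the eigenvalues of a d x d matrix.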
>>> vendiscore = evaluate.load("Vertaix/vendiscore", "X")
>>> X = np.array([[100, 0], [99, 1], [1, 99], [0, 100]])
>>> vendiscore.compute(samples=X, score_dual=True, normalize=True)
{'VS': 1.99989...}
Text similarity can be calculated using n-gram overlap or using inner products between embeddings from a neural network.
>>> vendiscore = evaluate.load("Vertaix/vendiscore", "text")
>>> sents = ["Look, Jane.", "See Spot.", "See Spot run.", "Run, Spot, run.", "Jane sees Spot run."]
>>> ngram_vs = vendiscore.compute(samples=sents, k="ngram_overlap", ns=[1, 2])["VS"]
>>> bert_vs = vendiscore.compute(samples=sents, k="text_embeddings", model_path="bert-base-uncased")["VS"]
>>> print(f"N-grams: {ngram_vs:.02f}, BERT: {bert_vs:.02f}")
N-grams: 3.91, BERT: 1.21
Limitations and Bias
The Vendi Score depends on the choice of similarity function. Care should be taken to select a similarity function that reflects the features that are relevant for defining diversity in a given application.
Citation
@article{friedman2022vendi,
title={The Vendi Score: A Diversity Evaluation Metric for Machine Learning},
author={Friedman, Dan and Dieng, Adji Bousso},
journal={arXiv preprint arXiv:2210.02410},
year={2022}
}