USR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation
Abstract
The lack of meaningful automatic evaluation metrics for dialog has impeded open-domain dialog research. Standard language generation metrics have been shown to be ineffective for evaluating dialog models. To this end, this paper presents USR, an UnSupervised and Reference-free evaluation metric for dialog. USR is a reference-free metric that trains unsupervised models to measure several desirable qualities of dialog. USR is shown to strongly correlate with human judgment on both Topical-Chat (turn-level: 0.42, system-level: 1.0) and PersonaChat (turn-level: 0.48, system-level: 1.0). USR additionally produces interpretable measures for several desirable properties of dialog.
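The correlations reported above compare automatic metric scores against human judgments. As a minimal sketch of how such a turn-level correlation could be computed, the snippet below implements the Pearson correlation coefficient in pure Python; the score lists are illustrative placeholders, not data from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-turn scores (made up for illustration):
metric_scores = [0.2, 0.5, 0.9, 0.4, 0.7]  # automatic metric, e.g. in [0, 1]
human_scores = [1, 3, 5, 2, 4]             # human ratings, e.g. on a 1-5 scale
print(round(pearson(metric_scores, human_scores), 2))
```

In practice, dialog evaluation papers often report Spearman or Pearson correlations at both the turn level (per response) and the system level (per model, averaging scores first).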