	---
	language: en
	tags:
	- text-classification
	- onnx
	- emotions
	- multi-class-classification
	- multi-label-classification
	datasets:
	- go_emotions
	license: mit
	inference: false
	widget:
	- text: ONNX is so much faster, its very handy!
	---

	### Overview

	This is a multi-label, multi-class linear classifier for emotions. It takes embeddings from [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co./sentence-transformers/all-MiniLM-L6-v2) as input and was trained on the [go_emotions](https://huggingface.co./datasets/go_emotions) dataset.

	### Labels

	The 28 labels from the [go_emotions](https://huggingface.co./datasets/go_emotions) dataset are:
	```
	['admiration', 'amusement', 'anger', 'annoyance', 'approval', 'caring', 'confusion', 'curiosity', 'desire', 'disappointment', 'disapproval', 'disgust', 'embarrassment', 'excitement', 'fear', 'gratitude', 'grief', 'joy', 'love', 'nervousness', 'optimism', 'pride', 'realization', 'relief', 'remorse', 'sadness', 'surprise', 'neutral']
	```

	### Metrics (exact match of labels per item)

	This is a multi-label, multi-class dataset, so each label is effectively a separate binary classification. The metrics below are evaluated across all labels per item on the go_emotions test split.

	With the threshold chosen per label to maximise F1, the metrics (evaluated on the go_emotions test split) are:

	- Precision: 0.384
	- Recall: 0.438
	- F1: 0.397

	Weighted by the relative support of each label in the dataset, these are:

	- Precision: 0.443
	- Recall: 0.552
	- F1: 0.484
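
The two sets of figures differ only in how the per-label values are averaged. The sketch below is a rough check rather than the evaluation script used for this model: it assumes the summary F1 values are the plain mean and the support-weighted mean of the per-label F1 scores, with the F1 and support numbers copied from the per-label table in this card. The names `PER_LABEL` and `macro_and_weighted` are illustrative only.

```python
# (label, f1, support) copied from the per-label table in this card.
PER_LABEL = [
    ("admiration", 0.529, 504), ("amusement", 0.733, 264),
    ("anger", 0.394, 198), ("annoyance", 0.293, 320),
    ("approval", 0.292, 351), ("caring", 0.320, 135),
    ("confusion", 0.291, 153), ("curiosity", 0.366, 284),
    ("desire", 0.317, 83), ("disappointment", 0.159, 151),
    ("disapproval", 0.306, 267), ("disgust", 0.405, 123),
    ("embarrassment", 0.364, 37), ("excitement", 0.296, 103),
    ("fear", 0.496, 78), ("gratitude", 0.793, 352),
    ("grief", 0.323, 6), ("joy", 0.402, 161),
    ("love", 0.640, 238), ("nervousness", 0.263, 23),
    ("optimism", 0.433, 186), ("pride", 0.429, 16),
    ("realization", 0.177, 145), ("relief", 0.182, 11),
    ("remorse", 0.541, 56), ("sadness", 0.461, 156),
    ("surprise", 0.302, 141), ("neutral", 0.620, 1787),
]


def macro_and_weighted(rows):
    """Return (unweighted mean, support-weighted mean) of the per-label scores."""
    scores = [score for _, score, _ in rows]
    supports = [support for _, _, support in rows]
    macro = sum(scores) / len(scores)
    weighted = sum(s * n for s, n in zip(scores, supports)) / sum(supports)
    return macro, weighted


macro_f1, weighted_f1 = macro_and_weighted(PER_LABEL)
print(f"unweighted mean F1: {macro_f1:.3f}, support-weighted mean F1: {weighted_f1:.3f}")
```

Both results land close to the F1 figures quoted above. The weighted mean is higher because frequent labels such as neutral, which score relatively well, dominate it.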

	Using a fixed threshold of 0.5 to convert the scores to binary predictions for each label, the metrics (evaluated on the go_emotions test split, and unweighted by support) are:

	- Precision: 0.551
	- Recall: 0.211
	- F1: 0.261
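
To make the fixed-threshold evaluation concrete, here is a minimal sketch of turning scores into binary predictions at 0.5 and computing unweighted per-label metrics. The scores and true labels are made up for illustration, the helper `binary_prf` is hypothetical, and treating a score exactly equal to the threshold as positive is an assumption.

```python
import numpy as np


def binary_prf(y_true, y_pred):
    """Precision, recall and F1 for one label, given 0/1 arrays."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data (made up): 4 items x 2 labels.
scores = np.array([[0.9, 0.2],
                   [0.4, 0.6],
                   [0.7, 0.1],
                   [0.3, 0.3]])
truth = np.array([[1, 0],
                  [1, 1],
                  [0, 0],
                  [0, 1]])

# Assumption: a score equal to the threshold counts as a positive prediction.
preds = (scores >= 0.5).astype(int)

# One (precision, recall, f1) triple per label, then the unweighted mean.
per_label = [binary_prf(truth[:, j], preds[:, j]) for j in range(truth.shape[1])]
macro_precision, macro_recall, macro_f1 = np.mean(per_label, axis=0)
print(per_label)
print(macro_precision, macro_recall, macro_f1)
```

In this toy case the fixed threshold misses half of the positives for each label, which is the same high-precision, low-recall pattern seen in the figures above.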

	### Metrics (per-label)

	This is a multi-label, multi-class dataset, so each label is effectively a separate binary classification and metrics are better measured per label.

	With the threshold chosen per label to maximise F1, the metrics (evaluated on the go_emotions test split) are:

	| label          | f1    | precision | recall | support | threshold |
	| -------------- | ----- | --------- | ------ | ------- | --------- |
	| admiration     | 0.529 | 0.499     | 0.563  | 504     | 0.25      |
	| amusement      | 0.733 | 0.672     | 0.807  | 264     | 0.20      |
	| anger          | 0.394 | 0.363     | 0.429  | 198     | 0.15      |
	| annoyance      | 0.293 | 0.252     | 0.350  | 320     | 0.15      |
	| approval       | 0.292 | 0.345     | 0.254  | 351     | 0.20      |
	| caring         | 0.320 | 0.270     | 0.393  | 135     | 0.15      |
	| confusion      | 0.291 | 0.276     | 0.307  | 153     | 0.15      |
	| curiosity      | 0.366 | 0.307     | 0.454  | 284     | 0.15      |
	| desire         | 0.317 | 0.269     | 0.386  | 83      | 0.15      |
	| disappointment | 0.159 | 0.127     | 0.212  | 151     | 0.10      |
	| disapproval    | 0.306 | 0.341     | 0.277  | 267     | 0.20      |
	| disgust        | 0.405 | 0.412     | 0.398  | 123     | 0.20      |
	| embarrassment  | 0.364 | 0.414     | 0.324  | 37      | 0.35      |
	| excitement     | 0.296 | 0.232     | 0.408  | 103     | 0.15      |
	| fear           | 0.496 | 0.576     | 0.436  | 78      | 0.40      |
	| gratitude      | 0.793 | 0.787     | 0.798  | 352     | 0.30      |
	| grief          | 0.323 | 0.200     | 0.833  | 6       | 0.45      |
	| joy            | 0.402 | 0.341     | 0.491  | 161     | 0.15      |
	| love           | 0.640 | 0.679     | 0.605  | 238     | 0.30      |
	| nervousness    | 0.263 | 0.333     | 0.217  | 23      | 0.70      |
	| optimism       | 0.433 | 0.453     | 0.414  | 186     | 0.20      |
	| pride          | 0.429 | 0.500     | 0.375  | 16      | 0.50      |
	| realization    | 0.177 | 0.159     | 0.200  | 145     | 0.10      |
	| relief         | 0.182 | 0.182     | 0.182  | 11      | 0.40      |
	| remorse        | 0.541 | 0.500     | 0.589  | 56      | 0.30      |
	| sadness        | 0.461 | 0.467     | 0.455  | 156     | 0.20      |
	| surprise       | 0.302 | 0.299     | 0.305  | 141     | 0.15      |
	| neutral        | 0.620 | 0.505     | 0.803  | 1787    | 0.30      |

	The thresholds are stored in `thresholds.json`.
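
The exact procedure used to pick these thresholds is not described here. One common approach, sketched below with made-up data, is a grid search per label that keeps the threshold with the highest F1. The 0.05 step is an assumption suggested by the values in the table, and `f1_at` and `best_threshold` are hypothetical helper names.

```python
import numpy as np


def f1_at(y_true, scores, threshold):
    """F1 for one label when scores >= threshold are treated as positive."""
    y_pred = scores >= threshold
    tp = np.sum(y_pred & (y_true == 1))
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0


def best_threshold(y_true, scores, grid=None):
    """Return (threshold, f1) for the grid point with the highest F1."""
    if grid is None:
        # Assumed grid: 0.05, 0.10, ..., 0.95
        grid = np.round(np.arange(1, 20) * 0.05, 2)
    f1s = [f1_at(y_true, scores, t) for t in grid]
    best = int(np.argmax(f1s))  # on ties, argmax keeps the lowest threshold
    return float(grid[best]), f1s[best]


# Toy data for a single label (made up): true labels and positive-case scores.
y_true = np.array([1, 1, 0, 1, 0, 0])
scores = np.array([0.62, 0.31, 0.28, 0.22, 0.12, 0.05])

threshold, f1 = best_threshold(y_true, scores)
print(threshold, round(f1, 3))
```

Repeating this for each of the 28 labels would give one threshold per label. Note that thresholds tuned and evaluated on the same split give an optimistic estimate, so a separate validation split is the safer choice for tuning.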

	### Use with ONNXRuntime

	The model's input is named `logits` (despite the name, it takes the sentence embeddings), and there is one output per label. Each output is a 2D array with one row per input row and two columns: the first is the probability of the negative case and the second is the probability of the positive case.

	```python
	import onnxruntime as ort

	# Assumes `embeddings` already holds all-MiniLM-L6-v2 embeddings for the input sentences,
	# e.g. produced with sentence-transformers:
	# huggingface.co/sentence-transformers/all-MiniLM-L6-v2
	# or with an ONNX version, e.g. huggingface.co/Xenova/all-MiniLM-L6-v2

	print(embeddings.shape)  # e.g. a batch of 1 sentence
	# > (1, 384)

	sess = ort.InferenceSession("path_to_model_dot_onnx", providers=['CPUExecutionProvider'])

	outputs = [o.name for o in sess.get_outputs()]  # list of labels, in the order of the outputs
	preds_onnx = sess.run(outputs, {'logits': embeddings})
	# preds_onnx is a list with 28 entries, one per label,
	# each a numpy array of shape (1, 2) given the input was a batch of 1

	print(outputs[0])
	# > surprise
	print(preds_onnx[0])
	# > array([[0.97136074, 0.02863926]], dtype=float32)

	# load thresholds.json and use that (per label) to convert the positive case score to a binary prediction
	```
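
The final thresholding step could look something like the sketch below. It assumes `thresholds.json` is a JSON object mapping each label name to its threshold; that layout is not confirmed here, so check the file before relying on it. The helper `to_binary` is hypothetical, and the arrays stand in for `outputs` and `preds_onnx` from the snippet above so the example runs without the model.

```python
import json

import numpy as np


def to_binary(labels, preds, thresholds):
    """Map each label to a 0/1 array with one entry per input row."""
    results = {}
    for label, probs in zip(labels, preds):
        positive = np.asarray(probs)[:, 1]  # column 1 is the positive-case probability
        results[label] = (positive >= thresholds[label]).astype(int)
    return results


# Stand-ins for `outputs` and `preds_onnx`. The surprise row is the example
# output shown above; the gratitude row is made up.
labels = ["surprise", "gratitude"]
preds = [
    np.array([[0.97136074, 0.02863926]], dtype=np.float32),
    np.array([[0.2, 0.8]], dtype=np.float32),
]

# Assumed file layout: {"label name": threshold, ...}. The two values here are
# taken from the per-label table. With the real file, something like:
#   with open("thresholds.json") as f:
#       thresholds = json.load(f)
thresholds = json.loads('{"surprise": 0.15, "gratitude": 0.30}')

binary = to_binary(labels, preds, thresholds)
print(binary)
```

Each value in `binary` has one entry per input row, so a batch of several sentences works the same way.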

	### Commentary on the dataset

	Some labels (e.g. gratitude) perform very strongly when considered independently, whilst others (e.g. relief) perform very poorly.

	This is a challenging dataset. Labels such as relief have far fewer examples in the training data (fewer than 100 out of the 40k+, and only 11 in the test split).

	There is also visible ambiguity and some labelling errors in the go_emotions training data, which are suspected to constrain performance. Cleaning the dataset to reduce the mistakes, ambiguity, conflicts and duplication in the labelling would likely produce a higher-performing model.
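
As one possible starting point for such cleaning, the sketch below flags texts that appear more than once with different label sets. It is an illustration only, not a procedure that was applied to this model: the rows are made up rather than taken from go_emotions, and `find_conflicting_duplicates` is a hypothetical helper.

```python
from collections import defaultdict


def find_conflicting_duplicates(examples):
    """Return texts whose duplicate copies carry different label sets.

    `examples` is an iterable of (text, labels) pairs. Texts are compared
    after lower-casing and collapsing whitespace.
    """
    seen = defaultdict(set)
    for text, labels in examples:
        key = " ".join(text.lower().split())
        seen[key].add(frozenset(labels))
    return {
        text: sorted(sorted(label_set) for label_set in label_sets)
        for text, label_sets in seen.items()
        if len(label_sets) > 1
    }


# Made-up rows for illustration.
examples = [
    ("Thank you so much!", ["gratitude"]),
    ("thank you  so much!", ["joy"]),
    ("That is odd.", ["confusion"]),
    ("That is odd.", ["confusion"]),
]

conflicts = find_conflicting_duplicates(examples)
print(conflicts)
```

Exact duplicates with identical labels are not reported here; they could be dropped separately. Deciding which label set is correct for a flagged text still needs manual review.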