Update README.md
README.md CHANGED
@@ -399,7 +399,6 @@ You can finetune this model on your own dataset.
### Metrics

#### Binary Classification
- * Dataset: `FineTuned_8`
* Evaluated with [<code>BinaryClassificationEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.BinaryClassificationEvaluator)

| Metric | Value |
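The evaluator linked above comes from the sentence-transformers library. As a minimal sketch of how such an evaluation can be run against this model (the model id is the repository name; the sentence pairs and labels below are placeholders, not the actual evaluation data):

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import BinaryClassificationEvaluator

# Load the claim-matching model from the Hugging Face Hub.
model = SentenceTransformer("Sami92/multiling-e5-large-instruct-claim-matching")

# Placeholder pairs: label 1 = the two posts make the same claim, 0 = they do not.
sentences1 = ["First post of pair 1 ...", "First post of pair 2 ..."]
sentences2 = ["Second post of pair 1 ...", "Second post of pair 2 ..."]
labels = [1, 0]

evaluator = BinaryClassificationEvaluator(sentences1, sentences2, labels, name="claim-matching-eval")

# Returns the metrics reported in the table below (accuracy, F1, precision,
# recall, average precision) -- as a dict or a single main score, depending
# on the sentence-transformers version.
results = evaluator(model)
print(results)
```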
@@ -440,6 +439,15 @@ You can finetune this model on your own dataset.
| max_recall | 0.3936 |
| **max_ap** | **0.5012** |

+
+ The following figure depicts F1, recall, and precision on the test data at different thresholds.
+ ![](./threshold_scores.jpg)
+
+
+ The following figure depicts how well matches and mismatches in the test data are separated by the model. To keep false positives to a minimum, a threshold above 0.91 is recommended; the optimal F1 score is reached at a threshold of 0.9050.
+ ![](./similarity_histogram.jpg)
+
+
<!--
## Bias, Risks and Limitations

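As a minimal, illustrative sketch of applying the recommended threshold at inference time (the model id is taken from this repository; the example posts are invented, and any task instruction or prefix the underlying E5 model may expect is omitted):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("Sami92/multiling-e5-large-instruct-claim-matching")

query_post = "Example Telegram post making a claim ..."
candidate_posts = [
    "A post repeating the same claim in different words ...",
    "An unrelated post ...",
]

# Embed the query and the candidates, then compare with cosine similarity.
query_emb = model.encode(query_post, convert_to_tensor=True)
cand_embs = model.encode(candidate_posts, convert_to_tensor=True)
scores = util.cos_sim(query_emb, cand_embs)[0]

# Per the figures above: > 0.91 keeps false positives low, 0.9050 maximises F1.
THRESHOLD = 0.91
for post, score in zip(candidate_posts, scores.tolist()):
    print(f"{score:.4f}  match={score > THRESHOLD}  {post[:50]}")
```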
@@ -455,6 +463,9 @@ You can finetune this model on your own dataset.
## Training Details

### Training Dataset
+ The model was trained on a weakly annotated dataset. The data was taken from Telegram, more specifically from a set of about 200 channels that have been subject to a fact-check by Correctiv, dpa, Faktenfuchs, or AFP.
+
+ Weak annotation was performed using GPT-4o. The model was prompted to find semantically identical posts using this [prompt](https://huggingface.co/Sami92/multiling-e5-large-instruct-claim-matching/blob/main/prompt.txt). For training, the cosine similarity of non-matches was reduced by 1.2, while for matches it was frozen at 0.98.

#### Unnamed Dataset

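A rough sketch of what the GPT-4o weak-annotation step might look like via the OpenAI API; the actual instruction lives in the linked prompt.txt, so the system prompt and answer parsing below are placeholders, and the 1.2/0.98 similarity adjustment described above is left out:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def same_claim(post_a: str, post_b: str) -> bool:
    """Ask GPT-4o whether two Telegram posts make the same claim (weak label)."""
    response = client.chat.completions.create(
        model="gpt-4o",
        temperature=0,
        messages=[
            # Placeholder instruction; the real prompt is in prompt.txt.
            {"role": "system", "content": "Decide whether the two posts make semantically identical claims. Answer only 'yes' or 'no'."},
            {"role": "user", "content": f"Post 1:\n{post_a}\n\nPost 2:\n{post_b}"},
        ],
    )
    return response.choices[0].message.content.strip().lower().startswith("yes")
```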
@@ -481,8 +492,8 @@ You can finetune this model on your own dataset.
```

### Evaluation Dataset
-
- ####
+ Evaluation was performed on a dataset from the same Telegram channels as the training data. Again, GPT-4o was used to identify matching claims; for the test data, however, trained annotators validated the results, and pairs that GPT-4o had classified as matches but that were in fact mismatches were removed. A ratio of 1:30 was chosen, i.e. for every match there are 30 mismatches. This is meant to reflect a realistic scenario in which most posts are not identical to a given query post.
+ #### Manually checked Telegram Dataset


* Size: 18,355 evaluation samples
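With a 1:30 ratio, roughly 1 in 31 of the 18,355 evaluation pairs is a match. As a small sketch of how such a split could be assembled from validated pairs (function and variable names are assumptions, not taken from the repository):

```python
import random

def build_eval_pairs(matches, mismatches, ratio=30, seed=42):
    """Keep every validated match and sample `ratio` mismatches per match."""
    rng = random.Random(seed)
    n_mismatch = min(len(mismatches), ratio * len(matches))
    return matches + rng.sample(mismatches, n_mismatch)
```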