Sami92 committed on
Commit 426f593
1 Parent(s): 610be85

Update README.md

Files changed (1)
  1. README.md +14 -3
README.md CHANGED
@@ -399,7 +399,6 @@ You can finetune this model on your own dataset.
### Metrics

#### Binary Classification
- * Dataset: `FineTuned_8`
* Evaluated with [<code>BinaryClassificationEvaluator</code>](https://sbert.net/docs/package_reference/sentence_transformer/evaluation.html#sentence_transformers.evaluation.BinaryClassificationEvaluator)

| Metric | Value |
@@ -440,6 +439,15 @@ You can finetune this model on your own dataset.
| max_recall | 0.3936 |
| **max_ap** | **0.5012** |

+
+ The following figure depicts F1, recall, and precision on the test data for different thresholds.
+ ![](./threshold_scores.jpg)
+
+
+ The following figure depicts how well matches and mismatches in the test data are separated by the model. To keep false positives to a minimum, a threshold higher than 0.91 is recommended; the optimal F1 score is reached at a threshold of 0.9050.
+ ![](./similarity_histogram.jpg)
+
+
<!--
## Bias, Risks and Limitations
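
Not part of the commit itself, but as a quick illustration of how the recommended thresholds above could be applied: a minimal sketch using sentence-transformers, assuming the repository id `Sami92/multiling-e5-large-instruct-claim-matching` (taken from the prompt link further down) and invented example posts.

```python
from sentence_transformers import SentenceTransformer, util

# Checkpoint id assumed from the prompt link in this card; example posts are invented.
model = SentenceTransformer("Sami92/multiling-e5-large-instruct-claim-matching")

query_post = "Politician X said all trains will be free from next year."
candidate_posts = [
    "According to politician X, train rides will cost nothing starting next year.",
    "The city opened a new swimming pool last weekend.",
]

# Encode and compare with cosine similarity.
query_emb = model.encode(query_post, convert_to_tensor=True)
candidate_emb = model.encode(candidate_posts, convert_to_tensor=True)
scores = util.cos_sim(query_emb, candidate_emb)[0]

# 0.9050 maximises F1 on the test data; use > 0.91 to keep false positives low.
threshold = 0.9050
for post, score in zip(candidate_posts, scores.tolist()):
    print(f"{score:.4f}  match={score > threshold}  {post!r}")
```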
@@ -455,6 +463,9 @@ You can finetune this model on your own dataset.
## Training Details

### Training Dataset
+ The model was trained on a weakly annotated dataset. The data was collected from Telegram, specifically from a set of about 200 channels that have been subject to a fact-check by Correctiv, dpa, Faktenfuchs, or AFP.
+
+ Weak annotation was performed with GPT-4o, which was prompted to identify semantically identical posts using this [prompt](https://huggingface.co/Sami92/multiling-e5-large-instruct-claim-matching/blob/main/prompt.txt). For training, the cosine similarity of non-matches was reduced by 1.2, while the similarity of matches was fixed at 0.98.

#### Unnamed Dataset
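
Again outside the commit, a minimal sketch of how such weak labels could be turned into sentence-transformers training pairs. The helper name is hypothetical, and reading "reduced by 1.2" as division by 1.2 is an assumption; the card does not state the exact operation or the loss that was used.

```python
from sentence_transformers import InputExample

def make_weak_example(post_a: str, post_b: str, gpt4o_match: bool, cos_sim: float) -> InputExample:
    # Matches are frozen at 0.98; non-matches are scaled down, here by dividing the
    # original similarity by 1.2 (one possible reading of "reduced by 1.2").
    label = 0.98 if gpt4o_match else cos_sim / 1.2
    return InputExample(texts=[post_a, post_b], label=label)

# Hypothetical pair: labelled a non-match by GPT-4o with an initial similarity of 0.84.
example = make_weak_example("Post A ...", "Post B ...", gpt4o_match=False, cos_sim=0.84)
print(example.label)  # 0.7
```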
@@ -481,8 +492,8 @@ You can finetune this model on your own dataset.
```

### Evaluation Dataset
-
- #### Unnamed Dataset
+ Evaluation was performed on a dataset from the same Telegram channels as the training data. Again, GPT-4o was used to identify matching claims, but for the test data trained annotators validated the results and removed pairs that GPT-4o had wrongly classified as matches. A ratio of 1:30 was chosen, i.e. for every match there are 30 mismatches. This is meant to reflect a realistic scenario in which far more posts are not identical to a query post.
+ #### Manually checked Telegram Dataset


* Size: 18,355 evaluation samples
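
For completeness, a hedged sketch of running the `BinaryClassificationEvaluator` named above on pairs like those in the evaluation set; the pairs shown here are invented and the `name` argument is arbitrary. Depending on the sentence-transformers version, calling the evaluator returns either the main score or a dictionary of metrics such as `max_ap`.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import BinaryClassificationEvaluator

# Invented pairs; the real test set has 18,355 pairs at a 1:30 match/mismatch ratio.
sentences1 = ["Claim from a query post", "Claim from a query post"]
sentences2 = ["A paraphrase of the same claim", "An unrelated Telegram post"]
labels = [1, 0]  # 1 = match, 0 = mismatch

model = SentenceTransformer("Sami92/multiling-e5-large-instruct-claim-matching")
evaluator = BinaryClassificationEvaluator(sentences1, sentences2, labels, name="telegram-claims")

# Returns the main score (older versions) or a dict of metrics such as max_ap (v3+).
print(evaluator(model))
```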
 