Report for AdamCodd/distilbert-base-uncased-finetuned-sentiment-amazon

#42
by inoki-giskard - opened

Hey Team!🤗✨
We’re thrilled to share some amazing evaluation results that’ll make your day!🎉📊

We have identified 12 potential vulnerabilities in your model based on an automated scan.

This automated analysis evaluated the model on the dataset sst2 (subset default, split validation).

👉Overconfidence issues (2)
Vulnerability | Level | Data slice | Metric | Transformation | Deviation
Overconfidence | major 🔴 | avg_word_length(text) >= 4.481 | Overconfidence rate = 0.804 | - | +28.70% than global

🔍✨Examples: For records in the dataset where `avg_word_length(text)` >= 4.481, we found a significantly higher number of overconfident wrong predictions (37 samples, corresponding to 80.43% of the wrong predictions in the data slice).

index | text | avg_word_length(text) | label | Predicted label
95 | this riveting world war ii moral suspense story deals with the shadow side of american culture : racial prejudice in its ugly and diverse forms . | 4.61538 | negative | positive (p = 1.00), negative (p = 0.00)
643 | the jabs it employs are short , carefully placed and dead-center . | 4.58333 | positive | negative (p = 1.00), positive (p = 0.00)
218 | all that 's missing is the spontaneity , originality and delight . | 4.58333 | negative | positive (p = 0.99), negative (p = 0.01)
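
For reference, the overconfidence rate reported above is the share of wrong predictions that the model made with high confidence. A minimal sketch of that calculation follows. The `THRESHOLD` value and the helper name `overconfidence_rate` are assumptions made for illustration, not the scanner's actual settings.

```python
# Hypothetical cutoff: a wrong prediction counts as overconfident when the
# probability of the predicted label exceeds that of the true label by more
# than this margin. The scanner's real threshold may differ.
THRESHOLD = 0.5


def overconfidence_rate(wrong_predictions, threshold=THRESHOLD):
    """wrong_predictions: list of (p_predicted_label, p_true_label) pairs,
    one per misclassified sample. Returns the overconfident fraction."""
    if not wrong_predictions:
        return 0.0
    overconfident = sum(
        1 for p_pred, p_true in wrong_predictions if p_pred - p_true > threshold
    )
    return overconfident / len(wrong_predictions)


# The three examples listed above, plus one made-up low-confidence error.
examples = [(1.00, 0.00), (1.00, 0.00), (0.99, 0.01), (0.55, 0.45)]
rate = overconfidence_rate(examples)

# Reported figures: 37 overconfident samples at a rate of 0.804 imply
# about 46 wrong predictions in the slice.
implied_wrong = round(37 / 0.804)
```

Under these assumptions, three of the four toy errors are overconfident, and the reported numbers correspond to roughly 46 misclassified records in the slice.
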
Vulnerability | Level | Data slice | Metric | Transformation | Deviation
Overconfidence | major 🔴 | avg_whitespace(text) < 0.182 | Overconfidence rate = 0.804 | - | +28.70% than global

🔍✨Examples: For records in the dataset where `avg_whitespace(text)` < 0.182, we found a significantly higher number of overconfident wrong predictions (37 samples, corresponding to 80.43% of the wrong predictions in the data slice).

index | text | avg_whitespace(text) | label | Predicted label
95 | this riveting world war ii moral suspense story deals with the shadow side of american culture : racial prejudice in its ugly and diverse forms . | 0.178082 | negative | positive (p = 1.00), negative (p = 0.00)
643 | the jabs it employs are short , carefully placed and dead-center . | 0.179104 | positive | negative (p = 1.00), positive (p = 0.00)
218 | all that 's missing is the spontaneity , originality and delight . | 0.179104 | negative | positive (p = 0.99), negative (p = 0.01)
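
The slices in this report are defined by simple text statistics. The sketch below shows one way such features could be computed. The definitions are assumptions inferred from the reported values rather than taken from the scanner's source, and the trailing space on the sample sentence is also an assumption (it is what makes the numbers line up).

```python
def avg_word_length(text: str) -> float:
    """Mean length of whitespace-separated tokens (punctuation counts as a token)."""
    tokens = text.split()
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0


def avg_whitespace(text: str) -> float:
    """Fraction of characters that are whitespace."""
    return sum(c.isspace() for c in text) / len(text) if text else 0.0


def text_length(text: str) -> int:
    """Number of characters, including spaces."""
    return len(text)


# Record 643 from the report, with an assumed trailing space.
sample = "the jabs it employs are short , carefully placed and dead-center . "

awl = avg_word_length(sample)  # 55 characters over 12 tokens
aws = avg_whitespace(sample)   # 12 spaces over 67 characters
```

When every token is followed by exactly one space, `avg_whitespace` equals `1 / (1 + avg_word_length)`, so the thresholds 4.481 and 0.182 select essentially the same records. That would explain why both overconfidence findings report the same 37 samples.
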
👉Ethical issues (1)
Vulnerability | Level | Data slice | Metric | Transformation | Deviation
Ethical | medium 🟡 | - | Fail rate = 0.057 | Switch countries from high- to low-income and vice versa | 2/35 tested samples (5.71%) changed prediction after perturbation

🔍✨Examples: When feature “text” is perturbed with the transformation “Switch countries from high- to low-income and vice versa”, the model changes its prediction in 5.71% of the cases. We expected the predictions not to be affected by this transformation.

index | text | Switch countries from high- to low-income and vice versa(text) | Original prediction | Prediction after perturbation
149 | the volatile dynamics of female friendship is the subject of this unhurried , low-key film that is so off-hollywood that it seems positively french in its rhythms and resonance . | the volatile dynamics of female friendship is the subject of this unhurried , low-key film that is so off-hollywood that it seems positively South Sudanese in its rhythms and resonance . | positive (p = 0.85) | negative (p = 0.55)
236 | not since japanese filmmaker akira kurosawa 's ran have the savagery of combat and the specter of death been visualized with such operatic grandeur . | not since Malagasy filmmaker akira kurosawa 's ran have the savagery of combat and the specter of death been visualized with such operatic grandeur . | negative (p = 0.51) | positive (p = 0.79)
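
The check above is an invariance test: swap a country term, re-run the model, and count how often the label flips. A minimal sketch of that idea follows. The `SWAPS` table, the helper names, and `toy_predict` are all hypothetical stand-ins, since the report does not show the scanner's word list or the real classifier, and the sample texts are shortened from the rows above.

```python
import re

# Hypothetical swap table for illustration only (lowercase demonyms).
SWAPS = {"french": "south sudanese", "japanese": "malagasy"}
SWAPS.update({v: k for k, v in list(SWAPS.items())})  # make it bidirectional

_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, SWAPS)) + r")\b", re.IGNORECASE
)


def switch_countries(text: str) -> str:
    """Replace each known country term with its counterpart."""
    return _PATTERN.sub(lambda m: SWAPS[m.group(0).lower()], text)


def fail_rate(predict, texts, transform):
    """Share of perturbed texts whose predicted label changes.
    Texts left unchanged by the transform are not counted as tested."""
    tested = [(t, transform(t)) for t in texts if transform(t) != t]
    if not tested:
        return 0.0
    changed = sum(predict(orig) != predict(pert) for orig, pert in tested)
    return changed / len(tested)


# Toy stand-in for the real classifier, only to make the sketch runnable.
def toy_predict(text: str) -> str:
    return "positive" if "french" in text else "negative"


texts = [
    "it seems positively french in its rhythms and resonance .",
    "not since japanese filmmaker akira kurosawa 's ran",
    "this movie is maddening .",  # no country term, so it is not tested
]
rate = fail_rate(toy_predict, texts, switch_countries)
```

With the toy classifier, one of the two tested sentences flips, giving a fail rate of 0.5. The report's figure is the same ratio computed on the real model: 2 flips out of 35 tested samples, or about 0.057.
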
👉Robustness issues (1)
Vulnerability | Level | Data slice | Metric | Transformation | Deviation
Robustness | medium 🟡 | - | Fail rate = 0.099 | Add typos | 80/811 tested samples (9.86%) changed prediction after perturbation

🔍✨Examples: When feature “text” is perturbed with the transformation “Add typos”, the model changes its prediction in 9.86% of the cases. We expected the predictions not to be affected by this transformation.

index | text | Add typos(text) | Original prediction | Prediction after perturbation
1 | unflinchingly bleak and desperate | unflinchingly bleak nd desperate | positive (p = 0.86) | negative (p = 0.70)
20 | pumpkin takes an admirable look at the hypocrisy of political correctness , but it does so with such an uneven tone that you never know when humor ends and tragedy begins . | pumpkin takes an admirable lokok at the hypocriwy of politicql correcness , but it does so with such an uneven tone that you never know when humor ends and tragedy begins . | negative (p = 0.62) | positive (p = 0.58)
21 | the iditarod lasts for days - this just felt like it did . | the iditarod ladts or days - this jus felt like it did . | negative (p = 0.50) | positive (p = 0.83)
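
To try a similar robustness check locally, a typo perturbation can be approximated with random character edits. The sketch below is one possible version, not the scanner's implementation. The function names, the `rate` of 0.2, and the fixed `seed` are assumptions chosen for illustration.

```python
import random
import string


def add_typo(word: str, rng: random.Random) -> str:
    """Apply one random character-level edit: delete, substitute, insert, or swap."""
    if len(word) < 2:
        return word  # leave single characters such as "," and "." alone
    i = rng.randrange(len(word) - 1)
    op = rng.choice(["delete", "substitute", "insert", "swap"])
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "substitute":
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i + 1:]
    if op == "insert":
        return word[:i] + rng.choice(string.ascii_lowercase) + word[i:]
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]  # swap neighbours


def add_typos(text: str, rate: float = 0.2, seed: int = 0) -> str:
    """Perturb roughly `rate` of the space-separated words in `text`."""
    rng = random.Random(seed)
    words = text.split(" ")
    return " ".join(add_typo(w, rng) if rng.random() < rate else w for w in words)


original = "the iditarod lasts for days - this just felt like it did ."
perturbed = add_typos(original, rate=0.5, seed=1)
```

Feeding `original` and `perturbed` to the classifier and comparing the predicted labels gives one sample of the fail rate reported above (80 of 811 tested samples, about 9.86%).
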
👉Performance issues (8)
Vulnerability | Level | Data slice | Metric | Transformation | Deviation
Performance | major 🔴 | idx >= 63.500 AND idx < 115.500 | Accuracy = 0.750 | - | -14.84% than global

🔍✨Examples: For records in the dataset where `idx` >= 63.500 AND `idx` < 115.500, the Accuracy is 14.84% lower than the global Accuracy.

index | idx | label | Predicted label
64 | 64 | negative | positive (p = 0.99)
70 | 70 | negative | positive (p = 0.83)
78 | 78 | positive | negative (p = 0.54)
Vulnerability | Level | Data slice | Metric | Transformation | Deviation
Performance | major 🔴 | text_length(text) < 37.500 | Recall = 0.800 | - | -12.08% than global

🔍✨Examples: For records in the dataset where `text_length(text)` < 37.500, the Recall is 12.08% lower than the global Recall.

index | text | text_length(text) | label | Predicted label
1 | unflinchingly bleak and desperate | 34 | negative | positive (p = 0.86)
112 | hilariously inept and ridiculous . | 35 | positive | negative (p = 0.99)
113 | this movie is maddening . | 26 | negative | positive (p = 0.96)
Vulnerability | Level | Data slice | Metric | Transformation | Deviation
Performance | major 🔴 | text_length(text) < 65.500 AND text_length(text) >= 56.500 | Precision = 0.769 | - | -10.89% than global

🔍✨Examples: For records in the dataset where `text_length(text)` < 65.500 AND `text_length(text)` >= 56.500, the Precision is 10.89% lower than the global Precision.

index | text | text_length(text) | label | Predicted label
92 | you wo n't like roger , but you will quickly recognize him . | 61 | negative | positive (p = 0.75)
183 | the lower your expectations , the more you 'll enjoy it . | 58 | negative | positive (p = 0.97)
312 | i 'll bet the video game is a lot more fun than the film . | 59 | negative | positive (p = 0.60)
Vulnerability | Level | Data slice | Metric | Transformation | Deviation
Performance | medium 🟡 | idx >= 740.500 AND idx < 793.500 | Recall = 0.828 | - | -9.05% than global

🔍✨Examples: For records in the dataset where `idx` >= 740.500 AND `idx` < 793.500, the Recall is 9.05% lower than the global Recall.

index | idx | label | Predicted label
741 | 741 | positive | negative (p = 0.81)
742 | 742 | positive | negative (p = 0.89)
749 | 749 | positive | negative (p = 0.92)
Vulnerability | Level | Data slice | Metric | Transformation | Deviation
Performance | medium 🟡 | avg_word_length(text) >= 4.635 AND avg_word_length(text) < 4.743 | Recall = 0.828 | - | -9.05% than global

🔍✨Examples: For records in the dataset where `avg_word_length(text)` >= 4.635 AND `avg_word_length(text)` < 4.743, the Recall is 9.05% lower than the global Recall.

index | text | avg_word_length(text) | label | Predicted label
64 | the script kicks in , and mr. hartley 's distended pace and foot-dragging rhythms follow . | 4.6875 | negative | positive (p = 0.99)
223 | corny , schmaltzy and predictable , but still manages to be kind of heartwarming , nonetheless . | 4.70588 | positive | negative (p = 0.99)
248 | a full world has been presented onscreen , not some series of carefully structured plot points building to a pat resolution . | 4.72727 | positive | negative (p = 0.54)
Vulnerability | Level | Data slice | Metric | Transformation | Deviation
Performance | medium 🟡 | avg_whitespace(text) < 0.177 AND avg_whitespace(text) >= 0.174 | Recall = 0.828 | - | -9.05% than global

🔍✨Examples: For records in the dataset where `avg_whitespace(text)` < 0.177 AND `avg_whitespace(text)` >= 0.174, the Recall is 9.05% lower than the global Recall.

index | text | avg_whitespace(text) | label | Predicted label
64 | the script kicks in , and mr. hartley 's distended pace and foot-dragging rhythms follow . | 0.175824 | negative | positive (p = 0.99)
223 | corny , schmaltzy and predictable , but still manages to be kind of heartwarming , nonetheless . | 0.175258 | positive | negative (p = 0.99)
248 | a full world has been presented onscreen , not some series of carefully structured plot points building to a pat resolution . | 0.174603 | positive | negative (p = 0.54)
Vulnerability | Level | Data slice | Metric | Transformation | Deviation
Performance | medium 🟡 | idx >= 217.500 AND idx < 262.500 | Recall = 0.840 | - | -7.68% than global

🔍✨Examples: For records in the dataset where `idx` >= 217.500 AND `idx` < 262.500, the Recall is 7.68% lower than the global Recall.

index | idx | label | Predicted label
218 | 218 | negative | positive (p = 0.99)
223 | 223 | positive | negative (p = 0.99)
230 | 230 | positive | negative (p = 0.97)
Vulnerability | Level | Data slice | Metric | Transformation | Deviation
Performance | medium 🟡 | idx >= 338.500 AND idx < 382.500 | Precision = 0.800 | - | -7.33% than global

🔍✨Examples: For records in the dataset where `idx` >= 338.500 AND `idx` < 382.500, the Precision is 7.33% lower than the global Precision.

index | idx | label | Predicted label
339 | 339 | positive | negative (p = 0.64)
346 | 346 | negative | positive (p = 0.99)
356 | 356 | negative | positive (p = 0.64)
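
The performance findings give each slice metric together with a percentage deviation from the global value, but not the global value itself. Assuming the deviation is relative, that is `(slice - global) / global`, the global metrics can be backed out as sketched below. The helper names are illustrative, and the relative-deviation reading is an assumption based on the report's wording.

```python
def relative_deviation(slice_metric: float, global_metric: float) -> float:
    """Deviation of a slice metric from the global metric, as a fraction."""
    return (slice_metric - global_metric) / global_metric


def implied_global(slice_metric: float, deviation: float) -> float:
    """Invert relative_deviation to recover the global metric."""
    return slice_metric / (1.0 + deviation)


# (metric, slice value, reported deviation) taken from the performance rows above.
rows = [
    ("Accuracy", 0.750, -0.1484),
    ("Recall", 0.800, -0.1208),
    ("Recall", 0.828, -0.0905),
    ("Recall", 0.840, -0.0768),
    ("Precision", 0.769, -0.1089),
    ("Precision", 0.800, -0.0733),
]

for name, value, dev in rows:
    print(f"{name}: slice={value:.3f} implied global={implied_global(value, dev):.3f}")
```

Under this assumption the rows agree with one another: they point to a global accuracy of about 0.88, recall of about 0.91, and precision of about 0.86 on the sst2 validation split.
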

Disclaimer: it's important to note that automated scans may produce false positives or miss certain vulnerabilities. We encourage you to review the findings and assess the impact accordingly.

💡 What's Next?

  • Check out the Giskard Space and improve your model.
  • The Giskard community is always buzzing with ideas. 🐢🤔 What do you want to see next? Your feedback is our favorite fuel, so drop your thoughts in the community forum! 🗣️💬 Together, we're building something extraordinary.

🙌 Big Thanks!

We're grateful to have you on this adventure with us. 🚀🌟 Here's to more breakthroughs, laughter, and code magic! 🥂✨ Keep hugging that code and spreading the love! 💻 #Giskard #Huggingface #AISafety 🌈👏 Your enthusiasm, feedback, and contributions are what we seek. 🌟 Keep being awesome!
