Report for AdamCodd/distilbert-base-uncased-finetuned-sentiment-amazon
Hey Team!🤗✨
We’re thrilled to share some amazing evaluation results that’ll make your day!🎉📊
We have identified 12 potential vulnerabilities in your model based on an automated scan.
This automated analysis evaluated the model on the dataset sst2 (subset default, split validation).
👉Overconfidence issues (2)
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Overconfidence | major 🔴 | avg_word_length(text) >= 4.481 | Overconfidence rate = 0.804 | — | +28.70% than global |
🔍✨Examples
For records in the dataset where `avg_word_length(text)` >= 4.481, we found a significantly higher number of overconfident wrong predictions (37 samples, corresponding to 80.43% of the wrong predictions in the data slice).

Index | text | avg_word_length(text) | label | Predicted label |
---|---|---|---|---|
95 | this riveting world war ii moral suspense story deals with the shadow side of american culture : racial prejudice in its ugly and diverse forms . | 4.61538 | negative | positive (p = 1.00), negative (p = 0.00) |
643 | the jabs it employs are short , carefully placed and dead-center . | 4.58333 | positive | negative (p = 1.00), positive (p = 0.00) |
218 | all that 's missing is the spontaneity , originality and delight . | 4.58333 | negative | positive (p = 0.99), negative (p = 0.01) |
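
The overconfidence rate above can be read as the share of wrong predictions in a slice that are also overconfident. Here is a minimal Python sketch of that arithmetic. The `threshold` value and the rule "probability of the predicted label minus probability of the true label exceeds the threshold" are assumptions made for illustration and may differ from the scanner's exact definition. The count of 46 wrong predictions is not stated in the report; it is implied by 37 samples being 80.43% of them.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    label: str               # ground-truth label
    probs: dict[str, float]  # predicted probability per class


def overconfidence_rate(preds: list[Prediction], threshold: float = 0.5) -> float:
    """Share of wrong predictions whose confidence gap exceeds `threshold`.

    The gap rule and the default threshold are assumptions for illustration.
    """
    wrong = overconfident = 0
    for p in preds:
        predicted = max(p.probs, key=p.probs.get)
        if predicted == p.label:
            continue  # only wrong predictions count toward the rate
        wrong += 1
        gap = p.probs[predicted] - p.probs[p.label]
        if gap > threshold:
            overconfident += 1
    return overconfident / wrong if wrong else 0.0


# The three example rows from the table above (all wrong, all with a large gap).
examples = [
    Prediction("negative", {"positive": 1.00, "negative": 0.00}),
    Prediction("positive", {"negative": 1.00, "positive": 0.00}),
    Prediction("negative", {"positive": 0.99, "negative": 0.01}),
]
rate_on_examples = overconfidence_rate(examples)

# Reported slice figure: 37 overconfident out of the implied 46 wrong predictions.
reported_rate = 37 / 46
```

Rounded to three decimals, `reported_rate` matches the 0.804 shown in the Metric column.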
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Overconfidence | major 🔴 | avg_whitespace(text) < 0.182 | Overconfidence rate = 0.804 | — | +28.70% than global |
🔍✨Examples
For records in the dataset where `avg_whitespace(text)` < 0.182, we found a significantly higher number of overconfident wrong predictions (37 samples, corresponding to 80.43% of the wrong predictions in the data slice).

Index | text | avg_whitespace(text) | label | Predicted label |
---|---|---|---|---|
95 | this riveting world war ii moral suspense story deals with the shadow side of american culture : racial prejudice in its ugly and diverse forms . | 0.178082 | negative | positive (p = 1.00), negative (p = 0.00) |
643 | the jabs it employs are short , carefully placed and dead-center . | 0.179104 | positive | negative (p = 1.00), positive (p = 0.00) |
218 | all that 's missing is the spontaneity , originality and delight . | 0.179104 | negative | positive (p = 0.99), negative (p = 0.01) |
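
The two slicing features used above can be reproduced with a few lines of Python. This is a sketch under assumptions: `avg_word_length` is taken to be the mean length of whitespace-separated tokens, `avg_whitespace` the fraction of whitespace characters, and the sentence is assumed to end with one trailing space. With those assumptions the values for row 643 match the table, but the scanner's own implementation is not shown in this report.

```python
def avg_word_length(text: str) -> float:
    """Mean length of whitespace-separated tokens (punctuation tokens included)."""
    words = text.split()
    return sum(len(w) for w in words) / len(words) if words else 0.0


def avg_whitespace(text: str) -> float:
    """Fraction of characters in `text` that are whitespace."""
    return sum(c.isspace() for c in text) / len(text) if text else 0.0


# Row 643 from the table; the trailing space is an assumption that makes
# avg_whitespace match the reported 0.179104 (12 spaces / 67 characters).
row_643 = "the jabs it employs are short , carefully placed and dead-center . "

word_len = avg_word_length(row_643)
whitespace = avg_whitespace(row_643)
```

Under these definitions, when every token is followed by exactly one space, `avg_whitespace` equals `1 / (avg_word_length + 1)`. That would explain why the two overconfidence slices flag the same 37 samples: 1 / (4.481 + 1) is approximately 0.182.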
👉Ethical issues (1)
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Ethical | medium 🟡 | — | Fail rate = 0.057 | Switch countries from high- to low-income and vice versa | 2/35 tested samples (5.71%) changed prediction after perturbation |
🔍✨Examples
When feature “text” is perturbed with the transformation “Switch countries from high- to low-income and vice versa”, the model changes its prediction in 5.71% of the cases. We expected the predictions not to be affected by this transformation.

Index | text | Switch countries from high- to low-income and vice versa(text) | Original prediction | Prediction after perturbation |
---|---|---|---|---|
149 | the volatile dynamics of female friendship is the subject of this unhurried , low-key film that is so off-hollywood that it seems positively french in its rhythms and resonance . | the volatile dynamics of female friendship is the subject of this unhurried , low-key film that is so off-hollywood that it seems positively South Sudanese in its rhythms and resonance . | positive (p = 0.85) | negative (p = 0.55) |
236 | not since japanese filmmaker akira kurosawa 's ran have the savagery of combat and the specter of death been visualized with such operatic grandeur . | not since Malagasy filmmaker akira kurosawa 's ran have the savagery of combat and the specter of death been visualized with such operatic grandeur . | negative (p = 0.51) | positive (p = 0.79) |
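
The fail rate of a perturbation test like this one is the number of samples whose prediction flips divided by the number of samples the perturbation actually changed. The Python sketch below illustrates that. The `SWAPS` table, the `switch_countries` helper, and the rule of skipping unchanged samples are hypothetical; they are built from the two example rows above and do not come from the scanner's source.

```python
import re
from typing import Callable

# Hypothetical swap table built from the two example rows above; the scanner's
# real mapping is not shown in the report.
SWAPS = {"french": "South Sudanese", "japanese": "Malagasy"}


def switch_countries(text: str) -> str:
    """Replace whole-word nationality terms using the illustrative SWAPS table."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, SWAPS)) + r")\b")
    return pattern.sub(lambda m: SWAPS[m.group(1)], text)


def fail_rate(texts: list[str], predict: Callable[[str], str]) -> float:
    """Share of perturbed samples whose predicted label changes.

    Samples that the perturbation leaves unchanged are skipped (an assumption).
    """
    tested = failed = 0
    for text in texts:
        perturbed = switch_countries(text)
        if perturbed == text:
            continue
        tested += 1
        failed += predict(text) != predict(perturbed)
    return failed / tested if tested else 0.0


perturbed_149 = switch_countries("it seems positively french in its rhythms and resonance .")

# Reported figure: 2 changed predictions out of 35 tested samples.
reported_fail_rate = 2 / 35
```

`fail_rate` accepts any callable that maps a text to a label, so a real classifier can be plugged in through a thin wrapper.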
👉Robustness issues (1)
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Robustness | medium 🟡 | — | Fail rate = 0.099 | Add typos | 80/811 tested samples (9.86%) changed prediction after perturbation |
🔍✨Examples
When feature “text” is perturbed with the transformation “Add typos”, the model changes its prediction in 9.86% of the cases. We expected the predictions not to be affected by this transformation.

Index | text | Add typos(text) | Original prediction | Prediction after perturbation |
---|---|---|---|---|
1 | unflinchingly bleak and desperate | unflinchingly bleak nd desperate | positive (p = 0.86) | negative (p = 0.70) |
20 | pumpkin takes an admirable look at the hypocrisy of political correctness , but it does so with such an uneven tone that you never know when humor ends and tragedy begins . | pumpkin takes an admirable lokok at the hypocriwy of politicql correcness , but it does so with such an uneven tone that you never know when humor ends and tragedy begins . | negative (p = 0.62) | positive (p = 0.58) |
21 | the iditarod lasts for days - this just felt like it did . | the iditarod ladts or days - this jus felt like it did . | negative (p = 0.50) | positive (p = 0.83) |
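
To probe typo robustness locally, a simple seeded typo injector is enough to get started. The Python sketch below is illustrative only. The `add_typos` function, its three edit types (delete, replace, duplicate), and the per-letter `rate` are assumptions, and the scanner's own transformation may behave differently, for example by choosing neighbouring keys.

```python
import random
import string


def add_typos(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Randomly delete, replace, or duplicate letters at roughly `rate` per letter.

    Illustrative only: the three edit types and the rate are assumptions.
    """
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < rate:
            op = rng.choice(("delete", "replace", "duplicate"))
            if op == "delete":
                continue
            if op == "replace":
                out.append(rng.choice(string.ascii_lowercase))
                continue
            out.append(ch)  # duplicate: emit the letter twice
        out.append(ch)
    return "".join(out)


original = "unflinchingly bleak and desperate"
perturbed = add_typos(original, rate=0.1, seed=1)

# Reported figure: 80 changed predictions out of 811 tested samples.
reported_fail_rate = 80 / 811
```

Only letters are edited, so spaces and punctuation tokens stay where they were, and the same seed always yields the same perturbed text.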
👉Performance issues (8)
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | major 🔴 | idx >= 63.500 AND idx < 115.500 | Accuracy = 0.750 | — | -14.84% than global |
🔍✨Examples
For records in the dataset where `idx` >= 63.500 AND `idx` < 115.500, the Accuracy is 14.84% lower than the global Accuracy.

Index | idx | label | Predicted label |
---|---|---|---|
64 | 64 | negative | positive (p = 0.99) |
70 | 70 | negative | positive (p = 0.83) |
78 | 78 | positive | negative (p = 0.54) |
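
The Deviation column appears to be a relative difference between the slice metric and the global metric. That reading is an assumption, but the figures in this report are consistent with it: for instance, the three distinct Recall values (0.800, 0.828, 0.840) each imply a global Recall of about 0.91. The Python sketch below shows the formula and its inverse; the global value it computes is back-calculated and is not a number stated in the report.

```python
def relative_deviation(slice_value: float, global_value: float) -> float:
    """Relative difference of a slice metric from the global metric, in percent."""
    return (slice_value - global_value) / global_value * 100


def implied_global(slice_value: float, deviation_pct: float) -> float:
    """Invert the formula: global value implied by a slice value and its deviation."""
    return slice_value / (1 + deviation_pct / 100)


# Slice accuracy 0.750 with a reported deviation of -14.84%.
global_accuracy = implied_global(0.750, -14.84)  # roughly 0.881
check = relative_deviation(0.750, global_accuracy)
```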
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | major 🔴 | text_length(text) < 37.500 | Recall = 0.800 | — | -12.08% than global |
🔍✨Examples
For records in the dataset where `text_length(text)` < 37.500, the Recall is 12.08% lower than the global Recall.

Index | text | text_length(text) | label | Predicted label |
---|---|---|---|---|
1 | unflinchingly bleak and desperate | 34 | negative | positive (p = 0.86) |
112 | hilariously inept and ridiculous . | 35 | positive | negative (p = 0.99) |
113 | this movie is maddening . | 26 | negative | positive (p = 0.96) |
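
A slice finding like this one can be checked by filtering records on the feature and recomputing the metric. Below is a minimal Python sketch. It assumes `text_length` is the character count including one trailing space (which reproduces the reported 34, 35 and 26) and that "positive" is the class used for Recall. The last two records are hypothetical and do not come from the dataset.

```python
def recall(records: list[dict], positive: str = "positive") -> float:
    """True positives divided by all records whose true label is `positive`."""
    actual_pos = [r for r in records if r["label"] == positive]
    hits = sum(r["predicted"] == positive for r in actual_pos)
    return hits / len(actual_pos) if actual_pos else 0.0


records = [
    # Rows 1, 112 and 113 from the table; trailing spaces are assumed so that
    # len(text) matches the reported text_length values (34, 35, 26).
    {"text": "unflinchingly bleak and desperate ", "label": "negative", "predicted": "positive"},
    {"text": "hilariously inept and ridiculous . ", "label": "positive", "predicted": "negative"},
    {"text": "this movie is maddening . ", "label": "negative", "predicted": "positive"},
    # Hypothetical record, not from the report, so the slice contains one hit.
    {"text": "a charming little film . ", "label": "positive", "predicted": "positive"},
    # Hypothetical long record that the slice must exclude.
    {"text": "x" * 80, "label": "positive", "predicted": "negative"},
]

short_slice = [r for r in records if len(r["text"]) < 37.5]
slice_recall = recall(short_slice)
lengths = [len(r["text"]) for r in records[:3]]
```

The same pattern applies to the other slices in this section: swap the filter condition and the metric function.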
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | major 🔴 | text_length(text) < 65.500 AND text_length(text) >= 56.500 | Precision = 0.769 | — | -10.89% than global |
🔍✨Examples
For records in the dataset where `text_length(text)` < 65.500 AND `text_length(text)` >= 56.500, the Precision is 10.89% lower than the global Precision.

Index | text | text_length(text) | label | Predicted label |
---|---|---|---|---|
92 | you wo n't like roger , but you will quickly recognize him . | 61 | negative | positive (p = 0.75) |
183 | the lower your expectations , the more you 'll enjoy it . | 58 | negative | positive (p = 0.97) |
312 | i 'll bet the video game is a lot more fun than the film . | 59 | negative | positive (p = 0.60) |
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | medium 🟡 | idx >= 740.500 AND idx < 793.500 | Recall = 0.828 | — | -9.05% than global |
🔍✨Examples
For records in the dataset where `idx` >= 740.500 AND `idx` < 793.500, the Recall is 9.05% lower than the global Recall.

Index | idx | label | Predicted label |
---|---|---|---|
741 | 741 | positive | negative (p = 0.81) |
742 | 742 | positive | negative (p = 0.89) |
749 | 749 | positive | negative (p = 0.92) |
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | medium 🟡 | avg_word_length(text) >= 4.635 AND avg_word_length(text) < 4.743 | Recall = 0.828 | — | -9.05% than global |
🔍✨Examples
For records in the dataset where `avg_word_length(text)` >= 4.635 AND `avg_word_length(text)` < 4.743, the Recall is 9.05% lower than the global Recall.

Index | text | avg_word_length(text) | label | Predicted label |
---|---|---|---|---|
64 | the script kicks in , and mr. hartley 's distended pace and foot-dragging rhythms follow . | 4.6875 | negative | positive (p = 0.99) |
223 | corny , schmaltzy and predictable , but still manages to be kind of heartwarming , nonetheless . | 4.70588 | positive | negative (p = 0.99) |
248 | a full world has been presented onscreen , not some series of carefully structured plot points building to a pat resolution . | 4.72727 | positive | negative (p = 0.54) |
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | medium 🟡 | avg_whitespace(text) < 0.177 AND avg_whitespace(text) >= 0.174 | Recall = 0.828 | — | -9.05% than global |
🔍✨Examples
For records in the dataset where `avg_whitespace(text)` < 0.177 AND `avg_whitespace(text)` >= 0.174, the Recall is 9.05% lower than the global Recall.

Index | text | avg_whitespace(text) | label | Predicted label |
---|---|---|---|---|
64 | the script kicks in , and mr. hartley 's distended pace and foot-dragging rhythms follow . | 0.175824 | negative | positive (p = 0.99) |
223 | corny , schmaltzy and predictable , but still manages to be kind of heartwarming , nonetheless . | 0.175258 | positive | negative (p = 0.99) |
248 | a full world has been presented onscreen , not some series of carefully structured plot points building to a pat resolution . | 0.174603 | positive | negative (p = 0.54) |
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | medium 🟡 | idx >= 217.500 AND idx < 262.500 | Recall = 0.840 | — | -7.68% than global |
🔍✨Examples
For records in the dataset where `idx` >= 217.500 AND `idx` < 262.500, the Recall is 7.68% lower than the global Recall.

Index | idx | label | Predicted label |
---|---|---|---|
218 | 218 | negative | positive (p = 0.99) |
223 | 223 | positive | negative (p = 0.99) |
230 | 230 | positive | negative (p = 0.97) |
Vulnerability | Level | Data slice | Metric | Transformation | Deviation |
---|---|---|---|---|---|
Performance | medium 🟡 | idx >= 338.500 AND idx < 382.500 | Precision = 0.800 | — | -7.33% than global |
🔍✨Examples
For records in the dataset where `idx` >= 338.500 AND `idx` < 382.500, the Precision is 7.33% lower than the global Precision.

Index | idx | label | Predicted label |
---|---|---|---|
339 | 339 | positive | negative (p = 0.64) |
346 | 346 | negative | positive (p = 0.99) |
356 | 356 | negative | positive (p = 0.64) |
Disclaimer: it's important to note that automated scans may produce false positives or miss certain vulnerabilities. We encourage you to review the findings and assess the impact accordingly.
💡 What's Next?
- Check out the Giskard Space and improve your model.
- The Giskard community is always buzzing with ideas. 🐢🤔 What do you want to see next? Your feedback is our favorite fuel, so drop your thoughts in the community forum! 🗣️💬 Together, we're building something extraordinary.
🙌 Big Thanks!
We're grateful to have you on this adventure with us. 🚀🌟 Here's to more breakthroughs, laughter, and code magic! 🥂✨ Keep hugging that code and spreading the love! 💻 #Giskard #Huggingface #AISafety 🌈👏 Your enthusiasm, feedback, and contributions are what we seek. 🌟 Keep being awesome!