ango committed
Commit: 20894e3
1 Parent(s): 99a5a4f
update Wrong Hit & Wrong Value part

Files changed:
- assets/content.py     +52 -1
- components/result.py   +2 -1
- components/submit.py   +1 -1
assets/content.py CHANGED

@@ -17,6 +17,10 @@ and some questions will be eliminated at the end of the season.
 
 Read more details in "About" page!
 """
+QUESTION_TEXT = r"""
+About Wrong Hit & Wrong Value, pls go to "About" page
+"""
+
 KEYPOINT_TEXT = """
 Because single question may contains more than one keypoint, so the total number of keypoint count is higher than question count
 """
@@ -52,7 +56,7 @@ category_result: This file provides statistical data at the Keypoint level.
 difficulty_result: This file includes statistical data categorized by difficulty level.
 """
 SUBMIT_TEXT = """
-
+Need More Resource
 """
 
 ABOUT_HTML = """
@@ -119,6 +123,31 @@ ABOUT_HTML = """
 </p>
 </div>
 
+<h1 id="space-title">Wrong Hit & Wrong Value</h1>
+<p>There are two special attributes in ANGO:</p>
+
+<ul>
+<li>
+<strong>Human Acc:</strong> Refers to the accuracy of humans in this question.
+</li>
+<li>
+<strong>Most Wrong:</strong> Represents the option that humans are prone to get wrong.
+</li>
+</ul>
+
+<p>So based on these two attributes, we have derived two new metrics for evaluation:</p>
+
+<ul>
+<li>
+<strong>Wrong Hit:</strong> Refers to the number of times the model's incorrect predictions match the options that humans are prone to get wrong.
+</li>
+<li>
+<strong>Wrong Value:</strong> Calculated by taking the average of the human accuracy for all the questions in wrong_hit and subtracting that value from 1.
+</li>
+</ul>
+
+<p>Wrong Value and Wrong Hit do not express the model's ability to perfectly solve the problem, but rather to some extent demonstrate the similarity between the model and real humans. Due to intentional guidance or design errors in the questions, humans often exhibit a tendency for widespread errors. In such cases, if the model's predicted answer is similar to the widespread human error tendency, it indicates that the model's way of thinking is closer to that of the majority of ordinary humans.</p>
+
 <h1 id="space-title">Evaluation(Not Implement Yet)</h1>
 <p>To mitigate the impact of data leakage during model pretraining on benchmark evaluations, we have employed multiple benchmark evaluation tricks to enhance fairness and real-time performance of the benchmarks.</p>
 
@@ -127,7 +156,29 @@ ABOUT_HTML = """
 
 <h4>Season For Dynamic Evaluation</h4>
 <p>Thanks to sampling strategies optimized for ANGO, we can periodically sample the test set and update the leaderboard. This prevents certain institutions or individuals from maliciously hacking ANGO to inflate the model's performance. However, due to the limited number of questions in some key areas, dynamic iteration may not be feasible for all questions.</p>
+<p>There are two special attributes in ANGO:</p>
+
+<ul>
+<li>
+<strong>Human Acc:</strong> Refers to the accuracy of humans in this question.
+</li>
+<li>
+<strong>Most Wrong:</strong> Represents the option that humans are prone to get wrong.
+</li>
+</ul>
+
+<p>So based on these two attributes, we have derived two new metrics for evaluation:</p>
+
+<ul>
+<li>
+<strong>Wrong Hit:</strong> Refers to the number of times the model's incorrect predictions match the options that humans are prone to get wrong.
+</li>
+<li>
+<strong>Wrong Value:</strong> Calculated by taking the average of the human accuracy for all the questions in wrong_hit and subtracting that value from 1.
+</li>
+</ul>
 
+<p>Wrong Value and Wrong Hit do not express the model's ability to perfectly solve the problem, but rather to some extent demonstrate the similarity between the model and real humans. Due to intentional guidance or design errors in the questions, humans often exhibit a tendency for widespread errors. In such cases, if the model's predicted answer is similar to the widespread human error tendency, it indicates that the model's way of thinking is closer to that of the majority of ordinary humans.</p>
 <h4>Question Elimination Mechanism</h4>
 <p>In addition to the aforementioned dynamic updating of season, a new question elimination mechanism has been proposed. This mechanism calculates the average accuracy of each question across all models for each iteration. Questions with accuracies exceeding a threshold are temporarily removed by ANGO to ensure reliable discrimination among questions in ANGO.</p>
 """
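Note on the "Wrong Hit" and "Wrong Value" text added above: the commit defines the metrics only in prose. As a reading aid, here is a minimal Python sketch of how the two values could be computed from per-question records; the helper wrong_hit_and_value and the record fields ("answer", "prediction", "most_wrong", "human_acc") are hypothetical and not taken from this repository.

# Illustrative sketch only, not part of this commit: field names and this helper are assumptions.
from typing import Iterable, Mapping, Tuple


def wrong_hit_and_value(records: Iterable[Mapping]) -> Tuple[int, float]:
    """Compute Wrong Hit and Wrong Value as described in the About text.

    Each record is assumed to carry:
      "answer"     - the correct option
      "prediction" - the option chosen by the model
      "most_wrong" - the option humans are most prone to get wrong
      "human_acc"  - human accuracy on the question, in [0, 1]
    """
    hit_human_accs = []
    for rec in records:
        # A wrong hit: the model answers incorrectly AND its wrong choice
        # matches the option humans most often get wrong.
        if rec["prediction"] != rec["answer"] and rec["prediction"] == rec["most_wrong"]:
            hit_human_accs.append(rec["human_acc"])

    wrong_hit = len(hit_human_accs)
    if wrong_hit == 0:
        return 0, 0.0
    # Wrong Value: 1 minus the average human accuracy over the wrong-hit questions.
    wrong_value = 1 - sum(hit_human_accs) / wrong_hit
    return wrong_hit, wrong_value

Under these assumptions, a question with human accuracy 0.3 that the model also misses in the "most wrong" direction pulls Wrong Value toward 0.7, which matches the stated intent: the metric rewards failing where most humans fail, not solving the question.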
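The "Season For Dynamic Evaluation" paragraph describes periodically re-sampling the test set, but no sampling code is part of this commit. Purely to illustrate the idea, a keypoint-stratified draw might look like the sketch below; the function name, the per_keypoint quota, and the "keypoints" field are all assumptions.

# Illustrative sketch only, not part of this commit: ANGO's real sampling strategy is not shown here.
import random
from collections import defaultdict


def sample_season(questions, per_keypoint=20, seed=None):
    """Draw one season's test set, stratified by keypoint.

    `questions` is assumed to be a list of dicts with a "keypoints" list.
    Keypoints with fewer than `per_keypoint` questions contribute everything
    they have, mirroring the note that some areas cannot be iterated dynamically.
    """
    rng = random.Random(seed)
    by_keypoint = defaultdict(list)
    for q in questions:
        for kp in q["keypoints"]:
            by_keypoint[kp].append(q)

    season, seen = [], set()
    for pool in by_keypoint.values():
        for q in rng.sample(pool, min(per_keypoint, len(pool))):
            if id(q) not in seen:  # a question may carry several keypoints
                seen.add(id(q))
                season.append(q)
    return season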
"""
|
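The "Question Elimination Mechanism" paragraph states that questions whose average accuracy across all models exceeds a threshold are temporarily removed each iteration. A minimal sketch of that filter follows; the threshold value, the data layout, and the function name are assumptions, not values from ANGO.

# Illustrative sketch only, not part of this commit: threshold and data layout are assumptions.
def eliminate_easy_questions(per_model_correct, threshold=0.95):
    """Keep only questions whose mean accuracy across models stays at or below `threshold`.

    `per_model_correct` maps a question id to a list of 0/1 outcomes, one per model.
    Returns the ids retained for the next iteration; the rest are temporarily removed.
    """
    kept = []
    for qid, outcomes in per_model_correct.items():
        avg_acc = sum(outcomes) / len(outcomes)
        if avg_acc <= threshold:
            kept.append(qid)
    return kept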
components/result.py CHANGED

@@ -5,7 +5,7 @@ import gradio as gr
 import pandas as pd
 
 from assets.constant import DELIMITER
-from assets.content import KEYPOINT_TEXT
+from assets.content import KEYPOINT_TEXT, QUESTION_TEXT
 from assets.path import SEASON
 
 
@@ -49,6 +49,7 @@ def build_difficulty(season):
 
 def create_result(top_components):
     with gr.Tab("Question Level"):
+        gr.Markdown(QUESTION_TEXT)
         question_df = gr.DataFrame(build_question("latest"), label="Acc Result")
     with gr.Tab("Keypoint Level"):
         gr.Markdown(KEYPOINT_TEXT)
components/submit.py CHANGED

@@ -12,4 +12,4 @@ def create_submit():
                          label="Test Set", scale=1)
     script_box = gr.Markdown(value=TEST_SCRIPT_TEXT, scale=4)
     script_button = gr.File(value=os.path.join("assets/evaluation.py"), label="Test Script", scale=1)
-    gr.Markdown(SUBMIT_TEXT)
+    gr.Markdown(SUBMIT_TEXT)