ango committed
Commit 20894e3
Parent: 99a5a4f

update Wrong Hit & Wrong Value part

assets/content.py CHANGED
@@ -17,6 +17,10 @@ and some questions will be eliminated at the end of the season.
 
 Read more details on the "About" page!
 """
+QUESTION_TEXT = r"""
+For details about Wrong Hit & Wrong Value, please see the "About" page
+"""
+
 KEYPOINT_TEXT = """
 Because a single question may contain more than one keypoint, the total keypoint count is higher than the question count
 """
@@ -52,7 +56,7 @@ category_result: This file provides statistical data at the Keypoint level.
 difficulty_result: This file includes statistical data categorized by difficulty level.
 """
 SUBMIT_TEXT = """
-You can raise a PR in this Space to submit your result
+Need More Resources
 """
 
 ABOUT_HTML = """
@@ -119,6 +123,31 @@ ABOUT_HTML = """
   </p>
 </div>
 
+<h1 id="space-title">Wrong Hit & Wrong Value</h1>
+<p>There are two special attributes in ANGO:</p>
+
+<ul>
+  <li>
+    <strong>Human Acc:</strong> The accuracy of humans on this question.
+  </li>
+  <li>
+    <strong>Most Wrong:</strong> The option that humans are most prone to pick incorrectly.
+  </li>
+</ul>
+
+<p>Based on these two attributes, we derive two new evaluation metrics:</p>
+
+<ul>
+  <li>
+    <strong>Wrong Hit:</strong> The number of times the model's incorrect predictions match the option that humans most often get wrong.
+  </li>
+  <li>
+    <strong>Wrong Value:</strong> One minus the average human accuracy over the questions counted in Wrong Hit.
+  </li>
+</ul>
+
+<p>Wrong Hit and Wrong Value do not measure the model's ability to solve problems perfectly; rather, they indicate, to some extent, how similar the model is to real humans. Because of intentional misdirection or design flaws in some questions, humans often show widespread error tendencies. When the model's wrong answer matches such a widespread human error, it suggests that the model's way of thinking is closer to that of the majority of ordinary humans.</p>
+
 <h1 id="space-title">Evaluation (Not Implemented Yet)</h1>
 <p>To mitigate the impact of data leakage during model pretraining on benchmark evaluations, we have employed multiple evaluation tricks to enhance the fairness and real-time performance of the benchmarks.</p>
 
@@ -127,7 +156,29 @@ ABOUT_HTML = """
 
 <h4>Season For Dynamic Evaluation</h4>
 <p>Thanks to sampling strategies optimized for ANGO, we can periodically sample the test set and update the leaderboard. This prevents certain institutions or individuals from maliciously hacking ANGO to inflate a model's performance. However, due to the limited number of questions in some key areas, dynamic iteration may not be feasible for all questions.</p>
+<p>There are two special attributes in ANGO:</p>
+
+<ul>
+  <li>
+    <strong>Human Acc:</strong> The accuracy of humans on this question.
+  </li>
+  <li>
+    <strong>Most Wrong:</strong> The option that humans are most prone to pick incorrectly.
+  </li>
+</ul>
+
+<p>Based on these two attributes, we derive two new evaluation metrics:</p>
+
+<ul>
+  <li>
+    <strong>Wrong Hit:</strong> The number of times the model's incorrect predictions match the option that humans most often get wrong.
+  </li>
+  <li>
+    <strong>Wrong Value:</strong> One minus the average human accuracy over the questions counted in Wrong Hit.
+  </li>
+</ul>
 
+<p>Wrong Hit and Wrong Value do not measure the model's ability to solve problems perfectly; rather, they indicate, to some extent, how similar the model is to real humans. Because of intentional misdirection or design flaws in some questions, humans often show widespread error tendencies. When the model's wrong answer matches such a widespread human error, it suggests that the model's way of thinking is closer to that of the majority of ordinary humans.</p>
 <h4>Question Elimination Mechanism</h4>
 <p>In addition to the aforementioned dynamic updating of seasons, a new question elimination mechanism has been proposed. For each iteration, this mechanism calculates the average accuracy of every question across all models; questions whose accuracy exceeds a threshold are temporarily removed so that the remaining questions stay reliably discriminative.</p>
 """
components/result.py CHANGED
@@ -5,7 +5,7 @@ import gradio as gr
 import pandas as pd
 
 from assets.constant import DELIMITER
-from assets.content import KEYPOINT_TEXT
+from assets.content import KEYPOINT_TEXT, QUESTION_TEXT
 from assets.path import SEASON
 
 
@@ -49,6 +49,7 @@ def build_difficulty(season):
 
 def create_result(top_components):
     with gr.Tab("Question Level"):
+        gr.Markdown(QUESTION_TEXT)
         question_df = gr.DataFrame(build_question("latest"), label="Acc Result")
     with gr.Tab("Keypoint Level"):
         gr.Markdown(KEYPOINT_TEXT)
components/submit.py CHANGED
@@ -12,4 +12,4 @@ def create_submit():
                               label="Test Set", scale=1)
        script_box = gr.Markdown(value=TEST_SCRIPT_TEXT, scale=4)
        script_button = gr.File(value=os.path.join("assets/evaluation.py"), label="Test Script", scale=1)
-    gr.Markdown(SUBMIT_TEXT)
+    gr.Markdown(SUBMIT_TEXT)