TITLE = """

🧊🌊ANGO Leaderboard

""" INTRODUCTION_TEXT = """ ANGO is A Novel Generation-Oriented Chinese LLM evaluation benchmark. We introduces the format of single-question multiple-keypoints dataset for the first time, which include 171 keypoints accumulated in 4 hierarchical levels and 9 difficulty categories. The data were exclusively obtained from the Administrative Proficiency Test, which serves as a significant component of the Chinese civil service examination. We will apply a seasonal system for the leaderboard, updating them every two months. The corresponding test dataset will be announced at the beginning of each season, and some questions will be eliminated at the end of the season. Read more details in "About" page! """ KEYPOINT_TEXT = """ Because single question may contains more than one keypoint, so the total number of keypoint count is higher than question count """ KEYPOINT_DISTRIBUTION = """{"data":[{"branchvalues":"total","insidetextorientation":"radial","labels":["关联词-转折","关联词-因果","关联词-对策","关联词-并列","主题词","程度词","行文脉络-总分","行文脉络-分总","行文脉络-分总分","特殊问法","实词","代词","首句特征","非首句特征","确定捆绑","确定顺序","尾句特征","开头","中间","结尾","词的辨析-词义侧重","词的辨析-固定搭配","词的辨析-感情色彩","词的辨析-程度轻重","关联关系-转折关系","关联关系-因果关系","关联关系-并列关系","对应关系-解释类对应","对应关系-重点词句对应","给完工时间型","给效率比例型","给具体单位型","工程问题-其他","非典型最值问题","构造数列","最不利构造","多集合反向构造","周期相遇问题","周期余数问题","周期问题-其他","火车过桥","平均速度","普通行程","相遇追及","流水行船","行程问题-其他","平面几何","立体几何","两集合","三集合","基础排列组合","相邻问题","不相邻问题","同素分堆问题","环形排列问题","错位排列","排列组合问题-其他","给情况求概率","给概率求概率","概率问题-其他","普通不定方程","不定方程组","主客体","大前提","方式目的","原因结果","单定义-其他句式","故事类","拆词","常规问法","搭桥","必要条件","补充论据","加强选非题","加强-其他","削弱论点","拆桥","他因削弱","削弱选非题","削弱论据","因果倒置","削弱-其他","常规翻译","集合推理","推理形式","翻译推理-其他","语义关系-近义关系","语义关系-反义关系","语义-其他","逻辑关系-全同关系","逻辑关系-并列关系","逻辑关系-交叉关系","逻辑关系-包容关系","逻辑关系-对应关系","中心理解题","细节判断题","词句理解题","标题填入题","语句排序题","语句填空题","接语选择题","实词填空","成语填空","混搭填空","词的辨析","语境分析","工程问题","最值问题","年龄问题","和差倍比问题","周期问题","数列问题","行程问题","几何问题","容斥原理问题","排列组合问题","概率问题","经济利润问题","不定方程问题","统筹规划问题","数学运算-其他","公倍数与公约数问题","单定义","多定义",
"加强题型","削弱题型","翻译推理","组合排列-材料","原因解释","语义关系","逻辑关系","拆分思维","直接找数","简单加减计算","排序类","基期计算","现期计算","基期比较","间隔基期","基期和差","现期追赶","一般增长率","混合增长率","间隔增长率","年均增长率","增长量计算","增长量比较","间隔增长量","年均增长量","现期比重","基期比重","两期比重","混合比重","基期平均数","现期平均数","平均数的增长率","平均数的增长量","两期平均数比较","基期倍数","现期倍数","比值计算","比值比较","时政","中国特色社会主义建设","宏观经济与调控政策","物理常识","化学常识","生物常识","科技理论与成就","生活常识","中国历史","世界历史","文学常识","文化常识","自然常识","国情社情","宪法","行政法","民法","刑法","劳动法","其他法律法规","民事诉讼法","经济法","阅读理解","语句表达","逻辑填空","数学运算","定义判断","逻辑判断","类比推理","文字资料","综合资料","简单计算","基期与现期","增长率","增长量","比重问题","平均数问题","倍数与比值相关","综合分析","政治常识","经济常识","科技常识","人文常识","地理国情","法律常识","未分类","言语理解与表达","数量关系","判断推理","资料分析","常识判断"],"marker":{"colors":["#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#B22222","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC6600","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#CC9900","#
228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#0077BE","#0077BE","#0077BE","#0077BE","#0077BE","#0077BE","#0077BE","#0077BE","#0077BE","#0077BE","#0077BE","#0077BE","#0077BE","#0077BE","#0077BE","#0077BE","#0077BE","#0077BE","#0077BE","#0077BE","#0077BE","#0077BE","#B22222","#B22222","#B22222","#CC6600","#CC9900","#CC9900","#CC9900","#228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#228B22","#0077BE","#0077BE","#0077BE","#0077BE","#0077BE","#0077BE","#9400D3","#B22222","#CC6600","#CC9900","#228B22","#0077BE"]},"parents":["中心理解题","中心理解题","中心理解题","中心理解题","中心理解题","中心理解题","中心理解题","中心理解题","中心理解题","中心理解题","词句理解题","词句理解题","语句排序题","语句排序题","语句排序题","语句排序题","语句排序题","语句填空题","语句填空题","语句填空题","词的辨析","词的辨析","词的辨析","词的辨析","语境分析","语境分析","语境分析","语境分析","语境分析","工程问题","工程问题","工程问题","工程问题","最值问题","最值问题","最值问题","最值问题","周期问题","周期问题","周期问题","行程问题","行程问题","行程问题","行程问题","行程问题","行程问题","几何问题","几何问题","容斥原理问题","容斥原理问题","排列组合问题","排列组合问题","排列组合问题","排列组合问题","排列组合问题","排列组合问题","排列组合问题","概率问题","概率问题","概率问题","不定方程问题","不定方程问题","单定义","单定义","单定义","单定义","单定义","单定义","单定义","多定义","加强题型","加强题型","加强题型","加强题型","加强题型","削弱题型","削弱题型","削弱题型","削弱题型","削弱题型","削弱题型","削弱题型","翻译推理","翻译推理","翻译推理","翻译推理","语义关系","语义关系","语义关系","逻辑关系","逻辑关系","逻辑关系","逻辑关系","逻辑关系","阅读理解","阅读理解","阅读理解","阅读理解","语句表达","语句表达","语句表达","逻辑填空","逻辑填空","逻辑填空","逻辑填空","逻辑填空","数学运算","数学运算","数学运算","数学运算","数学运算","数学运算","数学运算","数学运算","数学运算","数学运算","数学运算","数学运算","数学运算","数学运算","数学运算","数学运算","定义判断","定义判断","逻辑判断","逻辑判断","逻辑判断","逻辑判断","逻辑判断","类比推理","类比推理","类比推理","简单计算","简单计算","简单计算","基期与现期","基期与现期","基期与现期","基期与现期","基期与现期","基期与现期","增长率","增长率","增长率","增长率","增长量","增长量","增长量","增长量","比重问题","比重问题","比重问题","比重问题","平均数问题","平均数问题","平均数问题","平均数问题","平均数问题","倍数与比值相关","倍数与比值相关","倍数与比值相关","倍数与比值
相关","政治常识","政治常识","经济常识","科技常识","科技常识","科技常识","科技常识","科技常识","人文常识","人文常识","人文常识","人文常识","地理国情","地理国情","法律常识","法律常识","法律常识","法律常识","法律常识","法律常识","法律常识","法律常识","言语理解与表达","言语理解与表达","言语理解与表达","数量关系","判断推理","判断推理","判断推理","资料分析","资料分析","资料分析","资料分析","资料分析","资料分析","资料分析","资料分析","资料分析","资料分析","常识判断","常识判断","常识判断","常识判断","常识判断","常识判断","","","","","",""],"values":[892,340,1028,634,1029,211,649,1130,409,629,193,153,110,139,659,560,38,234,417,295,1116,3837,801,808,662,378,1371,2173,4832,162,203,149,51,339,154,111,20,80,103,32,22,38,211,322,75,14,230,183,124,157,373,51,41,29,16,18,23,304,108,36,125,126,266,433,1148,521,1300,118,209,525,582,308,598,220,8,708,226,110,155,90,81,5,708,133,325,36,210,178,117,113,761,278,873,2087,6957,2221,346,465,1506,946,750,3340,2396,2474,6562,9416,565,624,169,1063,215,216,682,413,281,551,448,565,251,163,19,63,3995,525,1716,1375,1202,708,525,505,4112,240,105,118,52,152,24,18,22,61,7,147,50,41,2,113,34,4,2,244,120,91,2,35,94,53,7,3,50,64,32,1,3751,247,433,614,362,687,627,631,737,124,916,1087,568,629,347,669,513,309,75,641,69,105,9989,3202,24188,6288,4520,5526,4857,2168,1,275,284,240,153,457,192,147,441,3999,435,2921,2866,1198,2728,15907,37379,6288,14903,4358,14147],"type":"sunburst"}],"layout":{"template":{"data":{"histogram2dcontour":[{"type":"histogram2dcontour","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"choropleth":[{"type":"choropleth","colorbar":{"outlinewidth":0,"ticks":""}}],"histogram2d":[{"type":"histogram2d","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666
,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"heatmap":[{"type":"heatmap","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"heatmapgl":[{"type":"heatmapgl","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"contourcarpet":[{"type":"contourcarpet","colorbar":{"outlinewidth":0,"ticks":""}}],"contour":[{"type":"contour","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"surface":[{"type":"surface","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"mesh3d":[{"type":"mesh3d","colorbar":{"outlinewidth":0,"ticks":""}}],"scatter":[{"fillpattern":{"fillmode":"overlay","size":10,"solidity":0.2},"type":"scatter"}],"parcoords":[{"type":"parcoords","line":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterpolargl":[{"type":"scatterpolargl","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"bar":
[{"error_x":{"color":"#2a3f5f"},"error_y":{"color":"#2a3f5f"},"marker":{"line":{"color":"#E5ECF6","width":0.5},"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"bar"}],"scattergeo":[{"type":"scattergeo","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterpolar":[{"type":"scatterpolar","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"histogram":[{"marker":{"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"histogram"}],"scattergl":[{"type":"scattergl","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatter3d":[{"type":"scatter3d","line":{"colorbar":{"outlinewidth":0,"ticks":""}},"marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scattermapbox":[{"type":"scattermapbox","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterternary":[{"type":"scatterternary","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scattercarpet":[{"type":"scattercarpet","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"carpet":[{"aaxis":{"endlinecolor":"#2a3f5f","gridcolor":"white","linecolor":"white","minorgridcolor":"white","startlinecolor":"#2a3f5f"},"baxis":{"endlinecolor":"#2a3f5f","gridcolor":"white","linecolor":"white","minorgridcolor":"white","startlinecolor":"#2a3f5f"},"type":"carpet"}],"table":[{"cells":{"fill":{"color":"#EBF0F8"},"line":{"color":"white"}},"header":{"fill":{"color":"#C8D4E3"},"line":{"color":"white"}},"type":"table"}],"barpolar":[{"marker":{"line":{"color":"#E5ECF6","width":0.5},"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"barpolar"}],"pie":[{"automargin":true,"type":"pie"}]},"layout":{"autotypenumbers":"strict","colorway":["#636efa","#EF553B","#00cc96","#ab63fa","#FFA15A","#19d3f3","#FF6692","#B6E880","#FF97FF","#FECB52"],"font":{"color":"#2a3f5f"},"hovermode":"closest","hoverlabel":{"align":"left"},"paper_bgcolor":"white","plot_bgcolor":"#E5ECF6","polar":{"bgcolor":"#E5ECF6","angularaxis":{"gridcolor":"white","linecolor":"white","ticks":""},"radialaxis":{"gridcolor":
"white","linecolor":"white","ticks":""}},"ternary":{"bgcolor":"#E5ECF6","aaxis":{"gridcolor":"white","linecolor":"white","ticks":""},"baxis":{"gridcolor":"white","linecolor":"white","ticks":""},"caxis":{"gridcolor":"white","linecolor":"white","ticks":""}},"coloraxis":{"colorbar":{"outlinewidth":0,"ticks":""}},"colorscale":{"sequential":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]],"sequentialminus":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]],"diverging":[[0,"#8e0152"],[0.1,"#c51b7d"],[0.2,"#de77ae"],[0.3,"#f1b6da"],[0.4,"#fde0ef"],[0.5,"#f7f7f7"],[0.6,"#e6f5d0"],[0.7,"#b8e186"],[0.8,"#7fbc41"],[0.9,"#4d9221"],[1,"#276419"]]},"xaxis":{"gridcolor":"white","linecolor":"white","ticks":"","title":{"standoff":15},"zerolinecolor":"white","automargin":true,"zerolinewidth":2},"yaxis":{"gridcolor":"white","linecolor":"white","ticks":"","title":{"standoff":15},"zerolinecolor":"white","automargin":true,"zerolinewidth":2},"scene":{"xaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2},"yaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2},"zaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2}},"shapedefaults":{"line":{"color":"#2a3f5f"}},"annotationdefaults":{"arrowcolor":"#2a3f5f","arrowhead":0,"arrowwidth":1},"geo":{"bgcolor":"white","
landcolor":"#E5ECF6","subunitcolor":"white","showland":true,"showlakes":true,"lakecolor":"white"},"title":{"x":0.05},"mapbox":{"style":"light"}}}}}""" DIFFICULTY_DISTRIBUTION = """{"data":[{"marker":{"color":[24,130,9283,18231,23734,10120,9546,69,12],"colorbar":{"title":{"text":"Total"}},"colorscale":[[0.0,"#440154"],[0.1111111111111111,"#482878"],[0.2222222222222222,"#3e4989"],[0.3333333333333333,"#31688e"],[0.4444444444444444,"#26828e"],[0.5555555555555556,"#1f9e89"],[0.6666666666666666,"#35b779"],[0.7777777777777778,"#6ece58"],[0.8888888888888888,"#b5de2b"],[1.0,"#fde725"]]},"x":[1,2,3,4,5,6,7,8,9],"y":[24,130,9283,18231,23734,10120,9546,69,12],"type":"bar"}],"layout":{"template":{"data":{"histogram2dcontour":[{"type":"histogram2dcontour","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"choropleth":[{"type":"choropleth","colorbar":{"outlinewidth":0,"ticks":""}}],"histogram2d":[{"type":"histogram2d","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"heatmap":[{"type":"heatmap","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"heatmapgl":[{"type":"heatmapgl","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0
.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"contourcarpet":[{"type":"contourcarpet","colorbar":{"outlinewidth":0,"ticks":""}}],"contour":[{"type":"contour","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"surface":[{"type":"surface","colorbar":{"outlinewidth":0,"ticks":""},"colorscale":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]]}],"mesh3d":[{"type":"mesh3d","colorbar":{"outlinewidth":0,"ticks":""}}],"scatter":[{"fillpattern":{"fillmode":"overlay","size":10,"solidity":0.2},"type":"scatter"}],"parcoords":[{"type":"parcoords","line":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterpolargl":[{"type":"scatterpolargl","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"bar":[{"error_x":{"color":"#2a3f5f"},"error_y":{"color":"#2a3f5f"},"marker":{"line":{"color":"#E5ECF6","width":0.5},"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"bar"}],"scattergeo":[{"type":"scattergeo","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterpolar":[{"type":"scatterpolar","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"histogram":[{"marker":{"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"histogram"}],"scattergl":[{"type":"scattergl","marker":{"colorbar":{"outlinewidth":0,"ticks"
:""}}}],"scatter3d":[{"type":"scatter3d","line":{"colorbar":{"outlinewidth":0,"ticks":""}},"marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scattermapbox":[{"type":"scattermapbox","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scatterternary":[{"type":"scatterternary","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"scattercarpet":[{"type":"scattercarpet","marker":{"colorbar":{"outlinewidth":0,"ticks":""}}}],"carpet":[{"aaxis":{"endlinecolor":"#2a3f5f","gridcolor":"white","linecolor":"white","minorgridcolor":"white","startlinecolor":"#2a3f5f"},"baxis":{"endlinecolor":"#2a3f5f","gridcolor":"white","linecolor":"white","minorgridcolor":"white","startlinecolor":"#2a3f5f"},"type":"carpet"}],"table":[{"cells":{"fill":{"color":"#EBF0F8"},"line":{"color":"white"}},"header":{"fill":{"color":"#C8D4E3"},"line":{"color":"white"}},"type":"table"}],"barpolar":[{"marker":{"line":{"color":"#E5ECF6","width":0.5},"pattern":{"fillmode":"overlay","size":10,"solidity":0.2}},"type":"barpolar"}],"pie":[{"automargin":true,"type":"pie"}]},"layout":{"autotypenumbers":"strict","colorway":["#636efa","#EF553B","#00cc96","#ab63fa","#FFA15A","#19d3f3","#FF6692","#B6E880","#FF97FF","#FECB52"],"font":{"color":"#2a3f5f"},"hovermode":"closest","hoverlabel":{"align":"left"},"paper_bgcolor":"white","plot_bgcolor":"#E5ECF6","polar":{"bgcolor":"#E5ECF6","angularaxis":{"gridcolor":"white","linecolor":"white","ticks":""},"radialaxis":{"gridcolor":"white","linecolor":"white","ticks":""}},"ternary":{"bgcolor":"#E5ECF6","aaxis":{"gridcolor":"white","linecolor":"white","ticks":""},"baxis":{"gridcolor":"white","linecolor":"white","ticks":""},"caxis":{"gridcolor":"white","linecolor":"white","ticks":""}},"coloraxis":{"colorbar":{"outlinewidth":0,"ticks":""}},"colorscale":{"sequential":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.777777
7777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]],"sequentialminus":[[0.0,"#0d0887"],[0.1111111111111111,"#46039f"],[0.2222222222222222,"#7201a8"],[0.3333333333333333,"#9c179e"],[0.4444444444444444,"#bd3786"],[0.5555555555555556,"#d8576b"],[0.6666666666666666,"#ed7953"],[0.7777777777777778,"#fb9f3a"],[0.8888888888888888,"#fdca26"],[1.0,"#f0f921"]],"diverging":[[0,"#8e0152"],[0.1,"#c51b7d"],[0.2,"#de77ae"],[0.3,"#f1b6da"],[0.4,"#fde0ef"],[0.5,"#f7f7f7"],[0.6,"#e6f5d0"],[0.7,"#b8e186"],[0.8,"#7fbc41"],[0.9,"#4d9221"],[1,"#276419"]]},"xaxis":{"gridcolor":"white","linecolor":"white","ticks":"","title":{"standoff":15},"zerolinecolor":"white","automargin":true,"zerolinewidth":2},"yaxis":{"gridcolor":"white","linecolor":"white","ticks":"","title":{"standoff":15},"zerolinecolor":"white","automargin":true,"zerolinewidth":2},"scene":{"xaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2},"yaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2},"zaxis":{"backgroundcolor":"#E5ECF6","gridcolor":"white","linecolor":"white","showbackground":true,"ticks":"","zerolinecolor":"white","gridwidth":2}},"shapedefaults":{"line":{"color":"#2a3f5f"}},"annotationdefaults":{"arrowcolor":"#2a3f5f","arrowhead":0,"arrowwidth":1},"geo":{"bgcolor":"white","landcolor":"#E5ECF6","subunitcolor":"white","showland":true,"showlakes":true,"lakecolor":"white"},"title":{"x":0.05},"mapbox":{"style":"light"}}},"yaxis":{"type":"log"}}}""" TEST_SET_TEXT = """ The test set comprises a total of 1768 records. Among these records, there are 988 distinct combinations of Keypoints, which indicates the provision of an additional few-shot examples amounting to 988 * 5. The test set encompasses all 171 Keypoint categories. For further information, please refer to the "About" page. """ TEST_SCRIPT_TEXT = """
The evaluation script requires three mandatory arguments; leave the others unchanged.

- --model_path: the location where the model parameters are saved.
- --dataset_path: the directory where the ANGO test set data is stored.
- --save_path: the path where the evaluation results will be saved.

You can modify the specific functions to adapt them to your model.

Upon completion of the evaluation, the script generates three files:

- acc_result: the predicted result for each record, along with statistics at the question level.
- category_result: statistics at the Keypoint level.
- difficulty_result: statistics grouped by difficulty level.
"""

SUBMIT_TEXT = """
Need More Resource
"""

ABOUT_HTML = """

What is ANGO

We introduce ANGO, a novel Chinese LLM benchmark dataset that aims to provide more in-depth guidance for model training and evaluation. ANGO is the first to use a single-question multiple-keypoints format, which describes each question more completely and lets the test results show a model's performance from multiple perspectives. Based on this format, we design a more detailed and refined classification system for model capabilities, the Keypoint Tree, which reflects the relationships between different keypoints. It covers 171 specific model capabilities organized in 4 hierarchical levels. With the Keypoint Tree, a model's performance at multiple levels of capability can be measured quickly and the model adjusted accordingly. ANGO also introduces two new question attributes: human accuracy and human error-prone options. Based on human accuracy, we propose a finer-grained difficulty classification than previous benchmarks. By combining the human accuracy of the question itself, the human accuracy of the keypoints involved, and the actual score of the question, all questions are divided into 9 difficulty levels, providing a quantifiable reference for evaluating models on questions of different difficulty.

In addition to the innovative data, we propose a complete set of verification processes tailored for ANGO, which can provide fairer results compared to the current leaderboards. This includes conducting multiple experiments with option shuffling to mitigate the issue of data leakage, designing test set sampling strategies that fully utilize the characteristics of ANGO, and implementing elimination mechanisms for high-accuracy questions. Based on these, we establish a dynamic updating system for the test set, resembling a seasonal system. Thanks to these methods, ANGO can continually update the test results, ensuring the fairness and effectiveness of the leaderboard. By preserving the test results from multiple seasons, it can provide researchers with an overview of the current trends in optimizing models within the community.

Data Source

The data utilized in our study were exclusively obtained from the Administrative Proficiency Test, which serves as a significant component of the Chinese civil service examination.

The Administrative Proficiency Test consists entirely of multiple-choice questions and aims to evaluate the abilities and skills necessary for practical administrative work. The test covers a wide range of knowledge areas: Expression & Comprehension, Data Analysis, Quantitative Relations, Judgement & Inference, and Common Knowledge. As a comprehensive assessment tool, it requires candidates to answer a series of questions related to administrative work within a limited timeframe. These questions may involve policy formulation, problem-solving, personnel and resource management, and the handling of emergency situations. Together, they assess candidates' analytical thinking, judgement and inference, problem-solving abilities, and language proficiency.

The Administrative Proficiency Test requires candidates to tackle complex questions within a specified timeframe, which makes it a suitable setting for assessing the language capabilities of language models. Language models typically perform well at generating and comprehending text, and this test provides concrete and intricate contexts that simulate real-world language communication and decision-making. By having language models answer these questions, we can evaluate their understanding of complex problems, their judgement and inference abilities, and the accuracy and fluency of their language.

Furthermore, the Administrative Proficiency Test offers broad coverage and diversity. It includes questions and scenarios from various administrative domains, such as government administration, social affairs, and economic development. This diversity helps in evaluating the language processing abilities of language models across different fields, giving a more comprehensive picture of their potential strengths and limitations in practical applications. It also offers valuable insights for future model improvements and applications.

ANGO's data covers all 34 provinces in China and draws on three different types of examinations, both formal and mock, conducted between 2008 and 2023.

Data Processing

In order to enhance the quality of our data, we employed a simple yet efficient preprocessing approach.

Duplicate Removal

Given that mock exams often include previous exam questions, our data contained numerous duplicates. To address this issue, we employed a straightforward strategy of removing duplicates based on the record ID obtained from the data source. As a result of this step, the size of our data was reduced to 88,799 instances.
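
The duplicate-removal step described above could be sketched roughly as follows. This is a minimal illustration rather than the actual preprocessing code: the function name `drop_duplicate_records` and the field name `id` are assumptions, since the source only states that duplicates were removed by record ID.

```python
from typing import Dict, Hashable, Iterable, List


def drop_duplicate_records(
    records: Iterable[Dict], id_field: str = "id"
) -> List[Dict]:
    # Keep the first record seen for each ID and drop later repeats.
    # `id_field` is a hypothetical name for the record ID from the data source.
    seen: set = set()
    unique: List[Dict] = []
    for record in records:
        record_id: Hashable = record[id_field]
        if record_id in seen:
            continue
        seen.add(record_id)
        unique.append(record)
    return unique
```

Keeping the first occurrence preserves the original ordering of the data, which is convenient when formal exams are loaded before the mock exams that reuse their questions.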

Image Removal

The data contained two types of images: formula pictures and other images (such as those containing graphics). Since our primary focus was on Chinese Natural Language Processing (NLP) evaluation rather than the multi-modal domain, we removed all records containing non-formula images. This resulted in the removal of 17,650 records.

Formula Replacement

As mentioned earlier, our data still contained formula pictures, and we recognized the importance of keeping formulae to ensure diversity in our data. To address this, we extracted 8,144 unique formula images from a pool of 34,062 LaTeX formulas derived from 5,574 questions. These images were then processed with a formula OCR (Optical Character Recognition) model, followed by manual verification to ensure formula accuracy. Ultimately, we obtained a clean dataset of 71,149 instances.

Data Format

Here is an example record:

Question: Forward: Backward
Material: Please select the option that best resembles the relationship between the given words or phrases in the question stem.
Type: Single Choice
Options:
A. Urge: Advise
B. Ocean: Land
C. Vibration: Quiet
D. Extend: Compress
Choice: D
Difficulty: 4
KeyPoints: Semantic Relationship - Antonym
Human Accuracy: 79.564999
Human Count: 183494
Most Wrong: C
Solution: Step 1: Determine the logical relationship between the words in the question stem. The two words in the question stem are antonyms. Step 2: Determine the logical relationship between the options. The option that has the same logical relationship as the question stem is option D. Option A is a synonym relationship, option B is a parallel relationship, and in option C, the antonym of "quiet" should be "noisy" instead of "vibration". Therefore, the correct answer is D.
Source: 2011 Jiangsu Province Civil Service Recruitment Examination 'Administrative Aptitude Test' (Category A), Question 41
Formulas: 0

Evaluation (Not Implemented Yet)

To mitigate the impact of data leakage during model pretraining on benchmark results, we adopt several evaluation techniques that keep the benchmark fair and up to date.

Option Order Shuffling

Sometimes, a model's correct answer to a specific question may not be due to mastering a certain ability or understanding the question, but rather because it has recognized patterns of token order in the training data. By shuffling the order of options in multiple-choice questions and making multiple predictions with the correct answer placed in different options, we can average the results to reduce the model's reliance on character order.
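
As a rough illustration of the idea above, the sketch below rotates the options so that the correct answer appears once under each label, then averages correctness over the variants. It is an assumption-laden sketch, not the ANGO evaluation script: `rotated_accuracy` and the `predict` callback (which receives a label-to-text mapping and returns a label) are hypothetical names, and rotation is only one possible shuffling scheme.

```python
from typing import Callable, Dict


def rotated_accuracy(
    options: Dict[str, str],
    answer: str,
    predict: Callable[[Dict[str, str]], str],
) -> float:
    # `options` maps labels such as "A".."D" to option text;
    # `answer` is the label of the correct option in the original order.
    labels = sorted(options)
    texts = [options[label] for label in labels]
    correct_text = options[answer]
    hits = 0
    for shift in range(len(labels)):
        # Rotate the option texts so the correct one lands under a new label.
        rotated = texts[shift:] + texts[:shift]
        variant = dict(zip(labels, rotated))
        picked = predict(variant)  # hypothetical model call
        if variant.get(picked) == correct_text:
            hits += 1
    return hits / len(labels)
```

A model that always answers with the same label scores only 1 / (number of options) under this scheme, while a model that actually identifies the correct option text keeps its full score.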

Seasons for Dynamic Evaluation

Thanks to sampling strategies optimized for ANGO, we can periodically sample the test set and update the leaderboard. This prevents certain institutions or individuals from maliciously hacking ANGO to inflate the model's performance. However, due to the limited number of questions in some key areas, dynamic iteration may not be feasible for all questions.

Question Elimination Mechanism

In addition to the seasonal updates described above, we propose a question elimination mechanism. In each iteration, this mechanism calculates the average accuracy of each question across all models. Questions whose accuracy exceeds a threshold are temporarily removed, so that the remaining questions in ANGO continue to discriminate reliably between models.
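
The elimination rule can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `questions_to_eliminate`, the shape of `results` (model name to per-question correctness), and the default threshold of 0.9 are all illustrative, as the source does not specify a threshold value or data layout.

```python
from typing import Dict, List


def questions_to_eliminate(
    results: Dict[str, Dict[str, bool]], threshold: float = 0.9
) -> List[str]:
    # `results` maps model name -> {question id -> answered correctly}.
    # The threshold value here is an assumed placeholder.
    correct: Dict[str, int] = {}
    attempts: Dict[str, int] = {}
    for per_question in results.values():
        for question_id, is_correct in per_question.items():
            correct[question_id] = correct.get(question_id, 0) + int(is_correct)
            attempts[question_id] = attempts.get(question_id, 0) + 1
    # A question is flagged when its mean accuracy across models exceeds the threshold.
    return sorted(
        question_id
        for question_id in correct
        if correct[question_id] / attempts[question_id] > threshold
    )
```

Flagged questions would be left out of the next season's test set; because removal is described as temporary, they can be restored once average accuracy is re-measured.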

"""