Commit 1138892
Yotam Perlitz committed
Parent(s): acd921a

update images location

Files changed:
- .gitignore +0 -2
- app.py +3 -3
.gitignore CHANGED
@@ -1,6 +1,4 @@
 .vscode/launch.json
 .vscode/settings.json
 .DS_Store
-# assets/ablations.png
-# assets/motivation.png
 images/*
app.py CHANGED
@@ -280,7 +280,7 @@ st.markdown(
 )
 
 st.image(
-    "assets/motivation.png",
+    "images/motivation.png",
     caption="Conclusions depend on the models considered. Kendall-tau correlations between the LMSys Arena benchmark and three other benchmarks: BBH, MMLU, and Alpaca v2. Each group of bars represents the correlation for different sets of top models, specifically the top 5, top 10, and top 15 (overlapping) models (according to the Arena). The results indicate that the degree of agreement between benchmarks varies with the number of top models considered, highlighting that different selections of models can lead to varying conclusions about benchmark agreement.",
     use_column_width=True,
 )
@@ -297,7 +297,7 @@ st.markdown(
 )
 
 st.image(
-    "assets/pointplot_granularity_matters.png",
+    "images/pointplot_granularity_matters.png",
     caption="Correlations increase with number of models. Mean correlation (y) between each benchmark (lines) and the rest, given different numbers of models. The blue and orange lines are the average of all benchmark pair correlations with models sampled randomly (orange) or in contiguous sets (blue). The shaded lines represent adjacent sampling for the different benchmarks.",
     use_column_width=True,
 )
@@ -316,7 +316,7 @@ st.markdown(
 
 
 st.image(
-    "assets/ablations.png",
+    "images/ablations.png",
     caption="Our recommendations substantially reduce the variance of BAT. Ablation analysis for each BAT recommendation separately and their combinations.",
     use_column_width=True,
 )
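For context, each hunk above changes only the first positional argument of Streamlit's st.image call, i.e. the path the figure is loaded from, to match the new images/ directory that .gitignore now covers. A minimal, self-contained sketch of that call pattern follows; the shortened caption and the existence check are illustrative additions, not part of app.py:

# Minimal sketch of the st.image pattern this commit touches: render a
# static PNG with a caption, loading it from the new images/ location.
from pathlib import Path

import streamlit as st

figure = Path("images/motivation.png")  # moved here from the old assets/ location
if figure.exists():
    st.image(
        str(figure),
        caption="Conclusions depend on the models considered.",
        use_column_width=True,  # scale the figure to the column width
    )
else:
    # Guard is hypothetical; app.py assumes the file is present.
    st.warning(f"Missing figure: {figure}")

Note that newer Streamlit releases deprecate use_column_width in favor of use_container_width, which is worth keeping in mind if the app's Streamlit version is ever upgraded.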