performance-improvement
Changes to `pyproject.toml`:
- corrected `ruff` settings to work with VSCode
Changes to `src/envs.py`:
- Replaced the f-string in the print statement with a plain string, since the message is constant.
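The `envs.py` change follows a common lint fix: an f-string with no placeholders is flagged by ruff (rule F541) and can be a plain literal. A minimal sketch, with an illustrative message rather than the repo's actual text:

```python
# An f-string with no placeholders (ruff rule F541) adds nothing;
# the message text here is illustrative, not the repo's actual string.
message_before = f"Saving results to the local directory."  # flagged by ruff
message_after = "Saving results to the local directory."    # plain string, same value

print(message_after)
```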
Changes to `src/leaderboard/read_evals.py`:
- Class `EvalResult`:
  - Changed some instance variables to include optional types or defaults.
  - Replaced the `tags` default from `None` with an empty list using `field(default_factory=list)`.
  - Refactored the `init_from_json_file` method to handle the new config structure and to use `cls` instead of `self`.
  - Extracted result processing into a new method, `extract_results`.
  - Implemented structured error handling and refined the update methods.
- Functionality:
  - Redefined how request files are selected and validated using `pathlib`, and added more structured checks.
  - Enhanced error handling across methods, with specific exceptions and logging for errors.
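The `EvalResult` changes above can be sketched as follows. Only `tags` is named in the description, so the other field names and the JSON layout are assumptions; the two ideas shown are that a mutable default needs `field(default_factory=...)` (a bare `[]` default would be shared across instances), and that a `classmethod` constructor builds instances via `cls`:

```python
import json
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EvalResult:
    # Hypothetical fields; only `tags` is named in the PR description.
    eval_name: str = ""
    model_sha: Optional[str] = None
    # Mutable defaults must go through default_factory, otherwise every
    # instance would share the same list object.
    tags: list = field(default_factory=list)

    @classmethod
    def init_from_json_file(cls, json_filepath: str) -> "EvalResult":
        # Build the instance through `cls` rather than mutating `self`.
        with open(json_filepath) as f:
            data = json.load(f)
        config = data.get("config", {})
        return cls(
            eval_name=config.get("model_name", ""),
            model_sha=config.get("model_sha"),
            tags=config.get("tags", []),
        )
```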
❗This is a first commit, I'm going to improve existing functionality in the next commits
Do we need to see the list of flagged models from src/leaderboard/filter_models.py
line 144? More no than yes, so I commented it out
Key changes for `src/leaderboard/read_evals.py`:
- Replaced the method for sorting JSON files based on the datetime embedded in their filenames. The new method tries a list of expected datetime formats to parse these strings, logs an error if none of them match, and defaults to the Unix epoch for legacy files with incorrect time formats.
- Introduced error handling during construction of the evaluation results dictionary to log missing keys specifically, improving debugging.
- Wrapped the iteration over model files with `tqdm` for a visual progress indicator during execution.
- Added handling within the logging scope for `tqdm` so that progress output and log messages don't conflict, improving the clarity of console output during execution.
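The multi-format datetime parsing described in the first bullet could look like this sketch. The candidate format strings are assumptions; the real request files may use others:

```python
import logging
from datetime import datetime

logger = logging.getLogger(__name__)

# Candidate formats are assumptions; the repo's filenames may differ.
DATETIME_FORMATS = ("%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%dT%H-%M-%S.%f")

def parse_datetime(datetime_str: str) -> datetime:
    """Try each expected format in turn; fall back to the Unix epoch
    for legacy files whose timestamps match none of them."""
    for fmt in DATETIME_FORMATS:
        try:
            return datetime.strptime(datetime_str, fmt)
        except ValueError:
            continue
    logger.error("No valid date format found for: %s", datetime_str)
    return datetime(1970, 1, 1)
```

Returning the epoch (rather than raising) keeps the sort total: legacy files simply order before anything with a valid timestamp.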
Notes
My concern is how `tqdm` will behave in an ephemeral Space; I need to check.
Is this one reviewable? :)
Aha, you can review it @clefourrier , I'd appreciate it! :3
General comments
- nice system with the exponential backoff
- cool work on the type hinting
- careful, you removed some docstrings
Specific comments
`src/leaderboard/filter_models`
Feel free to remove the "flagged models" log
`src/leaderboard/read_evals`
- please revert the change for `result_key`, as the new system with the join is considerably less clear to read/edit if needed
- truthfulqa and NaNs > could be interesting to set any NaN value to 0, no matter the eval; it will also make the code more readable (but add a comment that it's mostly for truthfulqa)
- l.79: add a comment to explain the system
- add the comments back in `extract_results`
- nice exception management in `update_with_request_file`
- `parse_datetime` could go in `utils`
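The NaN suggestion above could be sketched as follows; the task names and dict shape are illustrative:

```python
import math

def clean_results(results: dict) -> dict:
    # Set any NaN score to 0, no matter the eval. In practice this is
    # mostly for truthfulqa, which can report NaN, but applying it
    # uniformly keeps the code simpler and more readable.
    return {
        task: 0.0 if math.isnan(score) else score
        for task, score in results.items()
    }
```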
Following new commits that happened in this PR, the ephemeral Space HuggingFaceH4/open_llm_leaderboard-ci-pr-705 has been updated.
(This is an automated message.)
Don't mind the above commit, it's a WIP one
It manifested! ( @Wauplin this is so random XD)
Finished with the changes, I'm ready to merge!
LGTM!