performance-improvement
Changes to `pyproject.toml`:
- corrected `ruff` settings to work with VSCode
Changes to `src/envs.py`:
- Replaced the f-string in the print statement with a plain string, since the message is constant.
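The `envs.py` change follows a common lint fix: an f-string with no placeholders is flagged by ruff (rule F541) and can be a plain literal. A minimal sketch, with an illustrative message rather than the repo's actual text:

```python
# An f-string with no placeholders (ruff rule F541) adds nothing;
# the message text here is illustrative, not the repo's actual string.
message_before = f"Saving results to the local directory."  # flagged by ruff
message_after = "Saving results to the local directory."    # plain string, same value

print(message_after)
```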
Changes to `src/leaderboard/read_evals.py`:
- Class `EvalResult`:
  - Changed some instance variables to include optional types or defaults.
  - Replaced the `tags` default from `None` with an empty list using `field(default_factory=list)`.
  - Refactored the `init_from_json_file` method to handle the new config structure and to use `cls` instead of `self`.
  - Extracted result processing into a new method, `extract_results`.
  - Implemented structured error handling and refined the update methods.
- Functionality:
  - Redefined how request files are selected and validated using `pathlib`, and added more structured checks.
  - Enhanced error handling across methods, with specific exceptions and logging for errors.
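The `EvalResult` changes above can be sketched as follows. Only `tags` is named in the description, so the other field names and the JSON layout are assumptions; the two ideas shown are that a mutable default needs `field(default_factory=...)` (a bare `[]` default would be shared across instances), and that a `classmethod` constructor builds instances via `cls`:

```python
import json
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EvalResult:
    # Hypothetical fields; only `tags` is named in the PR description.
    eval_name: str = ""
    model_sha: Optional[str] = None
    # Mutable defaults must go through default_factory, otherwise every
    # instance would share the same list object.
    tags: list = field(default_factory=list)

    @classmethod
    def init_from_json_file(cls, json_filepath: str) -> "EvalResult":
        # Build the instance through `cls` rather than mutating `self`.
        with open(json_filepath) as f:
            data = json.load(f)
        config = data.get("config", {})
        return cls(
            eval_name=config.get("model_name", ""),
            model_sha=config.get("model_sha"),
            tags=config.get("tags", []),
        )
```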
❗This is a first commit, I'm going to improve existing functionality in the next commits
Do we need to see the list of flagged models from src/leaderboard/filter_models.py
line 144? More no than yes, so I commented it out
Key changes for `src/leaderboard/read_evals.py`:
- Replaced the method for sorting JSON files based on the datetime embedded in their filenames. The new method tries a list of expected datetime formats to parse these strings, logs an error if none of them match, and defaults to the Unix epoch for legacy files with incorrect time formats.
- Introduced error handling during construction of the evaluation results dictionary to log missing keys specifically, improving debugging.
- Wrapped the iteration over model files with `tqdm` for a visual progress indicator during execution.
- Added handling within the logging scope for `tqdm` so that progress output and log messages don't conflict, improving the clarity of console output during execution.
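The multi-format datetime parsing described in the first bullet could look like this sketch. The candidate format strings are assumptions; the real request files may use others:

```python
import logging
from datetime import datetime

logger = logging.getLogger(__name__)

# Candidate formats are assumptions; the repo's filenames may differ.
DATETIME_FORMATS = ("%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%dT%H-%M-%S.%f")

def parse_datetime(datetime_str: str) -> datetime:
    """Try each expected format in turn; fall back to the Unix epoch
    for legacy files whose timestamps match none of them."""
    for fmt in DATETIME_FORMATS:
        try:
            return datetime.strptime(datetime_str, fmt)
        except ValueError:
            continue
    logger.error("No valid date format found for: %s", datetime_str)
    return datetime(1970, 1, 1)
```

Returning the epoch (rather than raising) keeps the sort total: legacy files simply order before anything with a valid timestamp.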
Notes
My concern is how `tqdm` will behave in an ephemeral Space; I need to check.
Is this one reviewable? :)
Aha, you can review it @clefourrier , I'd appreciate it! :3
General comments
- nice system with the exponential backoff
- cool work on the type hinting
- careful, you removed some docstrings
Specific comments
`src/leaderboard/filter_models`
Feel free to remove the "flagged models" log
`src/leaderboard/read_evals`
- please revert the change for `result_key`, as the new system with the join is considerably less clear to read/edit if needed
- truthfulqa and NaNs > could be interesting to set any NaN value to 0, no matter the eval; it will also make the code more readable (but add a comment that it's mostly for truthfulqa)
- l.79: add a comment to explain the system
- add the comments back in `extract_results`
- nice exception management in `update_with_request_file`
- `parse_datetime` could go in `utils`
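The NaN suggestion above could be sketched as follows; the task names and dict shape are illustrative:

```python
import math

def clean_results(results: dict) -> dict:
    # Set any NaN score to 0, no matter the eval. In practice this is
    # mostly for truthfulqa, which can report NaN, but applying it
    # uniformly keeps the code simpler and more readable.
    return {
        task: 0.0 if math.isnan(score) else score
        for task, score in results.items()
    }
```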
Following new commits that happened in this PR, the ephemeral Space HuggingFaceH4/open_llm_leaderboard-ci-pr-705 has been updated.
(This is an automated message.)
Don't mind the above commit, it's a WIP one
It manifested! ( @Wauplin this is so random XD)
Finished with the changes, I'm ready to merge!
LGTM!