Description:
Currently, the lm-eval-harness does not support resuming an evaluation from where it was interrupted. This can be particularly challenging for large-scale evaluations or when running on systems prone to interruptions (e.g., preemptible cloud instances). Implementing a feature to resume evaluations would greatly enhance usability and efficiency.
Proposed Feature:
Introduce a mechanism to resume evaluations from the last successfully completed task or batch. This could involve:
- Periodically saving the evaluation state (e.g., tasks completed, intermediate results).
- Allowing users to specify a "resume" flag or load the saved state automatically.
- Ensuring compatibility with multi-task evaluations and avoiding duplicate computation.
Benefits:
- Saves computational resources by avoiding re-running completed evaluations.
- Improves the user experience for long-running evaluation processes.
- Facilitates use in unstable or limited-resource environments.
Implementation Suggestions:
- Use a checkpointing mechanism to save progress to a structured file (e.g., JSON or pickle); a rough sketch is given after this list.
- Add a command-line argument such as --resume or --checkpoint-path.
- Log details of the resumption process for transparency.
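To make the suggestion concrete, here is a minimal sketch of what task-level checkpointing could look like. This is purely illustrative: names such as `run_single_task` and `eval_checkpoint.json` are hypothetical placeholders and not part of the lm-eval-harness API.

```python
import json
from pathlib import Path

# Hypothetical sketch only -- CHECKPOINT_PATH and run_single_task are
# illustrative placeholders, not existing lm-eval-harness APIs.
CHECKPOINT_PATH = Path("eval_checkpoint.json")

def load_checkpoint() -> dict:
    """Return previously saved progress, or an empty state if none exists."""
    if CHECKPOINT_PATH.exists():
        return json.loads(CHECKPOINT_PATH.read_text())
    return {"completed_tasks": [], "results": {}}

def save_checkpoint(state: dict) -> None:
    """Persist progress after each task so an interrupted run can resume."""
    CHECKPOINT_PATH.write_text(json.dumps(state, indent=2))

def run_with_resume(tasks, run_single_task):
    """Run each task, skipping any that a previous run already completed."""
    state = load_checkpoint()
    for task in tasks:
        if task in state["completed_tasks"]:
            print(f"Skipping already-completed task: {task}")
            continue
        state["results"][task] = run_single_task(task)
        state["completed_tasks"].append(task)
        save_checkpoint(state)
    return state["results"]
```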
Use Case Example:
A user runs a large evaluation task across multiple datasets but the process is interrupted due to a system crash.
Instead of restarting the entire evaluation, the user runs the command again with the --resume flag.
The evaluation seamlessly picks up from where it left off, saving time and resources.
Additional Context:
This feature is commonly available in benchmarking tools and would align lm-eval-harness with user expectations for robust, large-scale evaluations.
Hi! We do actually have this implemented: use --use_cache <DIR> to cache model results while evaluating and skip previously evaluated samples on resumption. Caching is rank-dependent, though, so if the run is interrupted, restart with the same GPU count. There is also --cache_requests, which saves the dataset preprocessing steps so evaluation can resume more quickly.
I should update the README to make these more prominent!
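For reference, a resumable run might look something like the following. The --use_cache and --cache_requests flags are the ones mentioned in the reply above; the model and task arguments are placeholders to be replaced with your own:

```
lm_eval --model hf \
  --model_args pretrained=<your-model> \
  --tasks <your-tasks> \
  --use_cache /path/to/cache \
  --cache_requests true
```

Re-running the same command (with the same GPU count) after an interruption should then skip the samples already recorded in the cache.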