
Feature Request: Resume Interrupted Evaluations #2533

Closed
minimi-kei opened this issue Dec 4, 2024 · 1 comment · Fixed by #2534

@minimi-kei

Description:
Currently, the lm-eval-harness does not support resuming an evaluation from where it was interrupted. This can be particularly challenging for large-scale evaluations or when running on systems prone to interruptions (e.g., preemptible cloud instances). Implementing a feature to resume evaluations would greatly enhance usability and efficiency.

Proposed Feature:
Introduce a mechanism to resume evaluations from the last successfully completed task or batch. This could involve:

  1. Periodically saving the evaluation state (e.g., tasks completed, intermediate results).
  2. Allowing users to specify a "resume" flag or load the state automatically.
  3. Ensuring compatibility with multi-task evaluations and avoiding duplicate computations.

Benefits:

  • Saves computational resources by avoiding re-running completed evaluations.
  • Improves user experience for long-running evaluation processes.
  • Facilitates use in unstable or resource-constrained environments.

Implementation Suggestions:

  • Use a checkpointing mechanism to save progress in a structured file (e.g., JSON or pickle).
  • Add a command-line argument, such as --resume or --checkpoint-path.
  • Log details of the resumption process for transparency.

Use Case Example:

  1. A user runs a large evaluation task across multiple datasets, but the process is interrupted by a system crash.
  2. Instead of restarting the entire evaluation, the user runs the command again with the --resume flag.
  3. The evaluation seamlessly picks up from where it left off, saving time and resources (see the sketch below).
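
A minimal sketch of that workflow, assuming the hypothetical --resume and --checkpoint-path flags proposed above were added (the model and task arguments are illustrative placeholders, not part of the proposal):

```bash
# First run: interrupted partway through (e.g., preemption or crash).
# --checkpoint-path is a proposed flag, not an existing one.
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-1.4b \
    --tasks mmlu,hellaswag,arc_challenge \
    --batch_size 8 \
    --checkpoint-path ./eval_checkpoint

# Second run: the same command plus the proposed --resume flag; tasks and
# batches already recorded in ./eval_checkpoint would be skipped.
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-1.4b \
    --tasks mmlu,hellaswag,arc_challenge \
    --batch_size 8 \
    --checkpoint-path ./eval_checkpoint \
    --resume
```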

Additional Context:
This feature is commonly available in benchmarking tools and would align lm-eval-harness with user expectations for robust, large-scale evaluations.

@baberabb
Contributor

baberabb commented Dec 4, 2024

Hi! We do actually have this implemented: --use_cache <DIR> caches the model's results while evaluating and skips previously evaluated samples on resumption. Caching is rank-dependent though, so if a run is interrupted, restart it with the same GPU count! There is also --cache_requests, which saves the dataset preprocessing steps so evaluation can resume more quickly.

I should update the README to make these more prominent!
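
For reference, a resumable run using those existing flags might look roughly like this (--use_cache and --cache_requests are the flags mentioned above; the model and task arguments are illustrative placeholders):

```bash
# Cache model responses under ./eval_cache and cache preprocessed requests.
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-1.4b \
    --tasks hellaswag,arc_easy \
    --batch_size 8 \
    --use_cache ./eval_cache \
    --cache_requests true

# If interrupted, re-run the same command with the same GPU count;
# samples already present in the cache are skipped on the second pass.
```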
