
Feature Request: Resume Interrupted Evaluations #2533

Closed
minimi-kei opened this issue Dec 4, 2024 · 1 comment · Fixed by #2534

@minimi-kei

Description:
Currently, the lm-eval-harness does not support resuming an evaluation from where it was interrupted. This can be particularly challenging for large-scale evaluations or when running on systems prone to interruptions (e.g., preemptible cloud instances). Implementing a feature to resume evaluations would greatly enhance usability and efficiency.

Proposed Feature:
Introduce a mechanism to resume evaluations from the last successfully completed task or batch. This could involve:

  1. Periodically saving the evaluation state (e.g., tasks completed, intermediate results).
  2. Allowing users to specify a "resume" flag or load the state automatically.
  3. Ensuring compatibility with multi-task evaluations and avoiding duplicate computations.

Benefits:

  • Saves computational resources by avoiding re-running completed evaluations.
  • Improves user experience for long-running evaluation processes.
  • Facilitates use in unstable or resource-constrained environments.

Implementation Suggestions:

  • Use a checkpointing mechanism to save progress in a structured file (e.g., JSON or pickle).
  • Add a command-line argument, such as --resume or --checkpoint-path.
  • Log details of the resumption process for transparency.

Use Case Example:

  1. A user runs a large evaluation task across multiple datasets, but the process is interrupted by a system crash.
  2. Instead of restarting the entire evaluation, the user runs the command again with the --resume flag.
  3. The evaluation seamlessly picks up from where it left off, saving time and resources (see the sketch below).
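
A minimal sketch of that workflow, assuming the hypothetical --resume and --checkpoint-path flags proposed above were added (the model and task arguments are illustrative placeholders, not part of the proposal):

```bash
# First run: interrupted partway through (e.g., preemption or crash).
# --checkpoint-path is a proposed flag, not an existing one.
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-1.4b \
    --tasks mmlu,hellaswag,arc_challenge \
    --batch_size 8 \
    --checkpoint-path ./eval_checkpoint

# Second run: the same command plus the proposed --resume flag; tasks and
# batches already recorded in ./eval_checkpoint would be skipped.
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-1.4b \
    --tasks mmlu,hellaswag,arc_challenge \
    --batch_size 8 \
    --checkpoint-path ./eval_checkpoint \
    --resume
```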

Additional Context:
This feature is commonly available in benchmarking tools and would align lm-eval-harness with user expectations for robust, large-scale evaluations.

@baberabb
Contributor

baberabb commented Dec 4, 2024

Hi! We do actually have this implemented: --use_cache <DIR> caches the model's results while evaluating and skips previously evaluated samples on resumption. Caching is rank-dependent though, so if a run is interrupted, restart it with the same GPU count! There is also --cache_requests, which saves the dataset preprocessing steps so evaluation can resume more quickly.

I should update the README to make these more prominent!
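
For reference, a resumable run using those existing flags might look roughly like this (--use_cache and --cache_requests are the flags mentioned above; the model and task arguments are illustrative placeholders):

```bash
# Cache model responses under ./eval_cache and cache preprocessed requests.
lm_eval --model hf \
    --model_args pretrained=EleutherAI/pythia-1.4b \
    --tasks hellaswag,arc_easy \
    --batch_size 8 \
    --use_cache ./eval_cache \
    --cache_requests true

# If interrupted, re-run the same command with the same GPU count;
# samples already present in the cache are skipped on the second pass.
```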
