Add Google Batch NOT_FOUND error management #5690

Open

wants to merge 24 commits into base: master
Conversation

jorgee
Contributor

@jorgee jorgee commented Jan 21, 2025

This PR includes a possible fix for the NotFoundException returned by the Google Batch API when retrieving the status of some tasks.

When client.getTaskStatus throws a NotFoundException, it is caught and handled in the following way (a rough sketch is shown after this list):

  • If the list retrieved from client.listTasks contains several tasks, it tries to find the task in that list and checks its status.
  • Otherwise, or if the task is not found in the list, it retrieves the job status.
  • If neither the task status nor the job status can be obtained, it returns PENDING.
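
A minimal sketch of this fallback logic, assuming hypothetical method names (client.getTaskStatus, client.listTasks, client.getJobStatus) and plain string states rather than the actual client types:

import com.google.api.gax.rpc.NotFoundException

// Sketch only: names, signatures and state values are assumptions based on the description above
protected String taskState0(String jobId, String taskId) {
    try {
        return client.getTaskStatus(jobId, taskId)
    }
    catch( NotFoundException e ) {
        // 1. try to locate the task in the job's task list
        final tasks = client.listTasks(jobId)
        final found = tasks?.find { it.name?.endsWith(taskId) }
        if( found != null )
            return found.state
        // 2. otherwise fall back to the job status
        final jobState = client.getJobStatus(jobId)
        if( jobState != null )
            return jobState
        // 3. neither a task nor a job status is available -- report the task as pending
        return 'PENDING'
    }
}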

A unit test producing the NotFoundException is included to validate the logic.

Two corner cases could not be handled correctly:

  1. When neither a task status nor a job status is found, PENDING is returned. I think this could happen in the initial stage, while Google Batch is creating the job and its tasks, and I am assuming that a task or job status will be received at some point during the execution. This is the same handling we applied when no tasks were found in the job.
  2. When a task is not found in Google Batch, it belongs to a task array job, and the job status is RUNNING or FAILED, we could get an incorrect task status. This is not very important in the case of RUNNING, but with FAILED we could mark a task as failed while its real state is unknown. According to the API documentation, the RUNNING or FAILED job states mean that at least one of the tasks is in that state. A job can be FAILED when there has been a job-level failure (such as the invalid type one we saw in Google Batch run hangs when a job fail to start #5550). So, I am assuming that when the task status is not found, the task is not in the list, and the job is FAILED, it is due to a job failure and the task is also FAILED (see the sketch after this list). Other alternatives I have considered are: setting the task to PENDING, but I think we could get a deadlock if it is a job failure; or throwing an exception to abort the execution. Other suggestions are welcome.
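
For the second corner case, the assumed mapping from job state to task state when the task itself cannot be found could be summarised roughly like this (again a sketch, not the actual implementation):

// Sketch of the assumption for corner case 2: when an array job task is not found
// and not listed, derive the task state from the job state
protected String inferTaskStateFromJob(String jobState) {
    switch( jobState ) {
        case 'FAILED':
            // assume a job-level failure (as in #5550), so consider the missing task failed as well
            return 'FAILED'
        case 'RUNNING':
            // at least one task of the job is running; the real state of this task is unknown
            return 'RUNNING'
        default:
            return jobState
    }
}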

@jorgee jorgee linked an issue Jan 21, 2025 that may be closed by this pull request

netlify bot commented Jan 21, 2025

Deploy Preview for nextflow-docs-staging canceled.

🔨 Latest commit: ccbdca2
🔍 Latest deploy log: https://app.netlify.com/sites/nextflow-docs-staging/deploys/67abd5c24f619f0008e6548a

@pditommaso
Member

Should @ejseqera run a stress test on this?

@ejseqera

I'm on it


@pditommaso
Member

@ejseqera any update on this? Let us know if you need assistance with this branch.

@ejseqera

ejseqera commented Jan 27, 2025

I've attempted a large nf-core/sarek stress test run (~1390 jobs) and ran into several igenomes download issues and, more importantly, several java.lang.NullPointerException: Cannot invoke "com.google.common.hash.HashCode.asBytes()" because "hash" is null errors in the TaskProcessor, suggesting there might be an underlying issue with the task submission/retry logic. See attached log.

Will update once I have results from the rerun with local reference data but this seems unrelated to the reference data staging errors.

@jorgee
Contributor Author

jorgee commented Jan 27, 2025

@ejseqera It is an unrelated issue: the staging of the file is failing and, on the retry, the hash is null. I am looking into why the hash is null in this case.

@bentsherman
Member

@jorgee the old implementation of checking job status: #3892

@jorgee
Contributor Author

jorgee commented Jan 29, 2025

I have updated the branch with the following changes:
I added a flag to indicate whether a task belongs to an array. When a task belongs to an array, the current way of checking the status is used, with a fallback to the job status if it fails. For single tasks, the old job-status check is used (see the sketch below).
In PR #5723, I have included an alternative for managing the status of tasks in arrays that reduces the number of API calls and does not require specific handling of the NotFoundException.
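
Roughly, the selection between the two checks could look like this (a sketch assuming the isChild flag added in the diff below and the hypothetical helper taskState0 from the sketch above):

// Sketch: pick the status-check strategy depending on whether the task
// was submitted as part of a Google Batch task array
protected String checkTaskState(TaskRun task, String jobId, String taskId) {
    if( task.isChild ) {
        // array child task: per-task status check, with a fallback to the job status
        return taskState0(jobId, taskId)
    }
    // single task: keep the previous job-status based check
    return client.getJobStatus(jobId)
}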

@ejseqera

ejseqera commented Jan 30, 2025

@jorgee Should I go ahead and run another test with this updated branch now? Is it worth also testing #5723?

@jorgee
Contributor Author

jorgee commented Jan 30, 2025

> @jorgee Should I go ahead and run another test with this updated branch now? Is it worth also testing #5723?

Yes, please check with this updated branch. I have also included a change to fix the null hash caused by the read timeout. #5723 only affects runs using task arrays.

@@ -91,11 +91,13 @@ class TaskArrayCollector {
// submit task directly if the collector is closed
// or if the task is retried (since it might have dynamic resources)
if( closed || task.config.getAttempt() > 1 ) {
task.isChild = false
Member


This is implicit because boolean is false by default

Contributor Author


I have explicitly set it to false to ensure the task is not treated as part of an array when it is retried.

@jorgee
Contributor Author

jorgee commented Feb 6, 2025

> This might be addressable using existing Nextflow configuration settings google.httpReadTimeout and google.httpConnectTimeout so I can retry but let me know if you have any further insights. Log is attached.
> KwJ0dUQ6dCkN7.log

@ejseqera It was a temporary domain name resolution problem ("UnknownHostException: storage.googleapis.com"). The read timeout or connect timeout settings are not going to fix it. Maybe increase the retry config values google.storage.maxDelay and google.storage.maxAttempts, for example as sketched below.
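
For example, something along these lines in the Nextflow configuration (the values are illustrative only, not a recommendation):

// Illustrative values only: raise the Google Storage retry settings mentioned above
// to better tolerate transient DNS/network failures
google {
    storage {
        maxAttempts = 10      // allow more retry attempts
        maxDelay    = '90s'   // allow a longer maximum back-off delay
    }
}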

@ejseqera

ejseqera commented Feb 10, 2025

I ran another large-scale test (57gB7BLSjrLuXj, ~1200 successful tasks) and the original NotFoundException issue appears to be resolved with this implementation.

During testing, I did encounter some other exceptions, but these are separate from the NotFoundException issue this PR addresses:

  1. DeadlineExceededException from the Batch API after ~1200 successful tasks
  • The error occurred when the gRPC call to batch.googleapis.com exceeded its 60-second deadline. This happened despite having configured a retry policy:
      retryPolicy {
         maxDelay = '60s'
         maxAttempts = 15
         jitter = 0.5
         delay = '5s'
      }
  • This can be mitigated with alternative retry policy configuration
  • Not related to the original task status resolution problem
  2. RejectedExecutionException during pipeline shutdown
  • Related to thread pool management during termination
  • Again, separate from the task status handling

These new findings could be addressed separately through configuration or future PRs if needed, but I don't think they impact the effectiveness of this PR's solution for the NotFoundException issue. What do you think? @pditommaso @jorgee

Logs for two separate runs in different regions attached.
57gB7BLSjrLuXj.log
20EJcGulEDB2GF.log

@pditommaso pditommaso force-pushed the master branch 2 times, most recently from 5a93547 to 27345a6 Compare February 10, 2025 21:46
pditommaso and others added 8 commits February 10, 2025 22:48
…BatchClient.groovy [ci skip]

Signed-off-by: Paolo Di Tommaso <[email protected]>
Alternative for managing task array status in Google Batch

Signed-off-by: Ben Sherman <[email protected]>
Co-authored-by: Chris Hakkaart <[email protected]>
Co-authored-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
@pditommaso pditommaso marked this pull request as ready for review February 10, 2025 22:48
@pditommaso pditommaso requested a review from a team as a code owner February 10, 2025 22:48
@pditommaso
Member

@ejseqera it looks like the RejectedExecutionException error is always a side effect of the DeadlineExceededException. I'll open a separate issue for the first one.

@@ -24,5 +24,5 @@ import groovy.transform.InheritConstructors
* @author Paolo Di Tommaso <[email protected]>
*/
@InheritConstructors
class ProcessStageException extends ProcessException implements ShowOnlyExceptionMessage {
class ProcessStageException extends ProcessUnrecoverableException implements ShowOnlyExceptionMessage {
Member


What's the rationale for changing this to "unrecoverable"?

Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>

Successfully merging this pull request may close these issues: NOT_FOUND error on google-batch
6 participants