Add Google Batch NOT_FOUND error management #5690

Open

wants to merge 24 commits into base: master
Conversation

jorgee
Contributor

@jorgee jorgee commented Jan 21, 2025

This PR includes a possible fix for the NotFoundException returned by the Google Batch API when retrieving the status of some tasks.

When client.getTaskStatus throws a NotFoundException, it is caught and handled in the following way (a rough sketch is shown after this list):

  • If the list retrieved from client.listTasks contains several tasks, it tries to find the task in that list and checks its status.
  • Otherwise, or if the task is not found in the list, it retrieves the job status.
  • If neither the task status nor the job status can be obtained, it returns PENDING.
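
A minimal sketch of this fallback logic, assuming hypothetical method names (client.getTaskStatus, client.listTasks, client.getJobStatus) and plain string states rather than the actual client types:

import com.google.api.gax.rpc.NotFoundException

// Sketch only: names, signatures and state values are assumptions based on the description above
protected String taskState0(String jobId, String taskId) {
    try {
        return client.getTaskStatus(jobId, taskId)
    }
    catch( NotFoundException e ) {
        // 1. try to locate the task in the job's task list
        final tasks = client.listTasks(jobId)
        final found = tasks?.find { it.name?.endsWith(taskId) }
        if( found != null )
            return found.state
        // 2. otherwise fall back to the job status
        final jobState = client.getJobStatus(jobId)
        if( jobState != null )
            return jobState
        // 3. neither a task nor a job status is available -- report the task as pending
        return 'PENDING'
    }
}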

A unit test producing the NotFoundException is included to validate the logic.

Two corner cases could not be handled correctly:

  1. When neither a task status nor a job status is found, PENDING is returned. I think this could happen in the initial stage, while Google Batch is creating the job and its tasks, and I am assuming that a task or job status will be received at some point during the execution. This is the same handling we applied when no tasks were found in the job.
  2. When a task is not found in Google Batch, it belongs to a task array job, and the job status is RUNNING or FAILED, we could get an incorrect task status. This is not very important in the case of RUNNING, but with FAILED we could mark a task as failed while its real state is unknown. According to the API documentation, the RUNNING or FAILED job states mean that at least one of the tasks is in that state. A job can be FAILED when there has been a job-level failure (such as the invalid type one we saw in Google Batch run hangs when a job fail to start #5550). So, I am assuming that when the task status is not found, the task is not in the list, and the job is FAILED, it is due to a job failure and the task is also FAILED (see the sketch after this list). Other alternatives I have considered are: setting the task to PENDING, but I think we could get a deadlock if it is a job failure; or throwing an exception to abort the execution. Other suggestions are welcome.
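
For the second corner case, the assumed mapping from job state to task state when the task itself cannot be found could be summarised roughly like this (again a sketch, not the actual implementation):

// Sketch of the assumption for corner case 2: when an array job task is not found
// and not listed, derive the task state from the job state
protected String inferTaskStateFromJob(String jobState) {
    switch( jobState ) {
        case 'FAILED':
            // assume a job-level failure (as in #5550), so consider the missing task failed as well
            return 'FAILED'
        case 'RUNNING':
            // at least one task of the job is running; the real state of this task is unknown
            return 'RUNNING'
        default:
            return jobState
    }
}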

@jorgee jorgee linked an issue Jan 21, 2025 that may be closed by this pull request

netlify bot commented Jan 21, 2025

Deploy Preview for nextflow-docs-staging canceled.

🔨 Latest commit: ccbdca2
🔍 Latest deploy log: https://app.netlify.com/sites/nextflow-docs-staging/deploys/67abd5c24f619f0008e6548a

@pditommaso
Member

Should @ejseqera run a stress test on this?

@ejseqera

I'm on it


@pditommaso
Member

@ejseqera any update on this? Let us know if you need assistance with this branch.

@ejseqera

ejseqera commented Jan 27, 2025

I've attempted a large nf-core/sarek stress test run (~1390 jobs) and ran into several igenomes download issues and, more importantly, several java.lang.NullPointerException: Cannot invoke "com.google.common.hash.HashCode.asBytes()" because "hash" is null errors in the TaskProcessor, suggesting there might be an underlying issue with the task submission/retry logic. See attached log.

Will update once I have results from the rerun with local reference data but this seems unrelated to the reference data staging errors.

@jorgee
Contributor Author

jorgee commented Jan 27, 2025

@ejseqera It is an unrelated issue: the staging of the file is failing and, on the retry, the hash is null. I am looking into why the hash is null in this case.

@bentsherman
Member

@jorgee the old implementation of checking job status: #3892

@jorgee
Contributor Author

jorgee commented Jan 29, 2025

I have updated the branch with the following changes:
I added a flag to indicate whether a task belongs to an array. When a task belongs to an array, the current way of checking the status is used, with a fallback to the job status if it fails. For single tasks, the old job-status check is used (see the sketch below).
In PR #5723, I have included an alternative for managing the status of tasks in arrays that reduces the number of API calls and does not require specific handling of the NotFoundException.
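
Roughly, the selection between the two checks could look like this (a sketch assuming the isChild flag added in the diff below and the hypothetical helper taskState0 from the sketch above):

// Sketch: pick the status-check strategy depending on whether the task
// was submitted as part of a Google Batch task array
protected String checkTaskState(TaskRun task, String jobId, String taskId) {
    if( task.isChild ) {
        // array child task: per-task status check, with a fallback to the job status
        return taskState0(jobId, taskId)
    }
    // single task: keep the previous job-status based check
    return client.getJobStatus(jobId)
}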

@ejseqera

ejseqera commented Jan 30, 2025

@jorgee Should I go ahead and run another test with this updated branch now? Is it worth also testing #5723?

@jorgee
Contributor Author

jorgee commented Jan 30, 2025

> @jorgee Should I go ahead and run another test with this updated branch now? Is it worth also testing #5723?

Yes, please check with this updated branch. I have also included a change to fix the null hash caused by the read timeout. #5723 only affects runs using task arrays.

@@ -91,11 +91,13 @@ class TaskArrayCollector {
// submit task directly if the collector is closed
// or if the task is retried (since it might have dynamic resources)
if( closed || task.config.getAttempt() > 1 ) {
task.isChild = false
Member


This is implicit because boolean is false by default

Contributor Author


I have explicitly set it to false to ensure the task is not treated as part of an array when it is retried.

@jorgee
Contributor Author

jorgee commented Feb 6, 2025

> This might be addressable using existing Nextflow configuration settings google.httpReadTimeout and google.httpConnectTimeout so I can retry but let me know if you have any further insights. Log is attached.
> KwJ0dUQ6dCkN7.log

@ejseqera It was a temporary domain name resolution problem ("UnknownHostException: storage.googleapis.com"). The read timeout or connect timeout settings are not going to fix it. Maybe increase the retry config values google.storage.maxDelay and google.storage.maxAttempts, for example as sketched below.
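
For example, something along these lines in the Nextflow configuration (the values are illustrative only, not a recommendation):

// Illustrative values only: raise the Google Storage retry settings mentioned above
// to better tolerate transient DNS/network failures
google {
    storage {
        maxAttempts = 10      // allow more retry attempts
        maxDelay    = '90s'   // allow a longer maximum back-off delay
    }
}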

@ejseqera

ejseqera commented Feb 10, 2025

I ran another large-scale test (57gB7BLSjrLuXj, ~1200 successful tasks) and the original NotFoundException issue appears to be resolved with this implementation.

During testing, I did encounter some other exceptions, but these are separate from the NotFoundException issue this PR addresses:

  1. DeadlineExceededException from the Batch API after ~1200 successful tasks
  • The error occurred when the gRPC call to batch.googleapis.com exceeded its 60-second deadline. This happened despite having configured a retry policy:
      retryPolicy {
         maxDelay = '60s'
         maxAttempts = 15
         jitter = 0.5
         delay = '5s'
      }
  • This can be mitigated with alternative retry policy configuration
  • Not related to the original task status resolution problem
  2. RejectedExecutionException during pipeline shutdown
  • Related to thread pool management during termination
  • Again, separate from the task status handling

These new findings could be addressed separately through configuration or future PRs if needed, but I don't think they impact the effectiveness of this PR's solution for the NotFoundException issue. What do you think? @pditommaso @jorgee

Logs for two separate runs in different regions attached.
57gB7BLSjrLuXj.log
20EJcGulEDB2GF.log

@pditommaso pditommaso force-pushed the master branch 2 times, most recently from 5a93547 to 27345a6 Compare February 10, 2025 21:46
pditommaso and others added 8 commits February 10, 2025 22:48
…BatchClient.groovy [ci skip]

Signed-off-by: Paolo Di Tommaso <[email protected]>
Alternative for managing task array status in Google Batch

Signed-off-by: Ben Sherman <[email protected]>
Co-authored-by: Chris Hakkaart <[email protected]>
Co-authored-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>
@pditommaso pditommaso marked this pull request as ready for review February 10, 2025 22:48
@pditommaso pditommaso requested a review from a team as a code owner February 10, 2025 22:48
@pditommaso
Member

@ejseqera it looks like the RejectedExecutionException error is always a side effect of the DeadlineExceededException. I'll open a separate issue for the first one.

@@ -24,5 +24,5 @@ import groovy.transform.InheritConstructors
* @author Paolo Di Tommaso <[email protected]>
*/
@InheritConstructors
class ProcessStageException extends ProcessException implements ShowOnlyExceptionMessage {
class ProcessStageException extends ProcessUnrecoverableException implements ShowOnlyExceptionMessage {
Member


What's the rationale for changing this to "unrecoverable"?

Signed-off-by: Paolo Di Tommaso <[email protected]>
Signed-off-by: Paolo Di Tommaso <[email protected]>

Successfully merging this pull request may close these issues: NOT_FOUND error on google-batch
6 participants