Google Batch should comply with --Error is ignored flag #3166

Closed
zyosufzai opened this issue Aug 31, 2022 · 9 comments
Comments

@zyosufzai

Bug report

Google Batch stops running even though Nextflow has been told to ignore task errors (the log reports "-- Error is ignored" for the failed task).

Expected behavior and actual behavior

Google Batch should continue running when one of the jobs fails, provided it has been told to ignore the error. What actually happens is that the workflow terminates with an error message (see below).

Steps to reproduce the problem

Input the following on the command line:

NXF_VER="22.08.1-edge" ./nextflow run nf-core/methylseq -r 1.6.1 -c nextflow.config -profile test,gbatch

with the following config file:

gbatch{ 
      process.executor = 'google-batch' 
      process.machineType = 'n2-standard-16' 
      process.time = '2h' 
      workDir = 'gs://nextflowdemobucket/zy-test/testrna_gbatch_tmp' 
      google.location = 'us-central1' 
      google.region  = 'us-central1' 
      google.project = ''
      params.outdir = 'gs://nextflowdemobucket/zy-test/testrna_gbatch'
      google.batch.bootDiskSize = 100.GB
      }

Program output

[2e/9436f4] NOTE: Process preseq (SRR389222_sub1) terminated with an error exit status (139) -- Error is ignored
Error executing process > 'preseq (SRR389222_sub2)'
Caused by:
  Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@7426a203[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@6126b8c[Wrapped task = TrustedListenableFutureTask@d542ca6[status=PENDING, info=[task=[running=[NOT STARTED YET], com.google.api.gax.rpc.AttemptCallable@464ab2c1]]]]] rejected from java.util.concurrent.ScheduledThreadPoolExecutor@61fde900[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]

Environment

  • Nextflow version: 22.08.1-edge
  • Java version: 17.0.3-internal
  • Operating system: Linux
  • Bash version: 5.0.3(1)-release


@zyosufzai zyosufzai changed the title Google Batch doesn't comply with --Error is ignored flag Google Batch should comply with --Error is ignored flag Aug 31, 2022
@bentsherman
Member

Can you provide the full log? There are two different tasks, SRR389222_sub1 and SRR389222_sub2, and the first one is ignored but the second one is triggering workflow termination. Need to see what happened with the second one.

@zyosufzai
Author

zyosufzai commented Aug 31, 2022

Since the logs are broken up by tag, I have the log file for the tag [2e/9436f4] and also a traceback (attached):
zy-test_testm_gbatch_tmp_2e_9436f428d4c0b830b66ed8bc37c994_.command.log
batch-trace.txt

@zyosufzai
Author

Sorry, the first log file I added was for SRR389222_sub1. This is the second log file, for SRR389222_sub2:
zy-test_testm_gbatch_tmp_bf_62b9e1e10833042281259ba9560df5_.command.log

@pditommaso
Member

It would be great if you could upload the .nextflow.log file of the failed execution.

@zyosufzai
Author

Gotcha. I couldn't find the log file for that session, but I reran the pipeline and directed the log to a known location. It shows the same errors for the same tasks.
nextflow.log

@pditommaso
Member

That's a weird error. It looks like it is reported by a thread pool used by the Google SDK:

Aug-31 20:35:44.041 [Task monitor] ERROR nextflow.processor.TaskProcessor - Error executing process > 'preseq (SRR389222_sub2)'

Caused by:
  Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@49b3a025[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@2e0ceb8c[Wrapped task = TrustedListenableFutureTask@25c1396d[status=PENDING, info=[task=[running=[NOT STARTED YET], com.google.api.gax.rpc.AttemptCallable@2db57b9a]]]]] rejected from java.util.concurrent.ScheduledThreadPoolExecutor@aa6214[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]

java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask@49b3a025[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@2e0ceb8c[Wrapped task = TrustedListenableFutureTask@25c1396d[status=PENDING, info=[task=[running=[NOT STARTED YET], com.google.api.gax.rpc.AttemptCallable@2db57b9a]]]]] rejected from java.util.concurrent.ScheduledThreadPoolExecutor@aa6214[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
	at java.base/java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2065)
	at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:833)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:340)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:562)
	at com.google.common.util.concurrent.MoreExecutors$ScheduledListeningDecorator.schedule(MoreExecutors.java:663)
	at com.google.api.gax.retrying.ScheduledRetryingExecutor.submit(ScheduledRetryingExecutor.java:116)
	at com.google.api.gax.retrying.CallbackChainRetryingFuture$AttemptCompletionListener.handle(CallbackChainRetryingFuture.java:137)
	at com.google.api.gax.retrying.CallbackChainRetryingFuture$AttemptCompletionListener.run(CallbackChainRetryingFuture.java:117)
	at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
	at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
	at com.google.api.core.AbstractApiFuture$InternalSettableFuture.setException(AbstractApiFuture.java:94)
	at com.google.api.core.AbstractApiFuture.setException(AbstractApiFuture.java:76)
	at com.google.api.gax.grpc.GrpcExceptionCallable$ExceptionTransformingFuture.onFailure(GrpcExceptionCallable.java:97)
	at com.google.api.core.ApiFutures$1.onFailure(ApiFutures.java:67)
	at com.google.common.util.concurrent.Futures$CallbackListener.run(Futures.java:1132)
	at com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:31)
	at com.google.common.util.concurrent.AbstractFuture.executeListener(AbstractFuture.java:1270)
	at com.google.common.util.concurrent.AbstractFuture.complete(AbstractFuture.java:1038)
	at com.google.common.util.concurrent.AbstractFuture.setException(AbstractFuture.java:808)
	at io.grpc.stub.ClientCalls$GrpcFuture.setException(ClientCalls.java:572)
	at io.grpc.stub.ClientCalls$UnaryStreamToFuture.onClose(ClientCalls.java:542)
	at io.grpc.PartialForwardingClientCallListener.onClose(PartialForwardingClientCallListener.java:39)
	at io.grpc.ForwardingClientCallListener.onClose(ForwardingClientCallListener.java:23)
	at io.grpc.ForwardingClientCallListener$SimpleForwardingClientCallListener.onClose(ForwardingClientCallListener.java:40)
	at com.google.api.gax.grpc.ChannelPool$ReleasingClientCall$1.onClose(ChannelPool.java:535)
	at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:562)
	at io.grpc.internal.ClientCallImpl.access$300(ClientCallImpl.java:70)
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:743)
	at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:722)
	at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
	at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:133)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	at java.base/java.lang.Thread.run(Thread.java:833)
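
For readers unfamiliar with this exception: the trace shows the gax retry logic calling `ScheduledThreadPoolExecutor.schedule` on an executor whose state is already `Terminated`. The following is a minimal, standalone sketch of that JDK behaviour only. It is not Nextflow or Google SDK code, and the class and method names are hypothetical.

```java
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class RejectedAfterShutdown {

    // Returns true if scheduling a task on a shut-down pool is rejected.
    static boolean scheduleAfterShutdownIsRejected() {
        ScheduledThreadPoolExecutor pool = new ScheduledThreadPoolExecutor(1);
        pool.shutdown(); // from here on the pool accepts no new tasks
        Runnable noop = () -> { };
        try {
            // Stands in for a retry attempt being scheduled too late
            pool.schedule(noop, 10, TimeUnit.MILLISECONDS);
            return false;
        } catch (RejectedExecutionException e) {
            // The default AbortPolicy throws, as in the trace above
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("rejected = " + scheduleAfterShutdownIsRejected());
    }
}
```

This only illustrates why a late retry attempt fails; it does not explain why the pool was terminated in the first place.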

@zyosufzai
Author

zyosufzai commented Sep 1, 2022

So could it be that the preseq failures for SRR389222_sub1 and SRR389222_sub2 caused termination because the thread pool was not able to reuse the previously created threads to execute new requests? I'm wondering whether the "Error is ignored" behaviour isn't honoured by Google Batch because it needs its own exception rule written in the job's JSON file. Looking at the documentation linked below, I wonder if there needs to be an "ignore_exit_status" field: https://cloud.google.com/batch/docs/reference/rpc/google.cloud.batch.v1

aaronegolden added a commit to aaronegolden/nextflow that referenced this issue Nov 29, 2022
1. Adds the projectId to BatchLogging's LoggingOptions so the logs
will be fetched from the correct project. Previously I got a lot of
"that resource might not exist" responses from the log read requests.
I'm not sure if there's something special about my environment that
causes me to need the projectId in the options, but I think it
couldn't hurt to have the projectId in general.

2. Removes the old parseOutput function, which relied on the STDERR and
STDOUT prefixes in the payload of the log entries. Batch no longer adds
those prefixes so that parsing doesn't work. Now we can distinguish
stderr from stdout by looking at the logEntry's severity.

3. Catches any exception thrown by the log reading code. It could be a
permissions issue like I was seeing, or a thread pool issue as in
nextflow-io#3166. Either way,
something going wrong while trying to read task logs should probably
not stop the whole workflow.

With these changes, the nf-core/methylseq workflow can be run
using the google-batch executor.
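
Point 3 of the commit message can be sketched as follows. This is an illustration of the idea only, not the code from the commit: `SafeLogFetch` and `fetchOrNull` are hypothetical names, and the `Supplier` stands in for whatever call actually reads the task logs.

```java
import java.util.function.Supplier;

public class SafeLogFetch {

    // Hypothetical helper: runs the log reader and returns its text,
    // or null if the reader throws for any reason.
    static String fetchOrNull(Supplier<String> reader) {
        try {
            return reader.get();
        } catch (Exception e) {
            // A log-reading failure is reported but never propagated,
            // so it cannot terminate the surrounding workflow.
            System.err.println("WARN: unable to read task logs: " + e.getMessage());
            return null;
        }
    }

    public static void main(String[] args) {
        String ok = fetchOrNull(() -> "task output");
        String failed = fetchOrNull(() -> {
            throw new IllegalStateException("thread pool terminated");
        });
        System.out.println(ok + " / " + failed);
    }
}
```

The trade-off is that a task may end up with empty logs while the run continues.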
aaronegolden added a commit to aaronegolden/nextflow that referenced this issue Nov 29, 2022
aaronegolden added a commit to aaronegolden/nextflow that referenced this issue Nov 29, 2022
aaronegolden added a commit to aaronegolden/nextflow that referenced this issue Dec 2, 2022
aaronegolden added a commit to aaronegolden/nextflow that referenced this issue Dec 2, 2022
pditommaso added a commit that referenced this issue Dec 3, 2022
1. Adds the projectId to BatchLogging's LoggingOptions so the logs
will be fetched from the correct project. Previously I got a lot of
"that resource might not exist" responses from the log read requests.
I'm not sure if there's something special about my environment that
causes me to need the projectId in the options, but I think it
couldn't hurt to have the projectId in general.

2. Removes the old parseOutput function, which relied on the STDERR and
STDOUT prefixes in the payload of the log entries. Batch no longer adds
prefixes so that parsing doesn't work. Now we can distinguish stderr
from stdout by looking at the logEntry's severity.

3. Catches any exception thrown by the log reading code. It could be a
permissions issue like I was seeing, or a thread pool issue as in
#3166. Either way,
something going wrong while trying to read task logs should probably
not stop the whole workflow.

With these changes, the nf-core/methylseq workflow can be run with
the google-batch executor.

Signed-off-by: Aaron Golden <[email protected]>
Co-authored-by: Paolo Di Tommaso <[email protected]>
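
Point 2 of the commit message, telling stderr from stdout by severity rather than by a text prefix, might look roughly like the sketch below. The `Severity` enum and `Entry` class are simplified stand-ins defined here for illustration, not the Google Cloud Logging types. Treating `ERROR` entries as stderr and everything else as stdout is an assumption about how Batch tags its log entries.

```java
import java.util.List;

public class SeveritySplit {

    // Stand-in for a logging severity level (assumption, not the SDK type)
    enum Severity { DEFAULT, INFO, WARNING, ERROR }

    // Stand-in for a log entry: a severity plus a text payload
    static final class Entry {
        final Severity severity;
        final String text;
        Entry(Severity severity, String text) {
            this.severity = severity;
            this.text = text;
        }
    }

    // Returns a two-element array: [0] = stdout text, [1] = stderr text.
    static String[] split(List<Entry> entries) {
        StringBuilder stdout = new StringBuilder();
        StringBuilder stderr = new StringBuilder();
        for (Entry e : entries) {
            // Assumption: ERROR severity marks output written to stderr
            if (e.severity == Severity.ERROR) {
                stderr.append(e.text);
            } else {
                stdout.append(e.text);
            }
        }
        return new String[] { stdout.toString(), stderr.toString() };
    }

    public static void main(String[] args) {
        String[] result = split(List.of(
            new Entry(Severity.INFO, "processing sample\n"),
            new Entry(Severity.ERROR, "Segmentation fault\n")));
        System.out.print("stdout: " + result[0]);
        System.out.print("stderr: " + result[1]);
    }
}
```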
marcodelapierre pushed a commit to marcodelapierre/nextflow that referenced this issue Dec 5, 2022
Signed-off-by: Marco De La Pierre <[email protected]>
l-modolo pushed a commit to l-modolo/nextflow that referenced this issue Jan 25, 2023
@stale

stale bot commented Mar 18, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Mar 18, 2023
@pditommaso pditommaso added this to the 23.04.0 milestone Mar 18, 2023
@stale stale bot removed the stale label Mar 18, 2023
@pditommaso
Member

Closing this in favour of #3772
