This repository has been archived by the owner on Jan 25, 2018. It is now read-only.

Job shows Complete even though it failed #19

Open
dlogan opened this issue Nov 24, 2015 · 12 comments

Comments

@dlogan
Member

dlogan commented Nov 24, 2015

In http://imagewebrhel6/batchprofiler/cgi-bin/ViewBatch.py?batch_id=117, run.213.19.txt initially failed because a file didn't exist. (I had tried to fix a badly named file, but the filelist wasn't updated.) In any case, the error said a file did not exist, yet the Status says "Complete". I only knew to look because it finished in 24 sec, i.e. too soon.

Subsequently, I tried to get ViewBatch to show Resubmit by clicking the Delete All button. The txt and err files were deleted, but there is no Resubmit button, i.e. the Status still says Complete, and I don't know how to resubmit the individual job run.213.19.

@dlogan
Member Author

dlogan commented Nov 24, 2015

Related to #12?

@dlogan
Member Author

dlogan commented Nov 24, 2015

In fact, there are other jobs that failed and did not produce rows in the Per_Image table, yet show as "Complete".

These have Memory Errors:

http://imagewebrhel6/batchprofiler/cgi-bin/ViewTextFile.py?batch_array_id=213&task_id=21&file_type=text
http://imagewebrhel6/batchprofiler/cgi-bin/ViewTextFile.py?batch_array_id=213&task_id=22&file_type=text

@LeeKamentsky

The current master branch (and the way it was on 9/10/15 at your checkout) exits with a status code of 0 even if there's an exception. Code that we're planning to check in has this facility in it.
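A minimal sketch of the behaviour described above: a wrapper that turns an uncaught exception into a non-zero exit code, so a scheduler or status page can tell a failed task from a completed one. The names `run_task` and `failing_task` are hypothetical and are not taken from the BatchProfiler code base.

```python
import sys
import traceback


def run_task(task):
    """Run `task` (a hypothetical callable) and return a process exit code."""
    try:
        task()
    except Exception:
        # Write the traceback to stderr so it ends up in the job's error file.
        traceback.print_exc(file=sys.stderr)
        return 1  # non-zero: the job failed
    return 0  # zero: the job really completed


def failing_task():
    # Stand-in for a pipeline run that hits a missing input file.
    raise IOError("input file does not exist")


# In a job script the return value would be handed to the shell, e.g.:
#     sys.exit(run_task(main))
```

With this pattern a missing file or a MemoryError produces exit code 1 instead of 0, which is the signal a status page would need in order to show something other than "Complete".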

I don't use the "done" file and maybe that's a mistake. I'd like to put work on BatchProfiler on the back burner for a couple of weeks, though, and maybe revisit this afterwards.

@dlogan
Member Author

dlogan commented Nov 24, 2015

But how can I resubmit the jobs that have failed? It seems impossible from any ViewBatch page since all batches (mis)report Complete. Can I submit via sudo as imageweb in any way? I looked at the job_scripts but I can't see how to do this.

@LeeKamentsky

Sorry David, for this case, how about if I mark the ones that failed as failed and then you can resubmit. I am guessing that the memory error is a problem that will reoccur. Is it possible that a large number of cells or particles are being segmented? The code is blowing up in a place that suggests that.


@dlogan
Member Author

dlogan commented Nov 24, 2015

Sure, please mark them as failed (how can I do that myself?). Just now I raised the memory_limit in the batchprofiler_2/batch database because yes, there are lots of synapse objects per image -- will that still work as in the old db scheme for resubmitting?

@LeeKamentsky

Raising the memory limit should work. To reset the status, you can do something like this:

Look at the text file names, which are in the form run.<batch_array_id>.<task_id>. Run the following select statement to get the task_status_ids to delete:

select task_status_id from run_job_status where batch_array_id = 213 and task_id in (22, 23)

Then copy the task status IDs and do

delete from task_status where task_status_id in (203818, 203825)

I just did this for 203818 to see if it worked and it did. You can do it for the other one. I'm running a script now to see if any other tasks suffered from the same problem though, so perhaps you should hold off to see if I found more.
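The select-then-delete procedure above can be sketched as a small script. This is only an illustration under stated assumptions: it uses Python's built-in `sqlite3` with an in-memory database as a stand-in for the real MySQL server, the two-table layout is invented for the demo (only the table and column names `run_job_status`, `task_status`, `task_status_id`, `batch_array_id` and `task_id` come from the statements above), and the helper names and the IDs in the demo data are hypothetical.

```python
import re
import sqlite3


def parse_run_name(name):
    """Split 'run.<batch_array_id>.<task_id>[.ext]' into two integers."""
    match = re.match(r"^run\.(\d+)\.(\d+)(?:\.\w+)?$", name)
    if match is None:
        raise ValueError("unexpected file name: %r" % name)
    return int(match.group(1)), int(match.group(2))


def reset_task_status(conn, batch_array_id, task_ids):
    """Delete the status rows for the given tasks; return the deleted IDs."""
    marks = ",".join("?" * len(task_ids))
    rows = conn.execute(
        "select task_status_id from run_job_status "
        "where batch_array_id = ? and task_id in (%s)" % marks,
        [batch_array_id] + list(task_ids),
    ).fetchall()
    status_ids = [row[0] for row in rows]
    if status_ids:
        marks = ",".join("?" * len(status_ids))
        conn.execute(
            "delete from task_status where task_status_id in (%s)" % marks,
            status_ids,
        )
        # Without a commit the delete may not be visible to other sessions.
        conn.commit()
    return status_ids


# Demo on an invented schema with made-up IDs (not the real database).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    create table task_status (task_status_id integer primary key, status text);
    create table run_job_status (
        task_status_id integer, batch_array_id integer, task_id integer);
    insert into task_status values (101, 'Done'), (102, 'Done');
    insert into run_job_status values (101, 1, 5), (102, 1, 6);
""")
batch_array_id, task_id = parse_run_name("run.1.6.txt")
deleted = reset_task_status(conn, batch_array_id, [task_id])
remaining = conn.execute("select count(*) from task_status").fetchone()[0]
print(deleted, remaining)
```

Passing the IDs as query parameters rather than pasting them into the SQL string avoids copy-and-paste mistakes when several tasks need to be reset at once.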


@LeeKamentsky

There were only the two... you can delete 203825 if you want to resubmit.

@dlogan
Member Author

dlogan commented Nov 24, 2015

Cool - I think I get it now. I will delete 203825 and resubmit, thanks!

@dlogan
Member Author

dlogan commented Nov 24, 2015

Wait - it looks to me like it is run.213.21.txt that is not done. (Also 213.19, for a different reason.) Does that make sense to you, rather than 213.23?

@LeeKamentsky

I changed run.213.21.txt's status as a test but left run.213.23.txt as "Done" so you could try out the delete. I don't know what client you're running for MySQL, but it may be that you have to commit the transaction (try typing "commit").

@dlogan
Member Author

dlogan commented Nov 24, 2015

I was just being cautious before and trying to understand the procedure. I just ran

delete from task_status where task_status_id in (203825)

successfully and without even hitting (re)submit, it now reports Running. Is that right that it submits after the delete?
