bonito/remora hanging on warning (remora) #216
Comments
For additional info:
is working as expected. So this seems to be something to do with the Remora integration?
Hey @mattloose, this is a new one! From the output we can see L57 but not L61, so it's hanging somewhere in between.
Yep - we were just playing with that to see if we could work out why!
This should not be hanging like this, but I think the issue might be the format of the modified bases argument. Can you try running with just 5mC? Remora should be hitting the …
That doesn't fix it - it hangs in the same way with 5mC on its own.
Could you try running this command to confirm this is an issue with Remora: …
Running that command (and editing https://github.com/nanoporetech/remora/blob/569ded04c45fcbd6eb079ead38d543e21c21a215/src/remora/model_util.py#L296 to be 0 not 3) gives us:
and everything hangs here, requiring a kill to exit.
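For context, a hedged sketch of what that one-line edit plausibly controls, assuming the value at model_util.py#L296 is onnxruntime's log severity level (0 is verbose, 3 is errors only); the model path below is a placeholder, not the actual Remora file:

```python
# Sketch: lowering onnxruntime's log severity to get verbose diagnostics.
# Severity levels: 0 = VERBOSE, 1 = INFO, 2 = WARNING, 3 = ERROR, 4 = FATAL.
import onnxruntime as ort

opts = ort.SessionOptions()
opts.log_severity_level = 0  # 0 instead of 3: print everything, not just errors

# "model.onnx" is a placeholder; substitute the Remora model being loaded.
sess = ort.InferenceSession(
    "model.onnx", sess_options=opts, providers=["CPUExecutionProvider"]
)
```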
Bonito and Remora do not seem to pass a GPU device ID between them, so Remora never seems able to access the GPU in the current implementation? Presumably you are expecting this not to hang here?
Thanks for the help debugging @mattloose! Can you try upgrading onnxruntime (…)?

For the GPU device question: Remora models are quite lightweight and so are currently set up to run only on CPU (we had trouble internally getting ONNX to run on the GPU, so decided not to bother with support for the moment). We've not seen too large an impact on runtime with the current suite of CG-context models, though it is noticeable when using the fast models. Loading models only on CPU may have to change when we eventually move to all-context models. The optimized Guppy version of Remora will likely run on the GPU and in a more efficient framework (coming in January).

You are correct, this should not hang here. On my machine I get the below output from this command.
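As a hedged illustration of the kind of environment check involved here (a sketch assuming the standard onnxruntime Python API; the values in the comments are illustrative):

```python
# Sketch: report which onnxruntime build is installed and which execution
# providers it exposes (Remora currently runs its models on CPU only).
import onnxruntime as ort

print(ort.__version__)                 # e.g. 1.10.0 after upgrading
print(ort.get_device())                # "CPU" or "GPU" depending on the build
print(ort.get_available_providers())   # e.g. ['CPUExecutionProvider']
```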
Upgrading onnxruntime does not help. Same error - or rather, same freeze point!
To finally ensure that this is an onnxruntime issue, can you run … (from the remora repo root, for the path to work)? For me this gives …
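The exact snippet isn't reproduced above; a minimal sketch of that kind of check, assuming it opens one of the pretrained Remora ONNX models directly with onnxruntime (the relative path below is a guess, not the real location in the repo):

```python
# Sketch: bypass Remora's wrapper and open a model with onnxruntime directly,
# to see whether session creation itself is what hangs.
import onnxruntime as ort

# Hypothetical path; the actual pretrained model location in the repo differs.
model_path = "models/example_remora_model.onnx"

sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
print("session created; inputs:", [i.name for i in sess.get_inputs()])
```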
@mattloose, any chance you were able to run the last snippet I sent?
Sorry - yes, I did try, but I actually get no output from it at all. It appears to hang - I left it running but it was killed by our job submission engine. I've not had time to try again over the holidays.
Thanks! This definitely looks like an ONNX bug. When you're back, could you post the output of the following code (assuming this will be the same as your previous post, but I want to double check), and I'll raise the issue with ONNX:
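The snippet itself isn't shown above; as a hedged stand-in, a self-contained repro along these lines exercises both session creation and one inference call, which is roughly what an onnxruntime bug report needs (the model path, shapes, and dtypes below are placeholders/assumptions):

```python
# Sketch: minimal repro that creates a session and runs a single inference,
# printing enough environment detail for an onnxruntime bug report.
import numpy as np
import onnxruntime as ort

print("onnxruntime:", ort.__version__, ort.get_device())

# Placeholder path; substitute the Remora ONNX model being debugged.
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

# Build a dummy feed dict: zeros of the right shape, dynamic dims set to 1,
# integer tensors where the model asks for them, float32 otherwise.
feed = {}
for inp in sess.get_inputs():
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    dtype = np.int64 if "int64" in inp.type else np.float32
    feed[inp.name] = np.zeros(shape, dtype=dtype)

outputs = sess.run(None, feed)
print("ran one inference; first output shape:", outputs[0].shape)
```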
Hi, just jumped on and ran a quick test and I get:
Again, I have to kill the process. Matt
Any progress on this?
As noted in the onnxruntime issue, I think this has been tracked down to an onnxruntime bug. Awaiting a reply from the developers there.
I just wondered if this would be resolved by building a container, but it is not. There's been no update on the onnxruntime issue either, so I guess I will have to look at alternate hardware to test this.
The Remora release has been pushed. Please report if hanging/stalling issues persist.
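To double-check that the updated install is the one being picked up, a quick version check (assuming both packages expose `__version__`, which is typical but not verified here):

```python
# Sketch: confirm which remora and onnxruntime versions are active in the
# environment that bonito is actually using.
import onnxruntime
import remora  # assumes the package exposes __version__

print("remora:", remora.__version__)
print("onnxruntime:", onnxruntime.__version__)
```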
@marcus1487 I can confirm that this fixed the issue for me. I still get the warning, but it no longer hangs or segfaults.
Hi @marcus1487, I can similarly confirm that we are now running. Thanks.
I'm trying to run bonito on our cluster.
Running:
which results in:
nvidia-smi shows:
The code hangs at this warning and nothing further happens.
Trying alternate models doesn't get any further.
Running with -vvv for more verbose output only adds the following:
which doesn't really help much.
Has anyone got this working yet?