
bonito/remora hanging on warning (remora) #216

Closed · mattloose opened this issue Dec 17, 2021 · 21 comments
Labels: bug (Something isn't working)

@mattloose
I'm trying to run bonito on our cluster.

Running:

bonito basecaller [email protected] /path/to/fast5_files --modified-bases 5hmc_5mc --reference /path/to/ref/ref.mmi > out.bam

which results in:

> loading model [email protected]
> loading modified base model
> warning (remora): Remora model for basecall model version (v3.3) not found. Using default Remora model for dna_r9.4.1_e8_hac.

nvidia-smi shows:

Fri Dec 17 16:27:03 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 470.42.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:58:00.0 Off |                    0 |
| N/A   24C    P0    35W / 250W |   1575MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    119943      C   .../bonito-dev/bin/python3.6     1571MiB |
+-----------------------------------------------------------------------------+

The code hangs at this warning and nothing further happens.

Trying alternate models doesn't get any further.

Running with -vvv only provides one additional line of output:

> loading model [email protected]
> model basecaller params: {'batchsize': 512, 'chunksize': 10000, 'overlap': 500, 'quantize': None}
> loading modified base model
> warning (remora): Remora model for basecall model version (v3.3) not found. Using default Remora model for dna_r9.4.1_e8.1_hac.

which doesn't really help much.

Has anyone got this working yet?

@iiSeymour iiSeymour self-assigned this Dec 17, 2021
@iiSeymour iiSeymour added the bug Something isn't working label Dec 17, 2021
@mattloose (Author)

For additional info:

bonito basecaller [email protected] gzip_files/ -vvv  > basecalls.fastq
> loading model [email protected]
> model basecaller params: {'batchsize': 512, 'chunksize': 10000, 'overlap': 500, 'quantize': None}
> outputting unaligned fastq
> calling: 2302 reads [00:57,  3.97 reads/s]

is working as expected.

So this seems to be something to do with the Remora integration?

@iiSeymour (Member)

Hey @mattloose

This is a new one!

From the output we can see L57 but not L61 so it's hanging in load_mods_model - @marcus1487 thoughts?

@mattloose (Author)

Yep - we were just playing with that to see if we could work out why!

@marcus1487 (Contributor) commented Dec 17, 2021

This should not be hanging like this, but I think the issue might be the format of the modified bases argument. Can you try running with --modified-bases 5hmc 5mc?

Remora should be hitting the KeyError and sys.exit here. There might be a bad interface between Remora and bonito in the handling of this error.
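
For illustration, a minimal sketch (hypothetical names, not actual bonito or Remora code) of how a sys.exit() inside a library call can present as a hang when the caller traps everything:

import sys

def load_mods_model(modified_bases):
    known = {"5mc", "5hmc"}
    for mod in modified_bases:
        if mod not in known:
            # sys.exit() works by raising SystemExit
            sys.exit(f"unknown modified base: {mod}")
    return object()

try:
    model = load_mods_model(["5hmc_5mc"])  # malformed argument
except BaseException:
    # an over-broad handler (or a worker process that traps everything)
    # swallows SystemExit, and the pipeline waits on a model
    # that will never arrive
    model = None

print("still running with model =", model)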

@mattloose (Author)

That doesn't fix it - it hangs in the same way with 5mc on its own.

@marcus1487 (Contributor)

Could you try running this command to confirm this is an issue with Remora: python -c "import logging; from remora import log; from remora.model_util import load_model; log.CONSOLE.setLevel(logging.DEBUG); mods_model = load_model(pore='dna_r9.4.1_e8', basecall_model_type='hac', modified_bases=['5mc']); print(f'> {mods_model[1][\"alphabet_str\"]}')"
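
The same one-liner, expanded for readability (identical calls, no new behaviour):

import logging

from remora import log
from remora.model_util import load_model

# mirror the bonito integration: load the default 5mC CG-context model
log.CONSOLE.setLevel(logging.DEBUG)
mods_model = load_model(
    pore="dna_r9.4.1_e8",
    basecall_model_type="hac",
    modified_bases=["5mc"],
)
print(f"> {mods_model[1]['alphabet_str']}")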

@mattloose (Author)

Running that command (and editing https://github.com/nanoporetech/remora/blob/569ded04c45fcbd6eb079ead38d543e21c21a215/src/remora/model_util.py#L296 to be 0 not 3) gives us:

python -c "import logging; from remora import log; from remora.model_util import load_model; log.CONSOLE.setLevel(logging.DEBUG); mods_model = load_model(pore='dna_r9.4.1_e8', basecall_model_type='hac', modified_bases=['5mc']); print(f'> {mods_model[1][\"alphabet_str\"]}')"
[18:06:14] Basecall model version not supplied. Using default Remora model for dna_r9.4.1_e8_hac.
[18:06:14] Modified bases model type not supplied. Using default CG.
[18:06:14] Remora model version not specified. Using latest.
2021-12-17 18:06:14.213030263 [I:onnxruntime:, inference_session.cc:273 operator()] Flush-to-zero and denormal-as-zero are off
2021-12-17 18:06:14.213133972 [I:onnxruntime:, inference_session.cc:280 ConstructorCommon] Creating and using per session threadpools since use_per_session_threads_ is true

and everything hangs here requiring a kill to exit.

@mattloose (Author)

Bonito and Remora do not seem to pass a GPU device ID between them, so Remora never seems to be able to access the GPU in the current implementation?

Presumably you are expecting this not to hang here?

@marcus1487 (Contributor)

Thanks for the help debugging @mattloose ! Can you try to upgrade onnxruntime (pip install -U onnxruntime)? I've not extensively tested onnxruntime versions against remora models.

For the GPU device question: Remora models are quite lightweight, so they are currently set up to run only on the CPU (we had trouble internally getting ONNX to run on the GPU, so decided not to support it for the moment). We've not seen too large an impact on runtime with the current suite of CG-context models, though it is noticeable when using the fast models. Loading models only on the CPU may have to change when we eventually move to all-context models. The optimized Guppy version of Remora will likely run on the GPU and in a more efficient framework (coming January).

You are correct, this should not hang here. On my machine I get the output below from this command.

[12:56:25] Basecall model version not supplied. Using default Remora model for dna_r9.4.1_e8_hac.
[12:56:25] Modified bases model type not supplied. Using default CG.
[12:56:25] Remora model version not specified. Using latest.
DBG 12:56:25 : Remora model ONNX providers: ['CPUExecutionProvider'] --- MainProcess-MainThread model_util.py:284
> loaded modified base model to call (alt to C): m=5mC
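
That DBG line shows the providers the ONNX session was constructed with. A short sketch of the underlying onnxruntime calls (the model path here is a placeholder, not a real file):

import onnxruntime as ort

# providers compiled into this build of onnxruntime
print(ort.get_available_providers())

# Remora pins the CPU provider, so the session never touches the GPU
session = ort.InferenceSession(
    "modbase_model.onnx",  # placeholder path
    providers=["CPUExecutionProvider"],
)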

@mattloose (Author) commented Dec 17, 2021

Upgrading onnxruntime does not help. Same error - or rather same freeze point!

@marcus1487 (Contributor)

To confirm once and for all that this is an onnxruntime issue, can you run (from the Remora repo root, so the path resolves): python -c "import onnxruntime as ort; model = ort.InferenceSession('models/trained_models/dna_r9.4.1_e8/hac/0.0.0/5mc/CG/v0/modbase_model.onnx', providers=['CPUExecutionProvider']); print(dict(model.get_modelmeta().custom_metadata_map)['mod_long_names_0'])".

For me this gives 5mC and then exits.
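
Expanded for readability, the one-liner is just:

import onnxruntime as ort

# load the model on CPU and read the modified-base name from its metadata
model = ort.InferenceSession(
    'models/trained_models/dna_r9.4.1_e8/hac/0.0.0/5mc/CG/v0/modbase_model.onnx',
    providers=['CPUExecutionProvider'],
)
meta = dict(model.get_modelmeta().custom_metadata_map)
print(meta['mod_long_names_0'])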

@marcus1487 (Contributor)

@mattloose, any chance you were able to run the last snippet I sent?

@mattloose (Author)

Sorry - yes I did try but I actually get no output from it at all. It appears to hang - I left it running but it was killed by our job submission engine.

I've not had time to try again over the holidays.

@marcus1487 (Contributor)

Thanks! This definitely looks like an ONNX bug. When you're back, could you send the output of the following code (assuming it will be the same as your previous post, but I want to double check) and I'll raise the issue with ONNX.

import onnxruntime as ort
# severity 0 enables the most verbose level of onnxruntime logging
ort.set_default_logger_severity(0)
model = ort.InferenceSession('models/trained_models/dna_r9.4.1_e8/hac/0.0.0/5mc/CG/v0/modbase_model.onnx', providers=['CPUExecutionProvider'])

@mattloose (Author)

Hi,

Just jumped on and ran a quick test and I get:

import onnxruntime as ort
ort.set_default_logger_severity(0)
model = ort.InferenceSession('models/trained_models/dna_r9.4.1_e8/hac/0.0.0/5mc/CG/v0/modbase_model.onnx', providers=['CPUExecutionProvider'])
2022-01-03 19:53:44.686663926 [I:onnxruntime:, inference_session.cc:273 operator()] Flush-to-zero and denormal-as-zero are off
2022-01-03 19:53:44.686776610 [I:onnxruntime:, inference_session.cc:280 ConstructorCommon] Creating and using per session threadpools since use_per_session_threads_ is true

Again - I have to kill the process.

Matt
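
One stdlib way to see where a hang like this sits (assuming the interpreter can still start a watchdog thread before blocking) is faulthandler:

import faulthandler

# dump every thread's Python stack to stderr after 30 s, repeating, so a
# constructor stuck inside native code still shows the Python frame that
# called into it, even when Ctrl+C no longer interrupts the process
faulthandler.dump_traceback_later(30, repeat=True)

import onnxruntime as ort
model = ort.InferenceSession(
    'models/trained_models/dna_r9.4.1_e8/hac/0.0.0/5mc/CG/v0/modbase_model.onnx',
    providers=['CPUExecutionProvider'],
)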

@mattloose (Author)

Any progress on this?

@marcus1487 (Contributor)

As noted in the onnxruntime issue, I think this has been tracked down to an onnxruntime bug. Awaiting a reply from the developers there.

@mattloose (Author)

I wondered whether building a container would resolve this, but it does not.

There's been no update on the onnxruntime issue either, so I guess I will have to look at alternate hardware to test this.

@marcus1487 (Contributor)

A new Remora release has been pushed. Please report back if hanging/stalling issues persist.

@snajder-r

@marcus1487 I can confirm that this fixed the issue for me. I still get the warning, but it no longer hangs or segfaults.

@mattloose (Author)

Hi @marcus1487, I can similarly confirm that we are now running.

Thanks.
