
bonito/remora hanging on warning (remora) #216

Closed · mattloose opened this issue Dec 17, 2021 · 21 comments
Labels: bug (Something isn't working)

@mattloose
I'm trying to run bonito on our cluster.

Running:

bonito basecaller [email protected] /path/to/fast5_files --modified-bases 5hmc_5mc --reference /path/to/ref/ref.mmi > out.bam

which results in:

> loading model [email protected]
> loading modified base model
> warning (remora): Remora model for basecall model version (v3.3) not found. Using default Remora model for dna_r9.4.1_e8_hac.

nvidia-smi shows:

Fri Dec 17 16:27:03 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 470.42.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:58:00.0 Off |                    0 |
| N/A   24C    P0    35W / 250W |   1575MiB / 16160MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A    119943      C   .../bonito-dev/bin/python3.6     1571MiB |
+-----------------------------------------------------------------------------+

The code hangs at this warning and nothing further happens.

Trying alternate models doesn't get any further.

Running with -vvv only provides one additional line of output:

> loading model [email protected]
> model basecaller params: {'batchsize': 512, 'chunksize': 10000, 'overlap': 500, 'quantize': None}
> loading modified base model
> warning (remora): Remora model for basecall model version (v3.3) not found. Using default Remora model for dna_r9.4.1_e8.1_hac.

which doesn't really help much.

Has anyone got this working yet?

@iiSeymour iiSeymour self-assigned this Dec 17, 2021
@iiSeymour iiSeymour added the bug Something isn't working label Dec 17, 2021
@mattloose (Author)

For additional info:

bonito basecaller [email protected] gzip_files/ -vvv  > basecalls.fastq
> loading model [email protected]
> model basecaller params: {'batchsize': 512, 'chunksize': 10000, 'overlap': 500, 'quantize': None}
> outputting unaligned fastq
> calling: 2302 reads [00:57,  3.97 reads/s]

is working as expected.

So this seems to be something to do with the Remora integration?

@iiSeymour (Member)

Hey @mattloose

This is a new one!

From the output we can see L57 but not L61 so it's hanging in load_mods_model - @marcus1487 thoughts?

@mattloose (Author)

Yep - we were just playing with that to see if we could work out why!

@marcus1487 (Contributor) commented Dec 17, 2021

This should not be hanging like this, but I think the issue might be the format of the modified bases argument. Can you try running with --modified-bases 5hmc 5mc?

Remora should be hitting the KeyError and sys.exit here. There might be a bad interface between Remora and bonito in the handling of this error.
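
For illustration, a minimal sketch (hypothetical names, not actual bonito or Remora code) of how a sys.exit() inside a library call can present as a hang when the caller traps everything:

import sys

def load_mods_model(modified_bases):
    known = {"5mc", "5hmc"}
    for mod in modified_bases:
        if mod not in known:
            # sys.exit() works by raising SystemExit
            sys.exit(f"unknown modified base: {mod}")
    return object()

try:
    model = load_mods_model(["5hmc_5mc"])  # malformed argument
except BaseException:
    # an over-broad handler (or a worker process that traps everything)
    # swallows SystemExit, and the pipeline waits on a model
    # that will never arrive
    model = None

print("still running with model =", model)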

@mattloose (Author)

That doesn't fix it - it hangs in the same way with 5mc on its own.

@marcus1487 (Contributor)

Could you try running this command to confirm this is an issue with Remora: python -c "import logging; from remora import log; from remora.model_util import load_model; log.CONSOLE.setLevel(logging.DEBUG); mods_model = load_model(pore='dna_r9.4.1_e8', basecall_model_type='hac', modified_bases=['5mc']); print(f'> {mods_model[1][\"alphabet_str\"]}')"
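
The same one-liner, expanded for readability (identical calls, no new behaviour):

import logging

from remora import log
from remora.model_util import load_model

# mirror the bonito integration: load the default 5mC CG-context model
log.CONSOLE.setLevel(logging.DEBUG)
mods_model = load_model(
    pore="dna_r9.4.1_e8",
    basecall_model_type="hac",
    modified_bases=["5mc"],
)
print(f"> {mods_model[1]['alphabet_str']}")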

@mattloose (Author)

Running that command (and editing https://github.com/nanoporetech/remora/blob/569ded04c45fcbd6eb079ead38d543e21c21a215/src/remora/model_util.py#L296 to be 0 not 3) gives us:

python -c "import logging; from remora import log; from remora.model_util import load_model; log.CONSOLE.setLevel(logging.DEBUG); mods_model = load_model(pore='dna_r9.4.1_e8', basecall_model_type='hac', modified_bases=['5mc']); print(f'> {mods_model[1][\"alphabet_str\"]}')"
[18:06:14] Basecall model version not supplied. Using default Remora model for dna_r9.4.1_e8_hac.
[18:06:14] Modified bases model type not supplied. Using default CG.
[18:06:14] Remora model version not specified. Using latest.
2021-12-17 18:06:14.213030263 [I:onnxruntime:, inference_session.cc:273 operator()] Flush-to-zero and denormal-as-zero are off
2021-12-17 18:06:14.213133972 [I:onnxruntime:, inference_session.cc:280 ConstructorCommon] Creating and using per session threadpools since use_per_session_threads_ is true

and everything hangs here requiring a kill to exit.

@mattloose (Author)

Bonito and Remora do not seem to pass a GPU device ID between them, so Remora never seems to be able to access the GPU in the current implementation?

Presumably you are expecting this not to hang here?

@marcus1487 (Contributor)

Thanks for the help debugging @mattloose ! Can you try to upgrade onnxruntime (pip install -U onnxruntime)? I've not extensively tested onnxruntime versions against remora models.

For the GPU device question: Remora models are quite lightweight, so they are currently set up to run only on the CPU (we had trouble internally getting ONNX to run on the GPU, so decided not to support it for the moment). We've not seen too large an impact on runtime with the current suite of CG-context models, though it is noticeable when using the fast models. Loading models only on the CPU may have to change when we eventually move to all-context models. The optimized Guppy version of Remora will likely run on the GPU and in a more efficient framework (coming January).

You are correct, this should not hang here. On my machine I get the output below from this command.

[12:56:25] Basecall model version not supplied. Using default Remora model for dna_r9.4.1_e8_hac.
[12:56:25] Modified bases model type not supplied. Using default CG.
[12:56:25] Remora model version not specified. Using latest.
DBG 12:56:25 : Remora model ONNX providers: ['CPUExecutionProvider'] --- MainProcess-MainThread model_util.py:284
> loaded modified base model to call (alt to C): m=5mC
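
That DBG line shows the providers the ONNX session was constructed with. A short sketch of the underlying onnxruntime calls (the model path here is a placeholder, not a real file):

import onnxruntime as ort

# providers compiled into this build of onnxruntime
print(ort.get_available_providers())

# Remora pins the CPU provider, so the session never touches the GPU
session = ort.InferenceSession(
    "modbase_model.onnx",  # placeholder path
    providers=["CPUExecutionProvider"],
)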

@mattloose (Author) commented Dec 17, 2021

Upgrading onnxruntime does not help. Same error - or rather same freeze point!

@marcus1487 (Contributor)

To confirm once and for all that this is an onnxruntime issue, can you run (from the Remora repo root, so the path resolves): python -c "import onnxruntime as ort; model = ort.InferenceSession('models/trained_models/dna_r9.4.1_e8/hac/0.0.0/5mc/CG/v0/modbase_model.onnx', providers=['CPUExecutionProvider']); print(dict(model.get_modelmeta().custom_metadata_map)['mod_long_names_0'])".

For me this gives 5mC and then exits.
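
Expanded for readability, the one-liner is just:

import onnxruntime as ort

# load the model on CPU and read the modified-base name from its metadata
model = ort.InferenceSession(
    'models/trained_models/dna_r9.4.1_e8/hac/0.0.0/5mc/CG/v0/modbase_model.onnx',
    providers=['CPUExecutionProvider'],
)
meta = dict(model.get_modelmeta().custom_metadata_map)
print(meta['mod_long_names_0'])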

@marcus1487 (Contributor)

@mattloose, any chance you were able to run the last snippet I sent?

@mattloose (Author)

Sorry - yes I did try but I actually get no output from it at all. It appears to hang - I left it running but it was killed by our job submission engine.

I've not had time to try again over the holidays.

@marcus1487 (Contributor)

Thanks! This definitely looks like an ONNX bug. When you're back, could you send the output of the following code (assuming it will be the same as your previous post, but I want to double check) and I'll raise the issue with ONNX.

import onnxruntime as ort
# severity 0 enables the most verbose level of onnxruntime logging
ort.set_default_logger_severity(0)
model = ort.InferenceSession('models/trained_models/dna_r9.4.1_e8/hac/0.0.0/5mc/CG/v0/modbase_model.onnx', providers=['CPUExecutionProvider'])

@mattloose (Author)

Hi,

Just jumped on and ran a quick test and I get:

import onnxruntime as ort
ort.set_default_logger_severity(0)
model = ort.InferenceSession('models/trained_models/dna_r9.4.1_e8/hac/0.0.0/5mc/CG/v0/modbase_model.onnx', providers=['CPUExecutionProvider'])
2022-01-03 19:53:44.686663926 [I:onnxruntime:, inference_session.cc:273 operator()] Flush-to-zero and denormal-as-zero are off
2022-01-03 19:53:44.686776610 [I:onnxruntime:, inference_session.cc:280 ConstructorCommon] Creating and using per session threadpools since use_per_session_threads_ is true

Again - I have to kill the process.

Matt
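
One stdlib way to see where a hang like this sits (assuming the interpreter can still start a watchdog thread before blocking) is faulthandler:

import faulthandler

# dump every thread's Python stack to stderr after 30 s, repeating, so a
# constructor stuck inside native code still shows the Python frame that
# called into it, even when Ctrl+C no longer interrupts the process
faulthandler.dump_traceback_later(30, repeat=True)

import onnxruntime as ort
model = ort.InferenceSession(
    'models/trained_models/dna_r9.4.1_e8/hac/0.0.0/5mc/CG/v0/modbase_model.onnx',
    providers=['CPUExecutionProvider'],
)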

@mattloose (Author)

Any progress on this?

@marcus1487 (Contributor)

As noted in the onnxruntime issue, I think this has been tracked down to an onnxruntime bug. Awaiting a reply from the developers there.

@mattloose (Author)

I wondered whether building a container would resolve this, but it does not.

There's been no update on the onnxruntime issue either, so I guess I will have to look at alternate hardware to test this.

@marcus1487 (Contributor)

A new Remora release has been pushed. Please report back if hanging/stalling issues persist.

@snajder-r

@marcus1487 I can confirm that this fixed the issue for me. I still get the warning, but it no longer hangs or segfaults.

@mattloose (Author)

Hi @marcus1487, I can similarly confirm that we are now running.

Thanks.
