
Performance drop during Read Processing, maxing out "output queue capacity basecalls" but not I/O #46

Closed
AAnnan opened this issue Aug 5, 2020 · 4 comments

Comments


AAnnan commented Aug 5, 2020

Hello,

I experienced a significant drop in read-processing performance with Megalodon (from ~22 reads/s down to ~10 reads/s and still falling) after ~580,000 reads processed (~7 h), at which point the output queue capacity basecalls was maxed out (10000/10000). The output states that this is a sign of an I/O bottleneck, but monitoring tools such as iostat and iotop show the drive (a 2 TB NVMe) is nowhere near fully utilized. What could be going wrong?

Guppy Basecall Server
Guppy Basecall Service Software, (C) Oxford Nanopore Technologies, Limited. Version 4.0.14+8d3226e, client-server API version 2.1.0

Megalodon
Megalodon version: 2.1.1

Megalodon command

megalodon ./final_fast5s/ --guppy-server-path ${GUPPY_DIR}/guppy_basecall_server \
        --guppy-params "-d ./rerio/basecall_models/" \
        --guppy-config res_dna_r941_min_modbases-all-context_v001.cfg \
        --outputs basecalls mod_basecalls mappings mods per_read_mods mod_mappings \
        --output-directory ./mega_results/ \
        --reference $genomeFile \
        --mod-motif Z GCG 1 --mod-motif Z HCG 1 --mod-motif Z GCH 1 \
        --write-mods-text \
        --mod-aggregate-method binary_threshold \
        --mod-binary-threshold 0.875 \
        --mod-output-formats bedmethyl wiggle \
        --mod-map-base-conv C T --mod-map-base-conv Z C \
        --devices 0 --processes 30

IOstat

Device             tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
nvme0n1          87.25         0.73         4.88     439415    2923495
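
For completeness, the same kind of check can be scripted; below is a minimal sketch using psutil (not part of Megalodon), with the device name nvme0n1 taken from the iostat output above and an arbitrary 5 s sampling interval:

```python
# Minimal sketch (not part of Megalodon): sample the NVMe drive's throughput
# and busy time to confirm it is not saturated while a run is in progress.
import time
import psutil

INTERVAL = 5  # seconds between samples
DEVICE = "nvme0n1"  # device name taken from the iostat output above

prev = psutil.disk_io_counters(perdisk=True)[DEVICE]
while True:
    time.sleep(INTERVAL)
    cur = psutil.disk_io_counters(perdisk=True)[DEVICE]
    read_mb_s = (cur.read_bytes - prev.read_bytes) / INTERVAL / 1e6
    write_mb_s = (cur.write_bytes - prev.write_bytes) / INTERVAL / 1e6
    # busy_time is reported in milliseconds on Linux
    busy_pct = (cur.busy_time - prev.busy_time) / (INTERVAL * 1000) * 100
    print(f"read {read_mb_s:6.1f} MB/s  write {write_mb_s:6.1f} MB/s  busy {busy_pct:5.1f}%")
    prev = cur
```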
@marcus1487
Collaborator

I have not experienced a bottleneck in the basecalling queue before, so this is just my best guess, but I suspect the mod_basecalls output is the bottleneck here. This is not a very efficient format for a large dataset: mod basecalls are currently written as a table per read into a single HDF5 file, so that file will hold one dataset per read (~580k in your case). Creating each new dataset in a file of that size may be where the time is going inside HDF5 itself, without the filesystem ever being saturated, which would explain why iostat looks fine. If I am correct, then simply removing the mod_basecalls output from the command should alleviate this issue. This output is intended to be swapped out for an unmapped SAM/BAM/CRAM file as specified by the hts-spec group (see here), which should be much more efficient for storage and retrieval.
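
As a rough illustration (this is not Megalodon's actual code; the file name, dataset names, and per-read table shape are made up), a one-dataset-per-read HDF5 layout can be timed like this, and it is the per-dataset creation cost, not raw disk throughput, that would grow as the file accumulates reads:

```python
# Hypothetical sketch of a one-dataset-per-read HDF5 layout: time how quickly
# new per-read datasets can be created as the file grows.
import time
import numpy as np
import h5py

with h5py.File("mod_basecalls_demo.h5", "w") as h5:
    t0 = time.time()
    for read_idx in range(100_000):
        # one small table per "read", mirroring a one-dataset-per-read layout
        h5.create_dataset(
            f"read_{read_idx:06d}",
            data=np.random.rand(10, 3),  # placeholder per-read mod scores
        )
        if read_idx and read_idx % 10_000 == 0:
            rate = 10_000 / (time.time() - t0)
            print(f"{read_idx} datasets written, {rate:.0f} datasets/s")
            t0 = time.time()
```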


AAnnan commented Aug 5, 2020

Thanks for your reply.

I'll leave the mod_basecalls output out of my next runs, see how it goes, and report back.

Looking forward to the new format from hts-spec.


AAnnan commented Aug 6, 2020

I can confirm that leaving out the mod_basecalls output fixes the problem. Without it, the output queue never filled up and there was no decrease in performance, even after 800k reads processed.

AAnnan closed this as completed Aug 6, 2020
@marcus1487
Collaborator

This should be resolved in the 2.2 release: the mod_basecalls output has been changed to the SAM/BAM/CRAM format. See the README for details.
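
A minimal sketch of reading that output with pysam, assuming the per-read modified-base calls are stored in the hts-spec MM/ML (formerly Mm/Ml) tags of an unmapped BAM; the file name here is a placeholder, so check the README for the actual output path and tag details:

```python
# Sketch: inspect modified-base tags on the first read of an (assumed) unmapped
# BAM produced by the new mod_basecalls output.
import pysam

with pysam.AlignmentFile("mod_basecalls.bam", "rb", check_sq=False) as bam:
    for read in bam.fetch(until_eof=True):
        if read.has_tag("MM") or read.has_tag("Mm"):
            mm = read.get_tag("MM") if read.has_tag("MM") else read.get_tag("Mm")
            ml = read.get_tag("ML") if read.has_tag("ML") else read.get_tag("Ml")
            print(read.query_name, mm, list(ml)[:5])
        break  # just inspect the first read
```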
