pairtools parse, dedup and sort use only a few cores per process #214

jiangshan529 · 2023-10-24T20:33:24Z

Hi, dear all,

I am using pairtools to process some Hi-C experimental files. I want to use 96 cores for the parse and dedup commands, but only 1-2 cores are actually being used. Is there a way to really speed up the processing? Thanks!

samtools view -@ 96 -h 150.bam | pairtools parse -c hg38.chrXYM.chrom.sizes --add-columns mapq --drop-sam --drop-seq --walks-policy all --nproc-in 96 --nproc-out 96  | pairtools sort --nproc 96 --memory 128G --tmpdir ./ --output 150.pairs.gz

Phlya (Maintainer) · 2023-10-24T21:29:16Z

Your command doesn't include dedup, but in any case the answer is basically no. The 96 cores you set here are used only for the reading/writing (de)compression processes, and those don't need that many to be very fast.
The actual algorithms aren't really parallelized. Dedup technically is, but in practice I haven't noticed an increase in performance with multiple cores with the scipy or sklearn backends (and with cython there is no parallelization implemented).

With very large datasets, the typical approach, if you have access to a large server or a cluster, is to split the fastq files into chunks from the beginning and run all steps up until dedup on the chunks; this can speed things up a lot. For dedup you have to merge the chunks first, otherwise duplicates will be missed, and then it just takes time.

You can check out our pipeline that implements the chunking and all other steps: https://github.com/open2c/distiller-nf
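
To illustrate the chunking idea described above, here is a minimal shell sketch (not the distiller-nf implementation) that splits a FASTQ file into fixed-size chunks with `split`. The toy input, the `chunk_` prefix and the chunk size are made-up values for illustration, and the sketch assumes plain 4-line FASTQ records. For paired-end data, both mate files would need to be split with the same chunk size so that mates stay in sync.

```shell
#!/bin/sh
set -eu

# Toy input (hypothetical): 6 FASTQ records of 4 lines each.
workdir=$(mktemp -d)
for i in 1 2 3 4 5 6; do
    printf '@read%s\nACGT\n+\nIIII\n' "$i"
done > "$workdir/sample.fastq"

# A FASTQ record spans 4 lines, so the line count per chunk must be a
# multiple of 4 or a record would be cut in half.
reads_per_chunk=2
split -l $((reads_per_chunk * 4)) "$workdir/sample.fastq" "$workdir/chunk_"

# Count the chunks that were written (chunk_aa, chunk_ab, ...).
n_chunks=0
for f in "$workdir"/chunk_*; do
    n_chunks=$((n_chunks + 1))
done
echo "wrote $n_chunks chunks"
```

Each chunk can then be mapped, parsed and sorted independently; as noted above, the chunks have to be merged again before dedup.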

golobor (Maintainer) · 2024-03-17T14:14:13Z

In the case of pairtools sort, one can also increase the size of the memory buffer, which in turn improves the utilization of multiple cores:
https://superuser.com/questions/938558/sort-parallel-isnt-parallelizing
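
As a rough illustration of the buffer-size point, the sketch below calls GNU `sort` directly with an explicit thread cap (`--parallel`) and buffer size (`-S`), the same two options that appear in the `sort` command line that `pairtools merge` prints further down this thread. It assumes GNU coreutils; the toy data, the `4` threads and the `64M` buffer are arbitrary illustration values, and whether a larger buffer actually helps will depend on the data size and the machine.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)

# Toy data (hypothetical): 20000 pseudo-random integers, one per line.
awk 'BEGIN { srand(1); for (i = 0; i < 20000; i++) print int(rand() * 1000000) }' \
    > "$workdir/in.txt"

# --parallel caps the number of sort threads; -S sets the in-memory buffer.
# -T points temporary files at a directory with enough free space.
LC_ALL=C sort -n --parallel=4 -S 64M -T "$workdir" \
    "$workdir/in.txt" > "$workdir/out.txt"

# sort -c exits non-zero if the file is not sorted.
sorted=no
if LC_ALL=C sort -n -c "$workdir/out.txt"; then
    sorted=yes
fi
echo "sorted: $sorted"
```

With pairtools, the corresponding knobs are the `--nproc` and `--memory` options of `pairtools sort` used in the original command.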

2 replies

jiangshan529 Mar 18, 2024
Author

Hi, golobor! Thanks for the information. Now I have another problem, this time with pairtools merge:

(base) root@ray-m5-15-bc17-head-f5cb7de2-compute:/home/WT# pairtools merge --max-nmerge 7 --nproc 8 --memory 100G --tmpdir ./ *.pairs.gz --output merge_UU.pairs.gz
Killed
Traceback (most recent call last):
  File "/opt/conda/bin/pairtools", line 11, in <module>
    sys.exit(cli())
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pairtools/cli/merge.py", line 134, in merge
    merge_py(
  File "/opt/conda/lib/python3.10/site-packages/pairtools/cli/merge.py", line 254, in merge_py
    subprocess.check_call(command, shell=True, stdout=outstream)
  File "/opt/conda/lib/python3.10/subprocess.py", line 369, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command ' /bin/bash -c 'export LC_COLLATE=C; export LANG=C; sort -k 2,2 -k 4,4 -k 3,3n -k 5,5n -k 8,8 --merge --field-separator=$'''\t''' --parallel=8 --batch-size=7 --temporary-directory=./ -S 100G <(bgzip -dc -@ 1 22G3KTLT3_1_0420637840_MICROC25_S11_S15_L001_UU_dedup.pairs.gz | sed -n -e '''/^[^#]/,$p''') <(bgzip -dc -@ 1 23022FL-05-01-15_S15_L008_UU_dedup.pairs.gz | sed -n -e '''/^[^#]/,$p''') <(bgzip -dc -@ 1 23022FL-06-01-11_S11_L008_UU_dedup.pairs.gz | sed -n -e '''/^[^#]/,$p''')'' returned non-zero exit status 137.

golobor Mar 18, 2024
Maintainer

The process seems to have been killed. I'd assume it ran out of memory?
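
For reference, the exit status 137 in the traceback can be decoded with a couple of lines of shell. Shells report a process terminated by signal N as 128 + N, so 137 corresponds to signal 9 (SIGKILL), which is the signal the Linux out-of-memory killer sends. This is only a sketch for reading the status; it does not prove that memory was the cause here.

```shell
#!/bin/sh
set -eu

status=137                  # exit status reported in the traceback
signal=$((status - 128))    # "killed by signal N" is reported as 128 + N
name=$(kill -l "$signal")   # translate the signal number into its name
echo "exit status $status = 128 + $signal (SIG$name)"
```

If memory is indeed the culprit, one thing to try is a smaller `--memory` value than the 100G passed to `pairtools merge`, since that value is handed to `sort` as its buffer size (`-S 100G` in the command above).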

jiangshan529 (Author) · 2024-03-18T04:24:09Z

All four files were processed using the same pipeline, and one sample was downsampled.

(base) root@ray-m5-15-bc17-head-f5cb7de2-compute:/home/WT# ls -lh
total 9.7G
-rw-r--r-- 1 root root 4.9G Mar 18 03:30 22G3KTLT3_1_0420637840_MICROC25_S11_S15_L001_UU_dedup.pairs.gz
-rw-r--r-- 1 root root 2.2G Mar 18 03:30 23022FL-05-01-15_S15_L008_UU_dedup.pairs.gz
-rw-r--r-- 1 root root 2.2G Mar 18 03:31 23022FL-06-01-11_S11_L008_UU_dedup.pairs.gz
-rw-r--r-- 1 root root 547M Mar 18 03:31 23022FL-07-01-22_S21_L008_UU_dedup_downsample261M.pairs.gz
(base) root@ray-m5-15-bc17-head-f5cb7de2-compute:/home/WT# pairtools merge --max-nmerge 7 --nproc 8 --memory 100G --tmpdir ./ ./*.pairs.gz --output merge_UU.pairs.gz
invalid block header
reader reader_read_block: bug encountered
WARNING:pairtools:Headerless input, please, add the header by pairtools header generate or pairtools header transfer
Traceback (most recent call last):
  File "/opt/conda/bin/pairtools", line 11, in <module>
    sys.exit(cli())
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/pairtools/cli/merge.py", line 134, in merge
    merge_py(
  File "/opt/conda/lib/python3.10/site-packages/pairtools/cli/merge.py", line 192, in merge_py
    raise ValueError("Input pairs cannot contain different columns")
ValueError: Input pairs cannot contain different columns

1 reply

golobor Mar 18, 2024
Maintainer

Please, let's not contaminate this discussion thread with unrelated questions. This seems like a separate issue/question, please open a separate issue.
You got two error messages - one re: headerless input, another re: different columns. Did you check if all input pairs contain headers and if they have the same columns? I would assume that your downsampled set of pairs is missing a header.
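
One way to run that check is sketched below: for each compressed pairs file, print the `#columns:` header line, or a note if the header ends before one is found. The two toy files and their names are made up for illustration, and the column names in the toy header are an assumption about a typical pairtools header. `gzip -dc` is used on the assumption that the bgzip-compressed inputs are gzip-compatible.

```shell
#!/bin/sh
set -eu

workdir=$(mktemp -d)

# Toy inputs (hypothetical): one file with a header, one without.
printf '## pairs format v1.0\n#columns: readID chrom1 pos1 chrom2 pos2 strand1 strand2 pair_type\nr1\tchr1\t10\tchr1\t50\t+\t-\tUU\n' \
    | gzip > "$workdir/with_header.pairs.gz"
printf 'r2\tchr1\t20\tchr2\t70\t+\t+\tUU\n' \
    | gzip > "$workdir/no_header.pairs.gz"

# For each file, print its "#columns:" line, or a note if the header ends
# (first line not starting with "#") before one is found.
for f in "$workdir"/*.pairs.gz; do
    cols=$(gzip -dc "$f" | awk '
        /^#columns:/ { print; found = 1; exit }
        !/^#/        { exit }
        END          { if (!found) print "NO #columns LINE" }')
    printf '%s\t%s\n' "$(basename "$f")" "$cols"
done > "$workdir/report.txt"

cat "$workdir/report.txt"
```

Files reported without a `#columns:` line are candidates for `pairtools header generate` or `pairtools header transfer`, as the warning suggests; files whose column lists differ from each other would trigger the "different columns" error.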
