Skip to content

Commit

Permalink
Bug fix: correct sorting for bedtools intersect when applying blackli…
Browse files Browse the repository at this point in the history
…st mask

Background
- To reduce memory usage when applying the blacklist mask (rule repeat_mask), the option -sorted was added to the bedtools intersect command in commit 2703961.
- bedtools intersect with the -sorted option requires that all input files be position-sorted.

Problem: the input files were sorted, but not in the way bedtools intersect expected them to be
- Input 1 = chromatin alignment BAM file
  - This is produced by Bowtie 2 followed by samtools sort (rule bowtie2_align), then the chromosomes are renamed in the rule rename_and_filter_chr.
    - The chromosome renaming step is essential to ensure that the chromosome names in the BAM file (which originally come from the Bowtie 2 index) match the chromosome names in the blacklist mask file.
    - This BAM file is coordinate sorted using the order of chromosomes in the BAM header, which uses the order of chromosomes in the chromosome name map file. Importantly, this order is user-defined and is not necessarily lexicographically ordered!
- Input 2 = blacklist mask file
  - In the original upstream merge_mask rule, this mask was coordinate sorted using sort -k1,1 -k2,2n, which will lexicographically order the chromosome names.

Fix: The fix consisted of 2 changes
1. Use the same chromosome sorting order for the BAM alignment file as the mask. Specifically, sort the mask file using the order of chromosomes in the chromosome name map file.
2. bedtools intersect with the -sorted flag alone appears to expect chromosomes sorted lexicographically. Consequently, a "genome file" (bedtools terminology) needs to be passed using the -g option that specifies the order of chromosomes desired from the chromosome name map. See also issues 1116 and 1117 on the bedtools GitHub repository (arq5x/bedtools2#1116, arq5x/bedtools2#1117).

Other changes
- Removed documentation in README.md and code comments regarding requirement for Python 3.9+.
  - Commit 85f3574 removed the use of the Python 3.9+-specific dict | dict operation in rename_and_filter_chr.py:reheader()
- helpers.py:parse_chrom_map() now uses file_open() instead of open() for reading a chromosome name map file
  • Loading branch information
bentyeh committed Feb 8, 2025
1 parent 08f7306 commit 008037b
Show file tree
Hide file tree
Showing 4 changed files with 139 additions and 27 deletions.
5 changes: 2 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -92,8 +92,7 @@ Code intepreters and runtimes: The pipeline relies on scripts written in Java, B
- Python: 3.9+ (the `envs/chipdip.yaml` conda environment file currently uses version 3.10)

Packages: Additional required third-party programs and packages are specified in conda environments described in `envs/`.
- Note: Other versions of the same software programs will likely work, but we have not tested all of them. Some specific requirements are discussed below.
- Some of the scripts (such as `rename_and_filter_chr.py`) take advantage of features introduced in Python 3.9.
- Note: Other recent versions of the same software programs will likely work, but we have not tested all of them. Some specific requirements are discussed below.
- The version of `deeptools` used (3.5.2) requires a `matplotlib` version between 3.1.0 and 3.7.5, since later versions of matplotlib deprecate some functions used by `deeptools` version 3.5.2. Newer versions of `deeptools` have been updated to support the newer `matplotlib` APIs.
- NumPy version 2.x is not currently supported.

Expand Down Expand Up @@ -356,7 +355,7 @@ However, the pipeline directory can also be kept separate and used repeatedly on
7. <a name="chrom-map">`chrom_map.txt`</a>: Chromosome names map - tab-delimited text file specifying which chromosomes from the Bowtie 2 index to keep and how to rename them (if at all).
- Required? No, but necessary if using a [blacklist mask](#blacklist-bed) that uses different chromosome names than used in the Bowtie 2 index.
- [`config.yaml`](#config-yaml) key to specify the path to this file: `path_chrom_map`
- Used by: `scripts/python/rename_and_filter_chr.py` (Snakefile `rule rename_and_filter_chr` and `rule effective_genome_size`)
- Used by: `scripts/python/rename_and_filter_chr.py` (Snakefile `rule rename_and_filter_chr`, `rule merge_mask`, and `rule effective_genome_size`)
- Column 1 specifies chromosomes (following naming convention used in the index) to keep.
- The order of chromosomes provided here is maintained in the SAM/BAM file
header, and consequently specifies the coordinate sorting order at the
Expand Down
58 changes: 49 additions & 9 deletions Snakefile
Original file line number Diff line number Diff line change
Expand Up @@ -166,6 +166,7 @@ else:

path_chrom_map = config.get("path_chrom_map")
if path_chrom_map in (None, ""):
path_chrom_map = ""
print("Chromosome names not specified, will use all chromosomes in the Bowtie 2 index.",
file=sys.stderr)

Expand Down Expand Up @@ -774,7 +775,7 @@ rule rename_and_filter_chr:
log:
os.path.join(DIR_LOGS, "{sample}.{splitid}.rename_and_filter_chr.log")
params:
chrom_map = f"--chrom_map '{path_chrom_map}'" if path_chrom_map not in (None, "") else ""
chrom_map = f"--chrom_map '{path_chrom_map}'" if path_chrom_map != "" else ""
conda:
conda_env
threads:
Expand All @@ -786,27 +787,60 @@ rule rename_and_filter_chr:
'''

# Merge mask
# - This should increase the speed of the repeat_mask rule compared to a bed file with many overlapping regions.
# - The merged mask is also used in the generate_bigwigs rule, if that is enabled.
# - Merging overlapping regions should increase the speed of running bedtools intersect in the repeat_mask rule.
# - The merged mask is also used in the generate_bigwigs rule.
# - The merged mask is sorted as follows:
# - If a chromosome name map is provided, then it is sorted by the new chromosome names (i.e., according to names in
# the second column of the chromosome name map file). Entries with chromosome names not in the chromosome name map
# are discarded.
# - If a chromosome name map is not provided, then it is sorted by the order of chromosomes in the Bowtie 2 index.
rule merge_mask:
output:
temp(os.path.join(DIR_WORKUP, "mask_merge.bed"))
bed = temp(os.path.join(DIR_WORKUP, "mask_merge.bed")),
genome = temp(os.path.join(DIR_WORKUP, "mask_merge.genome"))
conda:
conda_env
shell:
'''
if [ -n "{mask}" ]; then
sort -k1,1 -k2,2n "{mask}" | bedtools merge > "{output}"
if [ -n "{path_chrom_map}" ]; then
# chromosome name map is provided --> sort chromosomes by the new chromosome names
sort -k1,1 -k2,2n "{mask}" |
bedtools merge |
python "{rename_and_filter_chr}" --bed --chrom_map "{path_chrom_map}" - > "{output.bed}"
# create genome file for bedtools intersect
join -t $'\t' \
<(grep -E -e '^[^"]\s*\S+\s*' "{path_chrom_map}" ) \
<(bowtie2-inspect -s "{bowtie2_index}" |
grep -E -e '^Sequence-[0-9]+' |
cut -f 2,3) |
cut -f 2,3 > "{output.genome}"
else
# chromosome name map is not provided --> sort chromosomes by their order in the Bowtie 2 index
sort -k1,1 -k2,2n "{mask}" |
bedtools merge |
python "{rename_and_filter_chr}" \
--bed \
--chrom_map <(bowtie2-inspect -n "{bowtie2_index}" | sed -E 's/(\S+)/\1\t\1/') \
- > "{output.bed}"
# create genome file for bedtools intersect
bowtie2-inspect -s "{bowtie2_index}" |
grep -E -e '^Sequence-[0-9]+' |
cut -f 2,3 > "{output.genome}"
fi
else
touch "{output}"
touch "{output.bed}" "{output.genome}"
fi
'''

# Repeat mask aligned DNA reads
rule repeat_mask:
input:
bam = os.path.join(DIR_WORKUP, "alignments_parts", "{sample}.part_{splitid}.DNA.chr.bam"),
mask = os.path.join(DIR_WORKUP, "mask_merge.bed")
mask = os.path.join(DIR_WORKUP, "mask_merge.bed"),
genome = os.path.join(DIR_WORKUP, "mask_merge.genome")
output:
os.path.join(DIR_WORKUP, "alignments_parts", "{sample}.part_{splitid}.DNA.chr.masked.bam")
log:
Expand All @@ -817,7 +851,13 @@ rule repeat_mask:
'''
{{
if [ -n "{mask}" ]; then
bedtools intersect -v -a "{input.bam}" -b "{input.mask}" -sorted > "{output}"
# -v: only report entries in A that have no overlap in B
bedtools intersect \
-v \
-a "{input.bam}" \
-b "{input.mask}" \
-sorted \
-g "{input.genome}" > "{output}"
else
echo "No mask file specified, skipping masking."
ln -s "{input.bam}" "{output}"
Expand Down Expand Up @@ -1211,7 +1251,7 @@ rule effective_genome_size:
log:
os.path.join(DIR_LOGS, "effective_genome_size.log")
params:
chrom_map = f"--chrom_map '{path_chrom_map}'" if path_chrom_map not in (None, "") else "",
chrom_map = f"--chrom_map '{path_chrom_map}'" if path_chrom_map != "" else ""
conda:
conda_env
threads:
Expand Down
2 changes: 1 addition & 1 deletion scripts/python/helpers.py
Original file line number Diff line number Diff line change
Expand Up @@ -108,7 +108,7 @@ def parse_chrom_map(path):
Parse a chromosome name map file to a dict mapping old chromosome names to new names.
"""
chrom_map = dict()
with open(path, 'rt') as f:
with file_open(path, mode='rt') as f:
for line in f:
if line.strip() == '' or line.strip().startswith('"'):
continue
Expand Down
101 changes: 87 additions & 14 deletions scripts/python/rename_and_filter_chr.py
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
"""
Rename and select chromosomes in a FASTA or BAM file.
Rename chromosomes in a FASTA, BED, or BAM file.
"""

import argparse
Expand All @@ -12,6 +12,7 @@
import sys
sys.path.append(os.path.abspath(os.path.dirname(__file__)))
import helpers
import pandas as pd
import pysam


Expand Down Expand Up @@ -68,6 +69,13 @@ def main():
args.output,
chrom_map
)
elif args.bed:
filter_bed(
args.input,
args.output,
chrom_map,
sort=args.sort
)
else:
filter_reads(
args.input,
Expand All @@ -81,28 +89,32 @@ def main():
)



def parse_arguments():
parser = argparse.ArgumentParser(
description=(
"Rename and select chromosomes in a FASTA or BAM file. "
"For BAM files, keep only reads aligned to selected chromosomes, and "
"reorder the chromosomes in the BAM file header."
"Rename chromosomes. For BAM/BED inputs, keep only reads/regions on those chromosomes. Additionally for "
"BAM files, reorder the chromosomes in the BAM file header."
)
)
parser.add_argument(
"input",
metavar="in.bam|in.fasta(.gz)|-",
help=(
"Input file. Assumed to be BAM format unless the -f/--fasta flag is used. "
"Input file. Assumed to be BAM format unless the -f/--fasta or -b/--bed flags are used. "
"Use '-' for standard input."
)
)
parser.add_argument(
group = parser.add_mutually_exclusive_group()
group.add_argument(
"-f", "--fasta",
action="store_true",
help="Input file is FASTA format."
)
group.add_argument(
"-b", "--bed",
action="store_true",
help="Input file is BED format. (Current implementation is not memory efficient.)"
)
parser.add_argument(
"-o", "--output",
metavar="out.bam|out.fasta",
Expand Down Expand Up @@ -134,10 +146,10 @@ def parse_arguments():
choices=("auto", "true", "false"),
default="auto",
help=(
"(Only relevant if -c/--chrom_map is specified and the input is a BAM file) "
"Whether to sort the processed BAM file. "
"If 'auto', will only re-sort if the order of chromosomes given in the chromosome name map "
"is different than the existing chromosome order."
"Only relevant if -c/--chrom_map is specified and input type is not FASTA. "
"(For BAM input) If 'auto', will only re-sort if the order of chromosomes given in the chromosome name map "
"is different than the existing chromosome order. "
"(For BED input) If 'auto' or 'true', sort based on the order of chromosomes in the chromosome name map."
)
)
parser.add_argument(
Expand Down Expand Up @@ -177,9 +189,6 @@ def reheader(old_header, chrom_map):
- retains_sorting: bool
Whether the order of entries in chrom_map is consistent with existing order
of chromosomes in the old header.
Note: The current implementation of this function takes advantage of dict features
defined/introduced in Python versions >= 3.9.
"""
# copy existing BAM header to a new header dict without reference sequences
new_header = copy.deepcopy(old_header)
Expand Down Expand Up @@ -311,6 +320,9 @@ def filter_fasta(path_in, path_out, chrom_map) -> None:
# convert path_in to a text IO stream if necessary
if isinstance(path_in, io.IOBase) and not isinstance(path_in, io.TextIOBase):
first_two_bytes = path_in.peek(2)[:2]
if len(first_two_bytes) != 2:
first_two_bytes = path_in.read(2)
path_in.seek(0)
if first_two_bytes == helpers.GZIP_MAGIC_NUMBER:
f = gzip.GzipFile(fileobj=path_in, mode="rb")
else:
Expand Down Expand Up @@ -343,5 +355,66 @@ def filter_fasta(path_in, path_out, chrom_map) -> None:
elif include:
path_out.write(line)


def filter_bed(path_in, path_out, chrom_map, sort="auto"):
"""
Rename chromosomes in a BED file according to a chromosome name map, and and output only regions on chromosomes in
the chromosome name map.
Args
- path_in: str or file object
Path to input BED file or file object.
Supports normal text or gzip-compressed text.
- path_out: str, file object, or None
Path to output BED file (supports .gz extension for gzip compression) or file object.
If None, writes to standard out.
- chrom_map: dict (str -> str)
Map from old reference sequence names to new reference sequence names.
- sort: ('auto', 'true', or 'false'). default='auto'
Whether to coordinate sort the reads before writing to path_out.
"""
if path_out is None:
path_out = sys.stdout

if isinstance(path_in, io.IOBase) and not isinstance(path_in, io.TextIOBase):
first_two_bytes = path_in.peek(2)[:2]
if len(first_two_bytes) != 2:
first_two_bytes = path_in.read(2)
path_in.seek(0)
if first_two_bytes == helpers.GZIP_MAGIC_NUMBER:
path_in = gzip.GzipFile(fileobj=path_in, mode="rb")

DTYPE_CHROM_NEW = pd.CategoricalDtype(categories=chrom_map.values(), ordered=True)
df = pd.read_csv(
path_in,
sep='\t',
header=None,
index_col=False
)
df.columns = ['chrom', 'start', 'end'] + list(df.columns[3:])
df = (
# filter chromosomes
df.loc[df['chrom'].isin(chrom_map.keys())]

# rename chromosomes
.pipe(lambda tb: (
tb.assign(chrom=tb['chrom'].map(chrom_map))
.astype({
'chrom': DTYPE_CHROM_NEW,
'start': int,
'end': int
})
))
)
if sort in ("true", "auto"):
df.sort_values(by=['chrom', 'start', 'end'], inplace=True)
df.to_csv(
path_out,
sep='\t',
header=False,
index=False
)


if __name__ == "__main__":
main()

0 comments on commit 008037b

Please sign in to comment.