Bug fix: correct sorting for bedtools intersect when applying blackli…

…st mask Background - To reduce memory usage when applying the blacklist mask (rule repeat_mask), the option -sorted was added to the bedtools intersect command in commit 2703961. - bedtools intersect with the -sorted option requires that all input files be position-sorted. Problem: the input files were sorted, but not in the way bedtools intersect expected them to be - Input 1 = chromatin alignment BAM file - This is produced by Bowtie 2 followed by samtools sort (rule bowtie2_align), then the chromosomes are renamed in the rule rename_and_filter_chr. - The chromosome renaming step is essential to ensure that the chromosome names in the BAM file (which originally come from the Bowtie 2 index) match the chromosome names in the blacklist mask file. - This BAM file is coordinate sorted using the order of chromosomes in the BAM header, which uses the order of chromosomes in the chromosome name map file. Importantly, this order is user-defined and is not necessarily lexicographically ordered! - Input 2 = blacklist mask file - In the original upstream merge_mask rule, this mask was coordinate sorted using sort -k1,1 -k2,2n, which will lexicographically order the chromosome names. Fix: The fix consisted of 2 changes 1. Use the same chromosome sorting order for the BAM alignment file as the mask. Specifically, sort the mask file using the order of chromosomes in the chromosome name map file. 2. bedtools intersect with the -sorted flag alone appears to expect chromosomes sorted lexicographically. Consequently, a "genome file" (bedtools terminology) needs to be passed using the -g option that specifies the order of chromosomes desired from the chromosome name map. See also issues 1116 and 1117 on the bedtools GitHub repository (arq5x/bedtools2#1116, arq5x/bedtools2#1117). Other changes - Removed documentation in README.md and code comments regarding requirement for Python 3.9+. - Commit 85f3574 removed the use of the Python 3.9+-specific dict | dict operation in rename_and_filter_chr.py:reheader() - helpers.py:parse_chrom_map() now uses file_open() instead of open() for reading a chromosome name map file
GuttmanLab · Feb 8, 2025 · 008037b · 008037b
1 parent 08f7306
commit 008037b
Show file tree

Hide file tree

Showing 4 changed files with 139 additions and 27 deletions.
diff --git a/README.md b/README.md
@@ -92,8 +92,7 @@ Code intepreters and runtimes: The pipeline relies on scripts written in Java, B
 - Python: 3.9+ (the `envs/chipdip.yaml` conda environment file currently uses version 3.10)
 
 Packages: Additional required third-party programs and packages are specified in conda environments described in `envs/`.
-- Note: Other versions of the same software programs will likely work, but we have not tested all of them. Some specific requirements are discussed below.
-  - Some of the scripts (such as `rename_and_filter_chr.py`) take advantage of features introduced in Python 3.9.
+- Note: Other recent versions of the same software programs will likely work, but we have not tested all of them. Some specific requirements are discussed below.
   - The version of `deeptools` used (3.5.2) requires a `matplotlib` version between 3.1.0 and 3.7.5, since later versions of matplotlib deprecate some functions used by `deeptools` version 3.5.2. Newer versions of `deeptools` have been updated to support the newer `matplotlib` APIs.
   - NumPy version 2.x is not currently supported.
 
@@ -356,7 +355,7 @@ However, the pipeline directory can also be kept separate and used repeatedly on
 7. <a name="chrom-map">`chrom_map.txt`</a>: Chromosome names map - tab-delimited text file specifying which chromosomes from the Bowtie 2 index to keep and how to rename them (if at all).
    - Required? No, but necessary if using a [blacklist mask](#blacklist-bed) that uses different chromosome names than used in the Bowtie 2 index.
    - [`config.yaml`](#config-yaml) key to specify the path to this file: `path_chrom_map`
-   - Used by: `scripts/python/rename_and_filter_chr.py` (Snakefile `rule rename_and_filter_chr` and `rule effective_genome_size`)
+   - Used by: `scripts/python/rename_and_filter_chr.py` (Snakefile `rule rename_and_filter_chr`, `rule merge_mask`, and `rule effective_genome_size`)
    - Column 1 specifies chromosomes (following naming convention used in the index) to keep.
      - The order of chromosomes provided here is maintained in the SAM/BAM file
        header, and consequently specifies the coordinate sorting order at the

diff --git a/Snakefile b/Snakefile
@@ -166,6 +166,7 @@ else:
 
 path_chrom_map = config.get("path_chrom_map")
 if path_chrom_map in (None, ""):
+    path_chrom_map = ""
     print("Chromosome names not specified, will use all chromosomes in the Bowtie 2 index.",
           file=sys.stderr)
 
@@ -774,7 +775,7 @@ rule rename_and_filter_chr:
     log:
         os.path.join(DIR_LOGS, "{sample}.{splitid}.rename_and_filter_chr.log")
     params:
-        chrom_map = f"--chrom_map '{path_chrom_map}'" if path_chrom_map not in (None, "") else ""
+        chrom_map = f"--chrom_map '{path_chrom_map}'" if path_chrom_map != "" else ""
     conda:
         conda_env
     threads:
@@ -786,27 +787,60 @@ rule rename_and_filter_chr:
         '''
 
 # Merge mask
-# - This should increase the speed of the repeat_mask rule compared to a bed file with many overlapping regions.
-# - The merged mask is also used in the generate_bigwigs rule, if that is enabled.
+# - Merging overlapping regions should increase the speed of running bedtools intersect in the repeat_mask rule.
+# - The merged mask is also used in the generate_bigwigs rule.
+# - The merged mask is sorted as follows:
+#   - If a chromosome name map is provided, then it is sorted by the new chromosome names (i.e., according to names in
+#     the second column of the chromosome name map file). Entries with chromosome names not in the chromosome name map
+#     are discarded.
+#   - If a chromosome name map is not provided, then it is sorted by the order of chromosomes in the Bowtie 2 index.
 rule merge_mask:
     output:
-        temp(os.path.join(DIR_WORKUP, "mask_merge.bed"))
+        bed = temp(os.path.join(DIR_WORKUP, "mask_merge.bed")),
+        genome = temp(os.path.join(DIR_WORKUP, "mask_merge.genome"))
     conda:
         conda_env
     shell:
         '''
         if [ -n "{mask}" ]; then
-            sort -k1,1 -k2,2n "{mask}" | bedtools merge > "{output}"
+            if [ -n "{path_chrom_map}" ]; then
+                # chromosome name map is provided --> sort chromosomes by the new chromosome names
+                sort -k1,1 -k2,2n "{mask}" |
+                bedtools merge |
+                python "{rename_and_filter_chr}" --bed --chrom_map "{path_chrom_map}" - > "{output.bed}"
+
+                # create genome file for bedtools intersect
+                join -t $'\t' \
+                    <(grep -E -e '^[^"]\s*\S+\s*' "{path_chrom_map}" ) \
+                    <(bowtie2-inspect -s "{bowtie2_index}" |
+                      grep -E -e '^Sequence-[0-9]+' |
+                      cut -f 2,3) |
+                cut -f 2,3 > "{output.genome}"
+            else
+                # chromosome name map is not provided --> sort chromosomes by their order in the Bowtie 2 index
+                sort -k1,1 -k2,2n "{mask}" |
+                bedtools merge |
+                python "{rename_and_filter_chr}" \
+                    --bed \
+                    --chrom_map <(bowtie2-inspect -n "{bowtie2_index}" | sed -E 's/(\S+)/\1\t\1/') \
+                    - > "{output.bed}"
+
+                # create genome file for bedtools intersect
+                bowtie2-inspect -s "{bowtie2_index}" |
+                grep -E -e '^Sequence-[0-9]+' |
+                cut -f 2,3 > "{output.genome}"
+            fi
         else
-            touch "{output}"
+            touch "{output.bed}" "{output.genome}"
         fi
         '''
 
 # Repeat mask aligned DNA reads
 rule repeat_mask:
     input:
         bam = os.path.join(DIR_WORKUP, "alignments_parts", "{sample}.part_{splitid}.DNA.chr.bam"),
-        mask = os.path.join(DIR_WORKUP, "mask_merge.bed")
+        mask = os.path.join(DIR_WORKUP, "mask_merge.bed"),
+        genome = os.path.join(DIR_WORKUP, "mask_merge.genome")
     output:
         os.path.join(DIR_WORKUP, "alignments_parts", "{sample}.part_{splitid}.DNA.chr.masked.bam")
     log:
@@ -817,7 +851,13 @@ rule repeat_mask:
         '''
         {{
             if [ -n "{mask}" ]; then
-                bedtools intersect -v -a "{input.bam}" -b "{input.mask}" -sorted > "{output}"
+                # -v: only report entries in A that have no overlap in B
+                bedtools intersect \
+                    -v \
+                    -a "{input.bam}" \
+                    -b "{input.mask}" \
+                    -sorted \
+                    -g "{input.genome}" > "{output}"
             else
                 echo "No mask file specified, skipping masking."
                 ln -s "{input.bam}" "{output}"
@@ -1211,7 +1251,7 @@ rule effective_genome_size:
     log:
         os.path.join(DIR_LOGS, "effective_genome_size.log")
     params:
-        chrom_map = f"--chrom_map '{path_chrom_map}'" if path_chrom_map not in (None, "") else "",
+        chrom_map = f"--chrom_map '{path_chrom_map}'" if path_chrom_map != "" else ""
     conda:
         conda_env
     threads:

diff --git a/scripts/python/helpers.py b/scripts/python/helpers.py
@@ -108,7 +108,7 @@ def parse_chrom_map(path):
     Parse a chromosome name map file to a dict mapping old chromosome names to new names.
     """
     chrom_map = dict()
-    with open(path, 'rt') as f:
+    with file_open(path, mode='rt') as f:
         for line in f:
             if line.strip() == '' or line.strip().startswith('"'):
                 continue

diff --git a/scripts/python/rename_and_filter_chr.py b/scripts/python/rename_and_filter_chr.py
@@ -1,5 +1,5 @@
 """
-Rename and select chromosomes in a FASTA or BAM file.
+Rename chromosomes in a FASTA, BED, or BAM file.
 """
 
 import argparse
@@ -12,6 +12,7 @@
 import sys
 sys.path.append(os.path.abspath(os.path.dirname(__file__)))
 import helpers
+import pandas as pd
 import pysam
 
 
@@ -68,6 +69,13 @@ def main():
                 args.output,
                 chrom_map
             )
+        elif args.bed:
+            filter_bed(
+                args.input,
+                args.output,
+                chrom_map,
+                sort=args.sort
+            )
         else:
             filter_reads(
                 args.input,
@@ -81,28 +89,32 @@ def main():
             )
 
 
-
 def parse_arguments():
     parser = argparse.ArgumentParser(
         description=(
-            "Rename and select chromosomes in a FASTA or BAM file. "
-            "For BAM files, keep only reads aligned to selected chromosomes, and "
-            "reorder the chromosomes in the BAM file header."
+            "Rename chromosomes. For BAM/BED inputs, keep only reads/regions on those chromosomes. Additionally for "
+            "BAM files, reorder the chromosomes in the BAM file header."
         )
     )
     parser.add_argument(
         "input",
         metavar="in.bam|in.fasta(.gz)|-",
         help=(
-            "Input file. Assumed to be BAM format unless the -f/--fasta flag is used. "
+            "Input file. Assumed to be BAM format unless the -f/--fasta or -b/--bed flags are used. "
             "Use '-' for standard input."
         )
     )
-    parser.add_argument(
+    group = parser.add_mutually_exclusive_group()
+    group.add_argument(
         "-f", "--fasta",
         action="store_true",
         help="Input file is FASTA format."
     )
+    group.add_argument(
+        "-b", "--bed",
+        action="store_true",
+        help="Input file is BED format. (Current implementation is not memory efficient.)"
+    )
     parser.add_argument(
         "-o", "--output",
         metavar="out.bam|out.fasta",
@@ -134,10 +146,10 @@ def parse_arguments():
         choices=("auto", "true", "false"),
         default="auto",
         help=(
-            "(Only relevant if -c/--chrom_map is specified and the input is a BAM file) "
-            "Whether to sort the processed BAM file. "
-            "If 'auto', will only re-sort if the order of chromosomes given in the chromosome name map "
-            "is different than the existing chromosome order."
+            "Only relevant if -c/--chrom_map is specified and input type is not FASTA. "
+            "(For BAM input) If 'auto', will only re-sort if the order of chromosomes given in the chromosome name map "
+            "is different than the existing chromosome order. "
+            "(For BED input) If 'auto' or 'true', sort based on the order of chromosomes in the chromosome name map."
         )
     )
     parser.add_argument(
@@ -177,9 +189,6 @@ def reheader(old_header, chrom_map):
     - retains_sorting: bool
         Whether the order of entries in chrom_map is consistent with existing order
         of chromosomes in the old header.
-
-    Note: The current implementation of this function takes advantage of dict features
-    defined/introduced in Python versions >= 3.9.
     """
     # copy existing BAM header to a new header dict without reference sequences
     new_header = copy.deepcopy(old_header)
@@ -311,6 +320,9 @@ def filter_fasta(path_in, path_out, chrom_map) -> None:
     # convert path_in to a text IO stream if necessary
     if isinstance(path_in, io.IOBase) and not isinstance(path_in, io.TextIOBase):
         first_two_bytes = path_in.peek(2)[:2]
+        if len(first_two_bytes) != 2:
+            first_two_bytes = path_in.read(2)
+            path_in.seek(0)
         if first_two_bytes == helpers.GZIP_MAGIC_NUMBER:
             f = gzip.GzipFile(fileobj=path_in, mode="rb")
         else:
@@ -343,5 +355,66 @@ def filter_fasta(path_in, path_out, chrom_map) -> None:
         elif include:
             path_out.write(line)
 
+
+def filter_bed(path_in, path_out, chrom_map, sort="auto"):
+    """
+    Rename chromosomes in a BED file according to a chromosome name map, and and output only regions on chromosomes in
+    the chromosome name map.
+
+    Args
+    - path_in: str or file object
+        Path to input BED file or file object.
+        Supports normal text or gzip-compressed text.
+    - path_out: str, file object, or None
+        Path to output BED file (supports .gz extension for gzip compression) or file object.
+        If None, writes to standard out.
+    - chrom_map: dict (str -> str)
+        Map from old reference sequence names to new reference sequence names.
+    - sort: ('auto', 'true', or 'false'). default='auto'
+        Whether to coordinate sort the reads before writing to path_out.
+    """
+    if path_out is None:
+        path_out = sys.stdout
+
+    if isinstance(path_in, io.IOBase) and not isinstance(path_in, io.TextIOBase):
+        first_two_bytes = path_in.peek(2)[:2]
+        if len(first_two_bytes) != 2:
+            first_two_bytes = path_in.read(2)
+            path_in.seek(0)
+        if first_two_bytes == helpers.GZIP_MAGIC_NUMBER:
+            path_in = gzip.GzipFile(fileobj=path_in, mode="rb")
+
+    DTYPE_CHROM_NEW = pd.CategoricalDtype(categories=chrom_map.values(), ordered=True)
+    df = pd.read_csv(
+        path_in,
+        sep='\t',
+        header=None,
+        index_col=False
+    )
+    df.columns = ['chrom', 'start', 'end'] + list(df.columns[3:])
+    df = (
+        # filter chromosomes
+        df.loc[df['chrom'].isin(chrom_map.keys())]
+
+        # rename chromosomes
+        .pipe(lambda tb: (
+            tb.assign(chrom=tb['chrom'].map(chrom_map))
+            .astype({
+                'chrom': DTYPE_CHROM_NEW,
+                'start': int,
+                'end': int
+            })
+        ))
+    )
+    if sort in ("true", "auto"):
+        df.sort_values(by=['chrom', 'start', 'end'], inplace=True)
+    df.to_csv(
+        path_out,
+        sep='\t',
+        header=False,
+        index=False
+    )
+
+
 if __name__ == "__main__":
     main()