You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Note: this could be considered a sub-issue of issue #1116; i.e., if the feature requested in issue #1116 is implemented, then this issue would become obsolete.
Issue
For bedtools intersect -sorted operations using non-lexicographically sorted chromosomes, users are instructed to supply a genome file using the -g <genome_file> option.
Currently, bedtools intersect performs at least 2 checks on the genome file:
it must have 2 columns (tab-delimited)
the second column (representing chromosome length) must be non-negative
For the purpose of bedtools intersect, however, the second column is irrelevant. Would it be possible to remove these 2 checks (and any additional irrelevant checks)?
Minimum working example
Consider the following files:
a.bed
chr1 1 10
b.bed
chr10 1 10
chr1 5 15
genome1.bed
chr10
chr1
genome2.bed
chr10 0
chr1 0
genome3.bed
chr10 1
chr1 1
Let's try running bedtools intersect with the -sorted option:
bedtools intersect -a a.bed -b b.bed -sorted
# or an equivalent one-liner without creating a.bed and b.bed# bedtools intersect -a <(echo -e "chr1\t1\t10\n") -b <(echo -e "chr10\t1\t10\nchr1\t5\t15\n") -sorted
Output:
ERROR: Database file b.bed contains chromosome chr10, but the query file does not.
Please rerun with the -g option for a genome file.
See documentation for details.
Now, we try to provide a 1-column file specifying our custom chromosome ordering:
Note that the genome file used chromosome sizes incompatible with the actual BED output, but this is not checked by bedtools: the genome file said chromosome chr1 had length 1, yet the output contains a region from positions 5 to 10. Therefore, even in the current (v.2.31.1) implementation of bedtools, the 2nd column of the genome file is clearly irrelevant to the actual intersect operation.
Real-world motivation
My motivating example was trying to filter out reads in a BAM file that overlapped specific regions. Both my BAM file and my mask BED file were pre-sorted according to an alphanumeric chromosome ordering (e.g., chr1, chr2, ..., chr10, chr11, ...).
Consequently, bedtools intersect -sorted requires that I provide a genome file for non-lexicographically sorted chromosomes. Ideally, given that both inputs were already grouped by chromosome (and coordinate sorted within each chromosome group), there would be no need to provide a genome file (see issue #1116). But if that would be harder to implement, at least the requirement for a 2-column genome file could be eliminated; a simple 1-column text file specifying the chromosome order should suffice.
The text was updated successfully, but these errors were encountered:
bentyeh
added a commit
to GuttmanLab/chipdip-pipeline
that referenced
this issue
Feb 8, 2025
…st mask
Background
- To reduce memory usage when applying the blacklist mask (rule repeat_mask), the option -sorted was added to the bedtools intersect command in commit 2703961.
- bedtools intersect with the -sorted option requires that all input files be position-sorted.
Problem: the input files were sorted, but not in the way bedtools intersect expected them to be
- Input 1 = chromatin alignment BAM file
- This is produced by Bowtie 2 followed by samtools sort (rule bowtie2_align), then the chromosomes are renamed in the rule rename_and_filter_chr.
- The chromosome renaming step is essential to ensure that the chromosome names in the BAM file (which originally come from the Bowtie 2 index) match the chromosome names in the blacklist mask file.
- This BAM file is coordinate sorted using the order of chromosomes in the BAM header, which uses the order of chromosomes in the chromosome name map file. Importantly, this order is user-defined and is not necessarily lexicographically ordered!
- Input 2 = blacklist mask file
- In the original upstream merge_mask rule, this mask was coordinate sorted using sort -k1,1 -k2,2n, which will lexicographically order the chromosome names.
Fix: The fix consisted of 2 changes
1. Use the same chromosome sorting order for the BAM alignment file as the mask. Specifically, sort the mask file using the order of chromosomes in the chromosome name map file.
2. bedtools intersect with the -sorted flag alone appears to expect chromosomes sorted lexicographically. Consequently, a "genome file" (bedtools terminology) needs to be passed using the -g option that specifies the order of chromosomes desired from the chromosome name map. See also issues 1116 and 1117 on the bedtools GitHub repository (arq5x/bedtools2#1116, arq5x/bedtools2#1117).
Other changes
- Removed documentation in README.md and code comments regarding requirement for Python 3.9+.
- Commit 85f3574 removed the use of the Python 3.9+-specific dict | dict operation in rename_and_filter_chr.py:reheader()
- helpers.py:parse_chrom_map() now uses file_open() instead of open() for reading a chromosome name map file
Note: this could be considered a sub-issue of issue #1116; i.e., if the feature requested in issue #1116 is implemented, then this issue would become obsolete.
Issue
For
bedtools intersect -sorted
operations using non-lexicographically sorted chromosomes, users are instructed to supply a genome file using the-g <genome_file>
option.Currently,
bedtools intersect
performs at least 2 checks on the genome file:For the purpose of
bedtools intersect
, however, the second column is irrelevant. Would it be possible to remove these 2 checks (and any additional irrelevant checks)?Minimum working example
Consider the following files:
Let's try running
bedtools intersect
with the-sorted
option:Output:
Now, we try to provide a 1-column file specifying our custom chromosome ordering:
bedtools intersect -a a.bed -b b.bed -sorted -g genome1.bed # bedtools intersect -a <(echo -e "chr1\t1\t10\n") -b <(echo -e "chr10\t1\t10\nchr1\t5\t15\n") -sorted -g <(echo -e "chr10\nchr1\n")
Output
Next, we try a 2-column genome file, with chromosome sizes set to 0
bedtools intersect -a a.bed -b b.bed -sorted -g genome2.bed # bedtools intersect -a <(echo -e "chr1\t1\t10\n") -b <(echo -e "chr10\t1\t10\nchr1\t5\t15\n") -sorted -g <(echo -e "chr10\t0\nchr1\t0\n")
Output
Finally, we try a 2-column genome file, with chromosome sizes set to 1
bedtools intersect -a a.bed -b b.bed -sorted -g genome3.bed # bedtools intersect -a <(echo -e "chr1\t1\t10\n") -b <(echo -e "chr10\t1\t10\nchr1\t5\t15\n") -sorted -g <(echo -e "chr10\t1\nchr1\t1\n")
Output
Note that the genome file used chromosome sizes incompatible with the actual BED output, but this is not checked by
bedtools
: the genome file said chromosome chr1 had length 1, yet the output contains a region from positions 5 to 10. Therefore, even in the current (v.2.31.1) implementation ofbedtools
, the 2nd column of the genome file is clearly irrelevant to the actual intersect operation.Real-world motivation
My motivating example was trying to filter out reads in a BAM file that overlapped specific regions. Both my BAM file and my mask BED file were pre-sorted according to an alphanumeric chromosome ordering (e.g.,
chr1
,chr2
, ...,chr10
,chr11
, ...).Consequently,
bedtools intersect -sorted
requires that I provide a genome file for non-lexicographically sorted chromosomes. Ideally, given that both inputs were already grouped by chromosome (and coordinate sorted within each chromosome group), there would be no need to provide a genome file (see issue #1116). But if that would be harder to implement, at least the requirement for a 2-column genome file could be eliminated; a simple 1-column text file specifying the chromosome order should suffice.The text was updated successfully, but these errors were encountered: