Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: only check first column of genome file for -g option of bedtools intersect #1117

Open
bentyeh opened this issue Feb 8, 2025 · 0 comments

Comments

@bentyeh
Copy link

bentyeh commented Feb 8, 2025

Note: this could be considered a sub-issue of issue #1116; i.e., if the feature requested in issue #1116 is implemented, then this issue would become obsolete.


Issue

For bedtools intersect -sorted operations using non-lexicographically sorted chromosomes, users are instructed to supply a genome file using the -g <genome_file> option.

Currently, bedtools intersect performs at least 2 checks on the genome file:

  1. it must have 2 columns (tab-delimited)
  2. the second column (representing chromosome length) must be non-negative

For the purpose of bedtools intersect, however, the second column is irrelevant. Would it be possible to remove these 2 checks (and any additional irrelevant checks)?

Minimum working example

Consider the following files:

  • a.bed
    chr1	1	10
    
  • b.bed
    chr10	1	10
    chr1	5	15
    
  • genome1.bed
    chr10
    chr1
    
  • genome2.bed
    chr10	0
    chr1	0
    
  • genome3.bed
    chr10	1
    chr1	1
    

Let's try running bedtools intersect with the -sorted option:

bedtools intersect -a a.bed -b b.bed -sorted

# or an equivalent one-liner without creating a.bed and b.bed
# bedtools intersect -a <(echo -e "chr1\t1\t10\n") -b <(echo -e "chr10\t1\t10\nchr1\t5\t15\n") -sorted

Output:

ERROR: Database file b.bed contains chromosome chr10, but the query file does not.
       Please rerun with the -g option for a genome file.
       See documentation for details.

Now, we try to provide a 1-column file specifying our custom chromosome ordering:

bedtools intersect -a a.bed -b b.bed -sorted -g genome1.bed

# bedtools intersect -a <(echo -e "chr1\t1\t10\n") -b <(echo -e "chr10\t1\t10\nchr1\t5\t15\n") -sorted -g <(echo -e "chr10\nchr1\n")

Output

Error: The genome file genome1.bed has no valid entries (are you sure it's a 2-column bedtools genome file). Exiting.

Next, we try a 2-column genome file, with chromosome sizes set to 0

bedtools intersect -a a.bed -b b.bed -sorted -g genome2.bed

# bedtools intersect -a <(echo -e "chr1\t1\t10\n") -b <(echo -e "chr10\t1\t10\nchr1\t5\t15\n") -sorted -g <(echo -e "chr10\t0\nchr1\t0\n")

Output

Error: chr10 has length equal to 0. Each chromosome must have non-zero length. Exiting.

Finally, we try a 2-column genome file, with chromosome sizes set to 1

bedtools intersect -a a.bed -b b.bed -sorted -g genome3.bed

# bedtools intersect -a <(echo -e "chr1\t1\t10\n") -b <(echo -e "chr10\t1\t10\nchr1\t5\t15\n") -sorted -g <(echo -e "chr10\t1\nchr1\t1\n")

Output

chr1	5	10

Note that the genome file used chromosome sizes incompatible with the actual BED output, but this is not checked by bedtools: the genome file said chromosome chr1 had length 1, yet the output contains a region from positions 5 to 10. Therefore, even in the current (v.2.31.1) implementation of bedtools, the 2nd column of the genome file is clearly irrelevant to the actual intersect operation.

Real-world motivation

My motivating example was trying to filter out reads in a BAM file that overlapped specific regions. Both my BAM file and my mask BED file were pre-sorted according to an alphanumeric chromosome ordering (e.g., chr1, chr2, ..., chr10, chr11, ...).

Consequently, bedtools intersect -sorted requires that I provide a genome file for non-lexicographically sorted chromosomes. Ideally, given that both inputs were already grouped by chromosome (and coordinate sorted within each chromosome group), there would be no need to provide a genome file (see issue #1116). But if that would be harder to implement, at least the requirement for a 2-column genome file could be eliminated; a simple 1-column text file specifying the chromosome order should suffice.

bentyeh added a commit to GuttmanLab/chipdip-pipeline that referenced this issue Feb 8, 2025
…st mask

Background
- To reduce memory usage when applying the blacklist mask (rule repeat_mask), the option -sorted was added to the bedtools intersect command in commit 2703961.
- bedtools intersect with the -sorted option requires that all input files be position-sorted.

Problem: the input files were sorted, but not in the way bedtools intersect expected them to be
- Input 1 = chromatin alignment BAM file
  - This is produced by Bowtie 2 followed by samtools sort (rule bowtie2_align), then the chromosomes are renamed in the rule rename_and_filter_chr.
    - The chromosome renaming step is essential to ensure that the chromosome names in the BAM file (which originally come from the Bowtie 2 index) match the chromosome names in the blacklist mask file.
    - This BAM file is coordinate sorted using the order of chromosomes in the BAM header, which uses the order of chromosomes in the chromosome name map file. Importantly, this order is user-defined and is not necessarily lexicographically ordered!
- Input 2 = blacklist mask file
  - In the original upstream merge_mask rule, this mask was coordinate sorted using sort -k1,1 -k2,2n, which will lexicographically order the chromosome names.

Fix: The fix consisted of 2 changes
1. Use the same chromosome sorting order for the BAM alignment file as the mask. Specifically, sort the mask file using the order of chromosomes in the chromosome name map file.
2. bedtools intersect with the -sorted flag alone appears to expect chromosomes sorted lexicographically. Consequently, a "genome file" (bedtools terminology) needs to be passed using the -g option that specifies the order of chromosomes desired from the chromosome name map. See also issues 1116 and 1117 on the bedtools GitHub repository (arq5x/bedtools2#1116, arq5x/bedtools2#1117).

Other changes
- Removed documentation in README.md and code comments regarding requirement for Python 3.9+.
  - Commit 85f3574 removed the use of the Python 3.9+-specific dict | dict operation in rename_and_filter_chr.py:reheader()
- helpers.py:parse_chrom_map() now uses file_open() instead of open() for reading a chromosome name map file
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant