Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

A new tool to transfer RX tag from an unaligned bam. #7739

Merged
merged 11 commits into from
Apr 7, 2022
Merged

Conversation

takutosato
Copy link
Contributor

Some read tags get lost when we convert SAM to fastq. This tool allows us to get those tags back once we are done processing the fastqs (some tools e.g. adapter clippers cannot take SAMs as input so the conversion is unavoidable.) So this tool works like Picard MergeBamAlignment, except that we are putting the tags from the unaligned bam to the aligned bam, rather than adding alignment info to the unaligned bam. We will use this in our new TCap RNA pipeline.

@takutosato
Copy link
Contributor Author

@droazen, @lbergelson I have the following argument to the tool:

@Argument(fullName = "read-tags", doc = "read tag names to recover")
public List<String> readTags = DEFAULT_READ_TAGS;

On the command line I want to say

java -jar gatk.jar …
...
--read-tags RX

and want readTags to be a singleton list. But in my test it's not parsing the arguments correctly---what am I doing wrong here?

SAMFileGATKReadWriter writer;

final static String UMI_TAG = "RX";
final static List<String> DEFAULT_READ_TAGS = Arrays.asList(UMI_TAG);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
final static List<String> DEFAULT_READ_TAGS = Arrays.asList(UMI_TAG);
final static List<String> DEFAULT_READ_TAGS = new ArrayList() {{
add(UMI_TAG);
}};

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@takutosato The list returned by Arrays.asList() is immutable, so the parser can't modify the DEFAULT_READ_TAGS list. See the suggested alternative above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was getting errors (warnings, really) related to serializable UID stuff, which I don't understand well. The simpler approach seems to be to just have an empty list be the default, so that's what I did in the latest commit.


@Test
public void test(){
final String home = "/Volumes/dsde_working/tsato/hydro.gen/Analysis/874_twist_RNA/add_umi/";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I'm here, I noticed these local file refs...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted

@gatk-bot
Copy link

gatk-bot commented Mar 28, 2022

Travis reported job failures from build 38286
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 38286.12 logs
integration openjdk8 38286.2 logs

@takutosato
Copy link
Contributor Author

@cmnbroad thanks for your review. I changed the default behavior to work around the issue and now the test is passing.

@takutosato
Copy link
Contributor Author

@cmnbroad is my workaround satisfactory or should I try to figure out the serial version uid issue?

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few changes but for a new tool that is marked as beta/experimental it should be alright to merge. there are some tests and edge cases you might want to consider with your traversal though. You seem to have put a bunch of thought into iterating over sample1 assuming that sample2 might have a bunch of extra reads in it (not vice versa though) and my question is why?

@CommandLineProgramProperties(
summary = "Incorporate read tags in a SAM file to that of a matching SAM file",
oneLineSummary = "Incorporate read tags in a SAM file to that of a matching SAM file",
programGroup = QCProgramGroup.class // sato: change
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

oneLineSummary = "Incorporate read tags in a SAM file to that of a matching SAM file",
programGroup = QCProgramGroup.class // sato: change
)
public class TransferReadTags extends GATKTool {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given that this is a new tool you might consider marking this as beta or experimental.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

documentedfeature

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added Experimental and DocumentedFeature.

import java.util.List;

/**
* This tool is designed for a pair of SAM files sharing the same read names (e.g.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

designed for what on that pair?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

elaborated.

* due to the fact that some tools, such as the adapter clipping tools, are written for fastq only,
* while others are written specifically for SAM files.
*
* This tools behaves similarly to Picard MergeBamAlignment (MBA). The difference is that whereas
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be a doc link with {@link ....MergeBamAlignment}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

public List<String> readTags = new ArrayList<>();


PeekableIterator<GATKRead> alignedSamIterator;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

all of these should be private

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

public File outSamFile;
SAMFileGATKReadWriter writer;

@Argument(fullName = "read-tags", doc = "read tag names to transfer")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a note here that this will explode if any read lacks the requested tag (or change that behavior)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think exiting (with a user exception) when the tag is not found is OK. Updated the doc.


@Test
public void test() {
final File alignedSAMFile = createTempFile("sam1", ".bam");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional but I would consider pulling out all of this sam construction code DataProvider that just returns paths to the two bams and a list of tags that were artificially covered. This would make it easy to test other edge cases

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Speaking of edge casees. I would recommend adding a test that asserts what the behavior is if both samples have the tag present. Should it overwrite the results? I believe that is what currently happens.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done dataProvider

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done test with the case where the tag is already present.

@@ -56,6 +56,9 @@ public static String forumPost(String post) {
public final static String DOC_CAT_METHYLATION_DISCOVERY = "Methylation-Specific Tools";
public final static String DOC_CAT_METHYLATION_DISCOVERY_SUMMARY = "Tools that perform methylation calling, processing bisulfite sequenced, methylation-aware aligned BAM";

public final static String DOC_CAT_QC = "QC Tools";
public final static String DOC_CAT_QC_SUMMARY = "Tools that perform QC-related or other data processing tasks";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think this tool fits under the banner of ReadDataManipulationProgramGroup or under Diagnostics and Quality Control which already exist as groups rather than making a new bespoke command line grouping for this one tool.

Copy link
Contributor Author

@takutosato takutosato Apr 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted, using Diagnostics and QC ReadDataManipulation.

programGroup = QCProgramGroup.class // sato: change
)
public class TransferReadTags extends GATKTool {
final static String UMI_TAG_NAME = "RX";
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deleted

// will need to use this later
while (alignedSamIterator.hasNext()){
currentRead1 = alignedSamIterator.next();
int diff = queryNameComparator.compareReadNames(currentRead1, currentRead2);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little worried about what happens when this QueryNameComparitor runs up against samtools sorted reads (which sorts the reads not lexicographicallly but lexicographically with the tile/x/y coordinates numerically sorted which we don't do by default) it seems like your traversal of pull on the target bam, then pull on the unmapped bam until you reach might run into trouble if your bam uses the samtools sorting for example on these reads: XXXXXXX:2:130:1000 followed by XXXXXXX:2:1200:1000 (which is still legal by the other style of queryname sorting). After processing the first read (at 2:130) and then loading currentRead1 with XXXXXXX:2:1200:1000 and currentRead2 will still be loaded with the last position XXXXXXX:2:130:1000. When you compute the diff with QueryNameComparitor itis going to consider XXXXXXX:2:130:1000 > XXXXXXX:2:1200:1000 and which i think will fall into the first IllegalStateException in this method.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me ask you. Is there a reason you cant just pull on both iterators in tandem and blow up if there is ever a mismatch between the reads and samflags? Do you expect the unmapped bam to have more reads in it for some reason? I might consider reading AbstractAlignmentMerger.mergeAlignment() in picard to see how they handle iterating the two lists in tandem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent point. STAR aligner drops some reads when it creates the transcriptome-aligned bam, so I cannot assume that there are an equal number of reads in both aligned and unaligned. I need to think more carefully about the case you brought up though.

Copy link
Contributor Author

@takutosato takutosato Apr 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK phew it look like samtools query-name sorts the way you described, and Picard SortSam (which is what we are using in the RNA pipeline) does the string compare (see SAMRecordQueryNameComparator). So the code as is should work; how about we add the samtools version of comparator after this week's release (since we don't need it for this pipeline if not ever)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might or might not be an issue? we don't support writing bams labeled as queryname ordered in the samtools sense but we do support reading them? That will mean you get cryptic errors here if you aren't careful. Its also somewhat of an edge cases that likely won't show up in your pipelines

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't be an issue in my pipeline. But I updated the error message to make it less cryptic.

PeekableIterator<GATKRead> alignedSamIterator;
PeekableIterator<GATKRead> unmappedSamIterator;

GATKRead currentRead1;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename these to something more informative like "targetRead" and/or "annotationRead"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Done.

import org.broadinstitute.barclay.argparser.CommandLineProgramGroup;
import org.broadinstitute.hellbender.utils.help.HelpConstants;

public class QCProgramGroup implements CommandLineProgramGroup {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QC seems like a weird name for these. I would either put it either in ReadDataManipulation or DiagnosticsandQC. Alternatively, this seems like maybe a very specific tool for a certain pipeline, so maybe something like RNAReadData or something like that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I think ReadDataManipulation makes sense the most.

@takutosato
Copy link
Contributor Author

@jamesemery thanks for reviewing. back to you

@gatk-bot
Copy link

gatk-bot commented Apr 6, 2022

Travis reported job failures from build 38522
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 38522.12 logs

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two very minor documentation comments and then I would say this is good to merge.

* base and adds the read tags from the unaligned bam to the aligned bam.
*
* Currently, the tool is implemented for the specific case of transfering UMI read tags (RX)
* from an unaligned bam.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add an example usage to this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

writer.addRead(updatedRead);
break;
} else {
throw new IllegalStateException("Aligned read is lexicographically smaller than the unmapped read. This tool assumes " +
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would say the regular case here is that a read is missing from the unmapped bam file and I wold probably make that the primary takeaway (with a note that thsi could also be caused by sorting woes).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reworded

@takutosato takutosato merged commit 2b0a558 into master Apr 7, 2022
@takutosato takutosato deleted the ts_add_tag branch April 7, 2022 21:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants