-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
A new tool to transfer RX tag from an unaligned bam. #7739
Conversation
@droazen, @lbergelson I have the following argument to the tool: @Argument(fullName = "read-tags", doc = "read tag names to recover")
public List<String> readTags = DEFAULT_READ_TAGS; On the command line I want to say
and want readTags to be a singleton list. But in my test it's not parsing the arguments correctly---what am I doing wrong here? |
SAMFileGATKReadWriter writer; | ||
|
||
final static String UMI_TAG = "RX"; | ||
final static List<String> DEFAULT_READ_TAGS = Arrays.asList(UMI_TAG); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
final static List<String> DEFAULT_READ_TAGS = Arrays.asList(UMI_TAG); | |
final static List<String> DEFAULT_READ_TAGS = new ArrayList() {{ | |
add(UMI_TAG); | |
}}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@takutosato The list returned by Arrays.asList() is immutable, so the parser can't modify the DEFAULT_READ_TAGS
list. See the suggested alternative above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was getting errors (warnings, really) related to serializable UID stuff, which I don't understand well. The simpler approach seems to be to just have an empty list be the default, so that's what I did in the latest commit.
|
||
@Test | ||
public void test(){ | ||
final String home = "/Volumes/dsde_working/tsato/hydro.gen/Analysis/874_twist_RNA/add_umi/"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While I'm here, I noticed these local file refs...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deleted
@cmnbroad thanks for your review. I changed the default behavior to work around the issue and now the test is passing. |
@cmnbroad is my workaround satisfactory or should I try to figure out the serial version uid issue? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few changes but for a new tool that is marked as beta/experimental it should be alright to merge. there are some tests and edge cases you might want to consider with your traversal though. You seem to have put a bunch of thought into iterating over sample1 assuming that sample2 might have a bunch of extra reads in it (not vice versa though) and my question is why?
@CommandLineProgramProperties( | ||
summary = "Incorporate read tags in a SAM file to that of a matching SAM file", | ||
oneLineSummary = "Incorporate read tags in a SAM file to that of a matching SAM file", | ||
programGroup = QCProgramGroup.class // sato: change |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
todo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
oneLineSummary = "Incorporate read tags in a SAM file to that of a matching SAM file", | ||
programGroup = QCProgramGroup.class // sato: change | ||
) | ||
public class TransferReadTags extends GATKTool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
given that this is a new tool you might consider marking this as beta or experimental.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
documentedfeature
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added Experimental and DocumentedFeature.
import java.util.List; | ||
|
||
/** | ||
* This tool is designed for a pair of SAM files sharing the same read names (e.g. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
designed for what on that pair?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
elaborated.
* due to the fact that some tools, such as the adapter clipping tools, are written for fastq only, | ||
* while others are written specifically for SAM files. | ||
* | ||
* This tools behaves similarly to Picard MergeBamAlignment (MBA). The difference is that whereas |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be a doc link with {@link ....MergeBamAlignment}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
public List<String> readTags = new ArrayList<>(); | ||
|
||
|
||
PeekableIterator<GATKRead> alignedSamIterator; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
all of these should be private
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
public File outSamFile; | ||
SAMFileGATKReadWriter writer; | ||
|
||
@Argument(fullName = "read-tags", doc = "read tag names to transfer") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a note here that this will explode if any read lacks the requested tag (or change that behavior)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think exiting (with a user exception) when the tag is not found is OK. Updated the doc.
|
||
@Test | ||
public void test() { | ||
final File alignedSAMFile = createTempFile("sam1", ".bam"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optional but I would consider pulling out all of this sam construction code DataProvider that just returns paths to the two bams and a list of tags that were artificially covered. This would make it easy to test other edge cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Speaking of edge casees. I would recommend adding a test that asserts what the behavior is if both samples have the tag present. Should it overwrite the results? I believe that is what currently happens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done dataProvider
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done test with the case where the tag is already present.
@@ -56,6 +56,9 @@ public static String forumPost(String post) { | |||
public final static String DOC_CAT_METHYLATION_DISCOVERY = "Methylation-Specific Tools"; | |||
public final static String DOC_CAT_METHYLATION_DISCOVERY_SUMMARY = "Tools that perform methylation calling, processing bisulfite sequenced, methylation-aware aligned BAM"; | |||
|
|||
public final static String DOC_CAT_QC = "QC Tools"; | |||
public final static String DOC_CAT_QC_SUMMARY = "Tools that perform QC-related or other data processing tasks"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually think this tool fits under the banner of ReadDataManipulationProgramGroup
or under Diagnostics and Quality Control
which already exist as groups rather than making a new bespoke command line grouping for this one tool.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deleted, using Diagnostics and QC ReadDataManipulation.
programGroup = QCProgramGroup.class // sato: change | ||
) | ||
public class TransferReadTags extends GATKTool { | ||
final static String UMI_TAG_NAME = "RX"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deleted
// will need to use this later | ||
while (alignedSamIterator.hasNext()){ | ||
currentRead1 = alignedSamIterator.next(); | ||
int diff = queryNameComparator.compareReadNames(currentRead1, currentRead2); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a little worried about what happens when this QueryNameComparitor runs up against samtools sorted reads (which sorts the reads not lexicographicallly but lexicographically with the tile/x/y coordinates numerically sorted which we don't do by default) it seems like your traversal of pull on the target bam, then pull on the unmapped bam until you reach might run into trouble if your bam uses the samtools sorting for example on these reads: XXXXXXX:2:130:1000
followed by XXXXXXX:2:1200:1000
(which is still legal by the other style of queryname sorting). After processing the first read (at 2:130) and then loading currentRead1 with XXXXXXX:2:1200:1000
and currentRead2 will still be loaded with the last position XXXXXXX:2:130:1000
. When you compute the diff with QueryNameComparitor itis going to consider XXXXXXX:2:130:1000
> XXXXXXX:2:1200:1000
and which i think will fall into the first IllegalStateException in this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me ask you. Is there a reason you cant just pull on both iterators in tandem and blow up if there is ever a mismatch between the reads and samflags? Do you expect the unmapped bam to have more reads in it for some reason? I might consider reading AbstractAlignmentMerger.mergeAlignment()
in picard to see how they handle iterating the two lists in tandem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent point. STAR aligner drops some reads when it creates the transcriptome-aligned bam, so I cannot assume that there are an equal number of reads in both aligned and unaligned. I need to think more carefully about the case you brought up though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK phew it look like samtools query-name sorts the way you described, and Picard SortSam (which is what we are using in the RNA pipeline) does the string compare (see SAMRecordQueryNameComparator
). So the code as is should work; how about we add the samtools version of comparator after this week's release (since we don't need it for this pipeline if not ever)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This might or might not be an issue? we don't support writing bams labeled as queryname ordered in the samtools sense but we do support reading them? That will mean you get cryptic errors here if you aren't careful. Its also somewhat of an edge cases that likely won't show up in your pipelines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't be an issue in my pipeline. But I updated the error message to make it less cryptic.
PeekableIterator<GATKRead> alignedSamIterator; | ||
PeekableIterator<GATKRead> unmappedSamIterator; | ||
|
||
GATKRead currentRead1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename these to something more informative like "targetRead" and/or "annotationRead"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Done.
import org.broadinstitute.barclay.argparser.CommandLineProgramGroup; | ||
import org.broadinstitute.hellbender.utils.help.HelpConstants; | ||
|
||
public class QCProgramGroup implements CommandLineProgramGroup { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
QC seems like a weird name for these. I would either put it either in ReadDataManipulation
or DiagnosticsandQC
. Alternatively, this seems like maybe a very specific tool for a certain pipeline, so maybe something like RNAReadData
or something like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, I think ReadDataManipulation makes sense the most.
@jamesemery thanks for reviewing. back to you |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two very minor documentation comments and then I would say this is good to merge.
* base and adds the read tags from the unaligned bam to the aligned bam. | ||
* | ||
* Currently, the tool is implemented for the specific case of transfering UMI read tags (RX) | ||
* from an unaligned bam. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add an example usage to this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
writer.addRead(updatedRead); | ||
break; | ||
} else { | ||
throw new IllegalStateException("Aligned read is lexicographically smaller than the unmapped read. This tool assumes " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say the regular case here is that a read is missing from the unmapped bam file and I wold probably make that the primary takeaway (with a note that thsi could also be caused by sorting woes).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reworded
Some read tags get lost when we convert SAM to fastq. This tool allows us to get those tags back once we are done processing the fastqs (some tools e.g. adapter clippers cannot take SAMs as input so the conversion is unavoidable.) So this tool works like Picard MergeBamAlignment, except that we are putting the tags from the unaligned bam to the aligned bam, rather than adding alignment info to the unaligned bam. We will use this in our new TCap RNA pipeline.