-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Preprocessing step for RSEM #7752
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@takutosato I made probably extremely overly nitpicky comments on this. It's unclear to me if this is supposed to be a general user tool or a specific wierd one off for an internal pipeline. If that's the case you could probably just tag it as @Experimental
or something and ignore most of my comments.
This makes me want to introduce a ReadNameGroupWalker that does the grouping in the background. SortedPairWalker? I'm not sure what I'd call it... Ted wrote a much more exciting PairWalker class but it doesn't deal with secondary reads in the way you need.
* | ||
* ### Task 2. Removing Reads ### | ||
* | ||
* If requested, this tool also removes duplicate marked reads and MT reads, which can skew gene expression counts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this set of comments. This says it can remove them and then caveat says it doesn't.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
contradiction removed.
* | ||
* | ||
* Caveat: This tool does not remove duplicate reads; it assumes it's been removed upstream e.g. | ||
* MarkDuplicates with R |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mark duplicates with R?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happened was I left IntelliJ to look up what the exact argument name was (it's REMOVE_DUPLICATES) and then got distracted and went on to other places in the code.
* | ||
* ### Task 2. Removing Reads ### | ||
* | ||
* If requested, this tool also removes duplicate marked reads and MT reads, which can skew gene expression counts. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be done by MarkDuplicates itself, (or a read filter) unless I misunderstand what you're doing. If the reads are aligned can't you use -XL to screen out the midochondrial reads? It's fine to bake these things into the tool to make it foolproof but I'm not clear why it's necessary to do specially.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the hardcoded list of mitochondrial transcripts below it seems like it would be better to pass in an exclusion file with -XL to remove mitocondria instead of hardcoding the list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed hardcoded MT contigs and created an interval list.
@CommandLineProgramProperties( | ||
summary = "", | ||
oneLineSummary = "", | ||
programGroup = ReadDataManipulationProgramGroup.class // Sato: Change to QC when the other PR is merged. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I commented in the other pr but this seems better to me than QC? This isn't really checking anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keeping ReadDataManipulation
@Argument(fullName = StandardArgumentDefinitions.OUTPUT_LONG_NAME, shortName = StandardArgumentDefinitions.OUTPUT_SHORT_NAME) | ||
public File outSam; | ||
|
||
@Argument(fullName = "keep-MT-reads") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think lowercase all our full names.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good to know, this option is now removed
r.getStart() == read1.getMateStart() && r.getMateStart() == read1.getStart()).findFirst(); | ||
if (read2.isPresent()){ | ||
result.add(new ImmutablePair<>(read1, read2.get())); | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to also warn on the case where the are multiple matchess?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
} | ||
} | ||
|
||
// Supplementary reads are not handled i.e. removed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would add a default read filter to remove supplementary reads up front. You can do that by overriding getDefaultReadFilters()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
throw new UserException("Read names do not match: " + this.queryName + " vs " + read.getName()); | ||
} | ||
|
||
if (isPrimaryAlignment(read) && read.isFirstOfPair()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these throw if first of pair is already set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup
// If either of the pair is unmapped, throw out. | ||
// With the STAR argument --quantTranscriptomeBan IndelSoftclipSingleend this should not occur, | ||
// but we check just to be thorough. | ||
if (read1.getContig() == null || read2.getContig() == null){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If either read1 or read2 is missing this will crash with a NPE. That's probably something that should be handled since it tends to happen. A crash is a fine answer but detect it and write a sane error message.
Since it seems like you do require both a first and second in pair, one way to simplify some things would be to push these checks into ReadFilters and then at this point you'll just check if there is a full set of reads. It would change the counting of the elimination reason though from being per fragment group to being by read which might not be what you want.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem to happen. But I'm going to invoke the @droazen rule and do the pragmatic thing, which is that if either read1 or read2 is null, we return false (so these reads will not be output).
final List<CigarElement> cigarElements1 = read1.getCigar().getCigarElements(); | ||
final List<CigarElement> cigarElements2 = read2.getCigar().getCigarElements(); | ||
|
||
if (cigarElements1.size() != 1 || cigarElements2.size() != 1){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might pull out a little function and call it twice instead of duplicating the checks.
@lbergelson thanks again for reviewing, back to you. |
|
||
@Override | ||
public List<ReadFilter> getDefaultReadFilters() { | ||
return Collections.singletonList(ReadFilterLibrary.NOT_SUPPLEMENTARY_ALIGNMENT); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may or may not want to include WELLFORMED
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea should be ok to add it but I'm a bit concerned about it having unintended consequences (e.g. read1 is tossed but it's mate isn't), so I want to leave it out for now.
"The primary firstOfPair is already set. Added read = " + read.getName()); | ||
this.firstOfPair = read; | ||
} else if (isPrimaryAlignment(read) && read.isSecondOfPair()) { | ||
this.secondOfPair = read; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Heh, do you also want to throw if second of pair is alreayd set?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@takutosato A few minor comments. Looks good though. I think it looks a lot cleaner than before. Feel free to merge when ready.
This tool will be used in the neovax project as a pre-processing step before running RSEM, the gene quantification tool, which has stringent requirements for the format of the input bam.