Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

making optimizations to funcotator #4740

Merged
merged 1 commit into from
May 9, 2018
Merged

Conversation

lbergelson
Copy link
Member

  • improving slow String handling
  • enabling feature caching

@droazen droazen self-requested a review May 7, 2018 15:24
@droazen droazen self-assigned this May 7, 2018
@droazen droazen changed the title making optimizations to functoator making optimizations to funcotator May 7, 2018
final int splitPoint = trimmedExtraField.indexOf(EXTRA_FIELD_KEY_VALUE_SPLITTER);
final String fieldName;
final String fieldValue;
if(splitPoint == -1 ) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need a space after if

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

final String fieldValue;
if(splitPoint == -1 ) {
throw new UserException.MalformedFile("Extraneous optional field data - not in a key/value pair: " + extraField);
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

else is unnecessary here since the if body throws. Once you get rid of else you can move the declarations for fieldName and fieldValue down here where they're assigned

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point. done

if(splitPoint == -1 ) {
throw new UserException.MalformedFile("Extraneous optional field data - not in a key/value pair: " + extraField);
} else {
fieldName = trimmedExtraField.substring(0,splitPoint).trim();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need space after ,

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added

Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First-pass review complete, back to @lbergelson

@@ -809,14 +809,15 @@ private String createProgramGroupID(final SAMFileHeader header) {
*/
protected FeatureInput<? extends Feature> addFeatureInputsAfterInitialization(final String filePath,
final String name,
final Class<? extends Feature> featureType) {
final Class<? extends Feature> featureType,
final int featureQueryLookahead) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add javadoc for method args

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -212,6 +212,7 @@
* The current version of {@link Funcotator}.
*/
public static final String VERSION = "0.0.2";
public static final int FEATURE_QUERY_LOOKAHEAD = 100000;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to reference one of the other predefined constants here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, do you want a different value for Gencode vs. other feature inputs? Should we expose a command-line arg at the GATKTool and/or VariantWalker level?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we might want different values for gencode... I'm not sure though. I think maybe we can leave that decision to the future?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a good option would be to put the look ahead value as a datasource property, that way each one could customize it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonn-smith Any thoughts on that?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now I've set it to use the same value of 100_000, but I'm referring to the value in VariantWalkerBase. It looks like all variant walker features, both driving and side inputs are set to 100k.

Copy link
Collaborator

@jonn-smith jonn-smith May 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbergelson

This seems like it should be a hidden / advanced command-line parameter on a data source type (Gencode / VCF / XSV / etc.).

I think that will be a good generalizable way to have them configurable.

I'm not sure I want to expose this to the config files - at some point users themselves are going to be poking around in the config files for the data sources, so I'd be worried that they would change things unknowingly and suffer for it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonn-smith I'm not sure I understand what you mean by making it a parameter on the datasource type. Would we have a different parameter for each datasource type? I'm going to merge this now and we can revisit if we need more granular control I think.

@@ -580,7 +581,8 @@ private void addFeaturesForLocatableDataSource( final Path dataSourceFile,
IOUtils.getPath( dataSourceProperties.getProperty(DataSourceUtils.CONFIG_FILE_FIELD_NAME_SRC_FILE) )
).toUri().toString(),
name,
featureClazz
featureClazz,
FEATURE_QUERY_LOOKAHEAD
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you confirm that Funcotator never "looks back" when querying its FeatureContext? If it does, the existing caching scheme might be non-ideal.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like it only ever calls it on the same location as the driving variant. It does some contig translation, but it should always translate the same way so I don't think that's a problem.

pbs = new PositionalBufferedStream(bufferedInputStream);
}
return new AsciiLineReaderIterator(new AsciiLineReader(pbs));
return new AsciiLineReaderIterator(AsciiLineReader.from(bufferedInputStream));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this method guaranteed to wrap the stream in a PositionalBufferedStream?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes


// The value of the field may be between two quotes.
// We remove them here.
final String rawFieldValue = trimmedExtraField.substring(splitPoint + 1, trimmedExtraField.length());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's more than 1 occurrence of the delimiter, the extra tokens will now all be concatenated together in rawFieldValue, whereas before they would be ignored. What's the desired behavior here? Should multiple delimiters be an error condition?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jonn-smith Can you answer the case of what should happen if there are multiple delimiters here? Previous behavior was that everything after the second element in the split array would be ignored. It seems like maybe it should be an error condition to have >2 elements though?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm assuming that's an error case and added a check for now

Copy link
Collaborator

@jonn-smith jonn-smith May 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbergelson I think that's fine for now. I don't remember if I ran into an issue that required that I ignore the second delimiter on.

// The value of the field may be between two quotes.
// We remove them here.
final String rawFieldValue = trimmedExtraField.substring(splitPoint + 1, trimmedExtraField.length());
fieldValue = StringUtils.remove(rawFieldValue.trim(), '"');
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we assume that the quotes are always at the beginning/end? If yes, we could make this even faster using a substring() call.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know... It would make sense that they would be. @jonn-smith?

Copy link
Collaborator

@jonn-smith jonn-smith May 9, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbergelson - I think we can assume that quotes would wrap the strings.

That said, we will need to check to see if quotes are in the string rather than blindly substringing them.

@@ -211,6 +211,20 @@ public void metaTestEnsureTempDirs() {
Assert.assertEquals(doDebugTests, false);
}

@Test
public void testLocalThing(){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need a better name...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removing this.. didn't mean to include it

@droazen
Copy link
Contributor

droazen commented May 8, 2018

@lbergelson Can you add a summary of the performance improvements (eg., runtime % speedup) introduced in this branch as a comment to the master Funcotator performance ticket (#4586)?

@lbergelson lbergelson force-pushed the lb_funcotator_optimizations branch from ff3f50c to 27d121b Compare May 8, 2018 18:02
@lbergelson
Copy link
Member Author

@droazen Running a larger sample to have a better number.

@codecov-io
Copy link

codecov-io commented May 8, 2018

Codecov Report

Merging #4740 into master will increase coverage by 0.005%.
The diff coverage is 69.231%.

@@              Coverage Diff               @@
##              master    #4740       +/-   ##
==============================================
+ Coverage     79.985%   79.99%   +0.005%     
- Complexity     17419    17421        +2     
==============================================
  Files           1081     1081               
  Lines          63118    63119        +1     
  Branches       10195    10196        +1     
==============================================
+ Hits           50485    50489        +4     
+ Misses          8648     8645        -3     
  Partials        3985     3985
Impacted Files Coverage Δ Complexity Δ
...titute/hellbender/tools/funcotator/Funcotator.java 76.351% <ø> (ø) 30 <0> (ø) ⬇️
...org/broadinstitute/hellbender/engine/GATKTool.java 91.304% <100%> (ø) 93 <1> (ø) ⬇️
...llbender/utils/codecs/gencode/GencodeGtfCodec.java 55.556% <100%> (+0.271%) 49 <1> (ø) ⬇️
...bender/utils/codecs/gencode/GencodeGtfFeature.java 78.453% <63.636%> (-0.211%) 44 <0> (ø)
...utils/smithwaterman/SmithWatermanIntelAligner.java 80% <0%> (+30%) 3% <0%> (+2%) ⬆️

@lbergelson
Copy link
Member Author

Original (46a8661): 31.1 minutes, 1658 variants / minute
Changes (#4740): 1.26 minutes, 40933 variants / minute

~24.6x speedup

@lbergelson
Copy link
Member Author

@droazen Any further comments?

@jonn-smith
Copy link
Collaborator

@lbergelson - might make sense to update the Funcotator version from 0.0.2 to 0.0.3 since this is a big improvement.

Copy link
Contributor

@droazen droazen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One last comment, then this should be good to merge

}

// Each optional field is in a key/value pair:
final String fieldName = fieldParts[EXTRA_FIELD_KEY_INDEX].trim();
final String fieldName = trimmedExtraField.substring(0, splitPoint).trim();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't you end up with an empty fieldName/fieldValue after parsing? Should that be an error condition?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After discussing in person with @lbergelson, we decided that we don't know enough about the data to add extra safety checks here. If leading whitespace is typical/common rather than an error condition, we don't want to overly-strict extra checks for empty keys.

* improving slow String handling
* enabling feature caching
@lbergelson lbergelson force-pushed the lb_funcotator_optimizations branch from 27d121b to 0b38ff8 Compare May 9, 2018 16:04
@lbergelson lbergelson merged commit 869292b into master May 9, 2018
@lbergelson lbergelson deleted the lb_funcotator_optimizations branch May 9, 2018 17:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants