-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
making optimizations to funcotator #4740
Conversation
lbergelson
commented
May 7, 2018
- improving slow String handling
- enabling feature caching
final int splitPoint = trimmedExtraField.indexOf(EXTRA_FIELD_KEY_VALUE_SPLITTER); | ||
final String fieldName; | ||
final String fieldValue; | ||
if(splitPoint == -1 ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need a space after if
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
final String fieldValue; | ||
if(splitPoint == -1 ) { | ||
throw new UserException.MalformedFile("Extraneous optional field data - not in a key/value pair: " + extraField); | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
else
is unnecessary here since the if
body throws. Once you get rid of else
you can move the declarations for fieldName
and fieldValue
down here where they're assigned
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point. done
if(splitPoint == -1 ) { | ||
throw new UserException.MalformedFile("Extraneous optional field data - not in a key/value pair: " + extraField); | ||
} else { | ||
fieldName = trimmedExtraField.substring(0,splitPoint).trim(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need space after ,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First-pass review complete, back to @lbergelson
@@ -809,14 +809,15 @@ private String createProgramGroupID(final SAMFileHeader header) { | |||
*/ | |||
protected FeatureInput<? extends Feature> addFeatureInputsAfterInitialization(final String filePath, | |||
final String name, | |||
final Class<? extends Feature> featureType) { | |||
final Class<? extends Feature> featureType, | |||
final int featureQueryLookahead) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add javadoc for method args
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -212,6 +212,7 @@ | |||
* The current version of {@link Funcotator}. | |||
*/ | |||
public static final String VERSION = "0.0.2"; | |||
public static final int FEATURE_QUERY_LOOKAHEAD = 100000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you want to reference one of the other predefined constants here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, do you want a different value for Gencode vs. other feature inputs? Should we expose a command-line arg at the GATKTool
and/or VariantWalker
level?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we might want different values for gencode... I'm not sure though. I think maybe we can leave that decision to the future?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a good option would be to put the look ahead value as a datasource property, that way each one could customize it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jonn-smith Any thoughts on that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now I've set it to use the same value of 100_000, but I'm referring to the value in VariantWalkerBase
. It looks like all variant walker features, both driving and side inputs are set to 100k.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like it should be a hidden / advanced command-line parameter on a data source type (Gencode / VCF / XSV / etc.).
I think that will be a good generalizable way to have them configurable.
I'm not sure I want to expose this to the config files - at some point users themselves are going to be poking around in the config files for the data sources, so I'd be worried that they would change things unknowingly and suffer for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jonn-smith I'm not sure I understand what you mean by making it a parameter on the datasource type. Would we have a different parameter for each datasource type? I'm going to merge this now and we can revisit if we need more granular control I think.
@@ -580,7 +581,8 @@ private void addFeaturesForLocatableDataSource( final Path dataSourceFile, | |||
IOUtils.getPath( dataSourceProperties.getProperty(DataSourceUtils.CONFIG_FILE_FIELD_NAME_SRC_FILE) ) | |||
).toUri().toString(), | |||
name, | |||
featureClazz | |||
featureClazz, | |||
FEATURE_QUERY_LOOKAHEAD |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you confirm that Funcotator never "looks back" when querying its FeatureContext
? If it does, the existing caching scheme might be non-ideal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like it only ever calls it on the same location as the driving variant. It does some contig translation, but it should always translate the same way so I don't think that's a problem.
pbs = new PositionalBufferedStream(bufferedInputStream); | ||
} | ||
return new AsciiLineReaderIterator(new AsciiLineReader(pbs)); | ||
return new AsciiLineReaderIterator(AsciiLineReader.from(bufferedInputStream)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this method guaranteed to wrap the stream in a PositionalBufferedStream
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes
|
||
// The value of the field may be between two quotes. | ||
// We remove them here. | ||
final String rawFieldValue = trimmedExtraField.substring(splitPoint + 1, trimmedExtraField.length()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there's more than 1 occurrence of the delimiter, the extra tokens will now all be concatenated together in rawFieldValue
, whereas before they would be ignored. What's the desired behavior here? Should multiple delimiters be an error condition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jonn-smith Can you answer the case of what should happen if there are multiple delimiters here? Previous behavior was that everything after the second element in the split array would be ignored. It seems like maybe it should be an error condition to have >2 elements though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming that's an error case and added a check for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbergelson I think that's fine for now. I don't remember if I ran into an issue that required that I ignore the second delimiter on.
// The value of the field may be between two quotes. | ||
// We remove them here. | ||
final String rawFieldValue = trimmedExtraField.substring(splitPoint + 1, trimmedExtraField.length()); | ||
fieldValue = StringUtils.remove(rawFieldValue.trim(), '"'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we assume that the quotes are always at the beginning/end? If yes, we could make this even faster using a substring()
call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know... It would make sense that they would be. @jonn-smith?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lbergelson - I think we can assume that quotes would wrap the strings.
That said, we will need to check to see if quotes are in the string rather than blindly substringing them.
@@ -211,6 +211,20 @@ public void metaTestEnsureTempDirs() { | |||
Assert.assertEquals(doDebugTests, false); | |||
} | |||
|
|||
@Test | |||
public void testLocalThing(){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need a better name...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removing this.. didn't mean to include it
@lbergelson Can you add a summary of the performance improvements (eg., runtime % speedup) introduced in this branch as a comment to the master Funcotator performance ticket (#4586)? |
ff3f50c
to
27d121b
Compare
@droazen Running a larger sample to have a better number. |
Codecov Report
@@ Coverage Diff @@
## master #4740 +/- ##
==============================================
+ Coverage 79.985% 79.99% +0.005%
- Complexity 17419 17421 +2
==============================================
Files 1081 1081
Lines 63118 63119 +1
Branches 10195 10196 +1
==============================================
+ Hits 50485 50489 +4
+ Misses 8648 8645 -3
Partials 3985 3985
|
@droazen Any further comments? |
@lbergelson - might make sense to update the Funcotator version from |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One last comment, then this should be good to merge
} | ||
|
||
// Each optional field is in a key/value pair: | ||
final String fieldName = fieldParts[EXTRA_FIELD_KEY_INDEX].trim(); | ||
final String fieldName = trimmedExtraField.substring(0, splitPoint).trim(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't you end up with an empty fieldName
/fieldValue
after parsing? Should that be an error condition?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After discussing in person with @lbergelson, we decided that we don't know enough about the data to add extra safety checks here. If leading whitespace is typical/common rather than an error condition, we don't want to overly-strict extra checks for empty keys.
* improving slow String handling * enabling feature caching
27d121b
to
0b38ff8
Compare