-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Student's t-test aggregation support #54469
Conversation
Adds t_test metric aggregation that can perform paired and unpaired two-sample t-tests. In this PR support for filters in unpaired is still missing. It will be added in a follow-up PR. Relates to elastic#53692
Pinging @elastic/es-analytics-geo (:Analytics/Aggregations) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a few comments, this looks great! <3
|
||
A `t_test` metrics aggregation that performs a statistical hypothesis test in which the test statistic follows a Student's t-distribution | ||
under the null hypothesis on numeric values extracted from the aggregated documents or generated by provided scripts. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should perhaps add a layman explanation afterwards? Maybe something like "In practice, this will tell you if the difference between two population means are statistically significant", or something like that?
|
||
"aggregations": { | ||
"startup_time_ttest": { | ||
"value": 0.1914368843365979 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we state what "value" is? E.g. it's the p-value?
return new UnpairedTTestAggregator(name, numericMultiVS, tails, false, format, searchContext, parent, pipelineAggregators, | ||
metadata); | ||
default: | ||
throw new UnsupportedOperationException("Unsupported t-test type " + testType); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, do we know if UnsupportedOpException is a 4xx or 5xx? If it's a 5xx, perhaps we should change this to an IllegalArgumentException?
|
||
public class PairedTTestState implements TTestState { | ||
|
||
public static final String NAME = "P"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
:D
states.forEach(tTestState -> { | ||
PairedTTestState state = (PairedTTestState) tTestState; | ||
reducer.accept(state.stats); | ||
assert state.tails == tails; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible for this to ever not match in practice (I mean, I know it shouldn't hence the assertion, but...)? I wonder if we should actually throw an exception rather than return a really incorrect result if we ever get this messed up?
return new LeafBucketCollectorBase(sub, docAValues) { | ||
@Override | ||
public void collect(int doc, long bucket) throws IOException { | ||
statsBuilder.grow(bigArrays, bucket + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can move this inside the conditionals below? Right now we'll end up growing the statsBuilder bigarray even if none of the documents end up satisfying the conditions (e.g. if we are unlucky and they all have only one of the two fields).
The size provided to grow()
is the min size required, so it's ok to not call grow until we actually need a particular bucket ordinal (and then it will back-fill all the empty bucket ords essentially)
} | ||
|
||
public void grow(BigArrays bigArrays, long buckets) { | ||
counts = bigArrays.grow(counts, buckets); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TBH, I'm not sure how expensive grow()
is... I don't think it's very expensive so it might not matter. But some aggs that have a lot of big arrays to manage will call BigArrays#overSize()
method directly and then resize each of their arrays, instead of grow'ing each.
StatsAggregator is a good example: https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/search/aggregations/metrics/StatsAggregator.java#L90-L95
@elasticmachine update branch |
@elasticmachine run elasticsearch-ci/default-distro |
@elasticmachine run elasticsearch-ci/default-distro |
@elasticmachine update branch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! <3
@elasticmachine update branch |
Adds t_test metric aggregation that can perform paired and unpaired two-sample t-tests. In this PR support for filters in unpaired is still missing. It will be added in a follow-up PR. Relates to elastic#53692
Adds t_test metric aggregation that can perform paired and unpaired two-sample
t-tests. In this PR support for filters in unpaired is still missing. It will
be added in a follow-up PR.
Relates to #53692