Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Student's t-test aggregation support #54469

Merged
merged 11 commits into from
Apr 3, 2020

Conversation

imotov
Copy link
Contributor

@imotov imotov commented Mar 30, 2020

Adds t_test metric aggregation that can perform paired and unpaired two-sample
t-tests. In this PR support for filters in unpaired is still missing. It will
be added in a follow-up PR.

Relates to #53692

Adds t_test metric aggregation that can perform paired and unpaired two-sample
t-tests. In this PR support for filters in unpaired is still missing. It will
be added in a follow-up PR.

Relates to elastic#53692
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

Copy link
Contributor

@polyfractal polyfractal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments, this looks great! <3


A `t_test` metrics aggregation that performs a statistical hypothesis test in which the test statistic follows a Student's t-distribution
under the null hypothesis on numeric values extracted from the aggregated documents or generated by provided scripts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should perhaps add a layman explanation afterwards? Maybe something like "In practice, this will tell you if the difference between two population means are statistically significant", or something like that?


"aggregations": {
"startup_time_ttest": {
"value": 0.1914368843365979
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we state what "value" is? E.g. it's the p-value?

return new UnpairedTTestAggregator(name, numericMultiVS, tails, false, format, searchContext, parent, pipelineAggregators,
metadata);
default:
throw new UnsupportedOperationException("Unsupported t-test type " + testType);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, do we know if UnsupportedOpException is a 4xx or 5xx? If it's a 5xx, perhaps we should change this to an IllegalArgumentException?


public class PairedTTestState implements TTestState {

public static final String NAME = "P";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:D

states.forEach(tTestState -> {
PairedTTestState state = (PairedTTestState) tTestState;
reducer.accept(state.stats);
assert state.tails == tails;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for this to ever not match in practice (I mean, I know it shouldn't hence the assertion, but...)? I wonder if we should actually throw an exception rather than return a really incorrect result if we ever get this messed up?

return new LeafBucketCollectorBase(sub, docAValues) {
@Override
public void collect(int doc, long bucket) throws IOException {
statsBuilder.grow(bigArrays, bucket + 1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can move this inside the conditionals below? Right now we'll end up growing the statsBuilder bigarray even if none of the documents end up satisfying the conditions (e.g. if we are unlucky and they all have only one of the two fields).

The size provided to grow() is the min size required, so it's ok to not call grow until we actually need a particular bucket ordinal (and then it will back-fill all the empty bucket ords essentially)

}

public void grow(BigArrays bigArrays, long buckets) {
counts = bigArrays.grow(counts, buckets);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH, I'm not sure how expensive grow() is... I don't think it's very expensive so it might not matter. But some aggs that have a lot of big arrays to manage will call BigArrays#overSize() method directly and then resize each of their arrays, instead of grow'ing each.

StatsAggregator is a good example: https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/search/aggregations/metrics/StatsAggregator.java#L90-L95

@imotov
Copy link
Contributor Author

imotov commented Apr 2, 2020

@elasticmachine update branch

@imotov
Copy link
Contributor Author

imotov commented Apr 2, 2020

@elasticmachine run elasticsearch-ci/default-distro
@elasticmachine run elasticsearch-ci/docs

@imotov
Copy link
Contributor Author

imotov commented Apr 2, 2020

@elasticmachine run elasticsearch-ci/default-distro

@imotov
Copy link
Contributor Author

imotov commented Apr 2, 2020

@elasticmachine update branch

Copy link
Contributor

@polyfractal polyfractal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! <3

@imotov
Copy link
Contributor Author

imotov commented Apr 3, 2020

@elasticmachine update branch

@imotov imotov merged commit 5fc9fc5 into elastic:master Apr 3, 2020
imotov added a commit to imotov/elasticsearch that referenced this pull request Apr 3, 2020
Adds t_test metric aggregation that can perform paired and unpaired two-sample
t-tests. In this PR support for filters in unpaired is still missing. It will
be added in a follow-up PR.

Relates to elastic#53692
imotov added a commit that referenced this pull request Apr 6, 2020
Adds t_test metric aggregation that can perform paired and unpaired two-sample
t-tests. In this PR support for filters in unpaired is still missing. It will
be added in a follow-up PR.

Relates to #53692
@imotov imotov deleted the issue-53692-t-test branch May 1, 2020 22:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants