Add doc_count field mapper #58339

Closed
53 commits
4b5fab3
Initial version of doc_count field mapper
csoulios Apr 28, 2020
cd515b3
added tests
csoulios May 12, 2020
655e112
Build fixes
csoulios May 14, 2020
db13d83
Merge branch 'feature/aggregate-metrics' into doc_count-field-mapper
csoulios May 14, 2020
191d793
Added tests for doc_count fieldmapper
csoulios May 14, 2020
5f81bee
doc count tests
csoulios Jun 17, 2020
dab8219
Merge branch 'feature/aggregate-metrics' into doc_count-field-mapper
csoulios Jun 23, 2020
ecdc603
Merge branch 'feature/aggregate-metrics' into doc_count-field-mapper
csoulios Jun 30, 2020
520ac9a
Resolve conflicts after merge from master
csoulios Jun 30, 2020
676ffc6
Added yaml test for doc_count field type
csoulios Jun 30, 2020
7c7139c
Merge branch 'feature/aggregate-metrics' into doc_count-field-mapper
csoulios Jun 30, 2020
d3b9c45
Minor changes to test
csoulios Jun 30, 2020
c36ecac
Fix issue with not-registering field mapper
csoulios Jul 2, 2020
4dca391
Simplify terms agg test
csoulios Jul 2, 2020
912d943
Add doc_count provider in the buckets aggregator
csoulios Jul 9, 2020
be46a00
Merge branch 'feature/aggregate-metrics' into doc_count-field-mapper
csoulios Jul 9, 2020
c0f23ae
Initialize doc_count provider once
csoulios Jul 14, 2020
f7b43c1
Added tests for FieldBasedDocCountProvider
csoulios Jul 15, 2020
5e1b96a
Added more tests to DocCountFieldMapper
csoulios Jul 16, 2020
80d832b
Fixed NPE at AggregatorTestCase
csoulios Jul 16, 2020
1e8b472
Merge branch 'feature/aggregate-metrics' into doc_count-field-mapper
csoulios Jul 17, 2020
e24d680
Updated branch to fix build after merge
csoulios Jul 17, 2020
74c727b
Added validation for single doc_count field
csoulios Jul 17, 2020
cd2c84d
Added version skips to fix broken tests
csoulios Jul 17, 2020
91246eb
Added documentation for doc_count
csoulios Jul 17, 2020
77aa346
Changes to address review comments:
csoulios Jul 20, 2020
39c43a0
Use _doc_count as Lucene field for doc count
csoulios Jul 20, 2020
8ca3fbc
Minor change: field rename
csoulios Jul 20, 2020
83929cb
Minor change to yml test.
csoulios Jul 27, 2020
848fc77
Merge branch 'feature/aggregate-metrics' into doc_count-field-mapper
csoulios Sep 2, 2020
0a1731d
Fix errors from merge
csoulios Sep 2, 2020
82f092a
Converted _doc_count to metadata field type
csoulios Sep 2, 2020
ba92359
Throw an error if parsed value is not a number
csoulios Sep 10, 2020
cb61366
Merge branch 'feature/aggregate-metrics' into doc_count-field-mapper
csoulios Sep 21, 2020
522c385
Make _doc_count field a metadata field
csoulios Sep 21, 2020
df2a2eb
Fixed broken tests
csoulios Sep 21, 2020
838436f
Fix bug in low cardinality ordinal terms aggs
csoulios Sep 22, 2020
4a92c80
Update docs that _doc_count is a metadata field
csoulios Sep 22, 2020
5d6d037
Fix broken ML tests
csoulios Sep 23, 2020
0ff6fe1
Merge branch 'feature/aggregate-metrics' into doc_count-field-mapper
csoulios Sep 23, 2020
23e4b30
Fix errors after merge
csoulios Sep 23, 2020
b258653
Addressed review comments
csoulios Oct 2, 2020
f5ed1df
Addressed reviewer comments
csoulios Oct 19, 2020
2fcdcf6
Merge branch 'feature/aggregate-metrics' into doc_count-field-mapper
csoulios Oct 20, 2020
4138d16
Added DocCountFieldTypeTests
csoulios Oct 21, 2020
5d38b7f
Merge branch 'feature/aggregate-metrics' into doc_count-field-mapper
csoulios Oct 27, 2020
654847e
Fix errors after merge
csoulios Oct 27, 2020
7b7ca43
Make composite agg respect _doc_count field
csoulios Oct 27, 2020
ce44e87
Merge branch 'feature/aggregate-metrics' into doc_count-field-mapper
csoulios Oct 27, 2020
5621c44
Merge branch 'feature/aggregate-metrics' into doc_count-field-mapper
csoulios Oct 27, 2020
1d969a1
DocCountProvider rethrows IOException instead of swallowing it
csoulios Oct 27, 2020
cb05034
Set familyTypeName of _doc_count to integer
csoulios Nov 2, 2020
d7d80f4
Merge branch 'feature/aggregate-metrics' into doc_count-field-mapper
csoulios Nov 2, 2020
9 changes: 8 additions & 1 deletion docs/reference/mapping/fields.asciidoc
@@ -29,6 +29,13 @@ fields can be customized when a mapping is created.
The size of the `_source` field in bytes, provided by the
{plugins}/mapper-size.html[`mapper-size` plugin].

[discrete]
=== Doc count metadata field

<<mapping-doc-count-field,`_doc_count`>>::

A custom field used for storing doc counts when a document represents pre-aggregated data.

[discrete]
=== Indexing metadata fields

@@ -55,6 +62,7 @@ fields can be customized when a mapping is created.

Application specific metadata.

include::fields/doc-count-field.asciidoc[]

include::fields/field-names-field.asciidoc[]

@@ -69,4 +77,3 @@ include::fields/meta-field.asciidoc[]
include::fields/routing-field.asciidoc[]

include::fields/source-field.asciidoc[]

118 changes: 118 additions & 0 deletions docs/reference/mapping/fields/doc-count-field.asciidoc
@@ -0,0 +1,118 @@
[[mapping-doc-count-field]]
=== `_doc_count` field
++++
<titleabbrev>_doc_count</titleabbrev>
++++

Bucket aggregations always return a field named `doc_count` that shows the number of documents aggregated and partitioned
in each bucket. Computing the value of `doc_count` is simple: it is incremented by 1 for every document collected
in each bucket.

While this simple approach is effective when computing aggregations over individual documents, it fails to accurately represent
documents that store pre-aggregated data (such as `histogram` or `aggregate_metric_double` fields), because one summary field may
represent multiple documents.

To allow for correct computation of the number of documents when working with pre-aggregated data, we have introduced a
metadata field type named `_doc_count`. `_doc_count` must always be a positive integer representing the number of documents
aggregated in a single summary field.

When field `_doc_count` is added to a document, all bucket aggregations will respect its value and increment the bucket `doc_count`
by the value of the field. If a document does not contain any `_doc_count` field, `_doc_count = 1` is implied by default.
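
The counting rule described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the Elasticsearch implementation: the function name `terms_doc_counts`, the `service` field, and the sample documents are assumptions made up for the example.

```python
from collections import defaultdict


def terms_doc_counts(docs, field):
    """Return {term: doc_count}, adding each document's _doc_count (default 1)."""
    buckets = defaultdict(int)
    for doc in docs:
        if field not in doc:
            continue  # a document without the field lands in no bucket
        # A missing _doc_count is treated as 1, matching the implied default.
        buckets[doc[field]] += doc.get("_doc_count", 1)
    return dict(buckets)


# Hypothetical documents: two pre-aggregated summaries and one plain document.
docs = [
    {"service": "api", "_doc_count": 45},
    {"service": "web", "_doc_count": 62},
    {"service": "api"},  # no _doc_count, counts as a single document
]

counts = terms_doc_counts(docs, "service")
print(counts)  # {'api': 46, 'web': 62}
```

Without `_doc_count`, the same three documents would produce counts of 2 and 1, which understates the data that the summaries represent.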

[IMPORTANT]
========
* A `_doc_count` field can only store a single positive integer per document. Nested arrays are not allowed.
Contributor:

Random thought while reading the restrictions: is it possible to define `_doc_count` as an object? We should forbid that as well if it isn't already, but I suspect the current restrictions prevent it from being an object too.

Contributor Author (csoulios, Oct 27, 2020):

There is an assertion that the input is a VALUE_NUMBER in the parseCreateField() method:

XContentParserUtils.ensureExpectedToken(XContentParser.Token.VALUE_NUMBER, parser.currentToken(), parser);

Is there anything else that should be added?

* If a document contains no `_doc_count` fields, aggregators will increment by 1, which is the default behavior.
========
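
The restrictions above amount to a small validation rule. The following Python sketch shows one way to express it. It is an assumption-laden illustration, not the mapper's code: `validate_doc_count` is a hypothetical helper, and rejecting floats and numeric strings outright is a choice made for this example rather than documented behavior.

```python
def validate_doc_count(value):
    """Return value if it is a single positive integer, else raise ValueError."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError(
            f"_doc_count must be a single positive integer, got {type(value).__name__}"
        )
    if value <= 0:
        raise ValueError(f"_doc_count must be greater than zero, got {value}")
    return value


# Arrays, objects, strings, zero, and negative numbers are all rejected.
for bad in ([1, 2], {"value": 3}, "5", 0, -3, True):
    try:
        validate_doc_count(bad)
    except ValueError as err:
        print(f"rejected {bad!r}: {err}")

print(validate_doc_count(45))  # 45
```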

[[mapping-doc-count-field-example]]
==== Example

The following <<indices-create-index, create index>> API request creates a new index with the following field mappings:

* `my_histogram`, a `histogram` field used to store percentile data
* `my_text`, a `keyword` field used to store a title for the histogram

[source,console]
--------------------------------------------------
PUT my_index
{
  "mappings" : {
    "properties" : {
      "my_histogram" : {
        "type" : "histogram"
      },
      "my_text" : {
        "type" : "keyword"
      }
    }
  }
}
--------------------------------------------------

The following <<docs-index_,index>> API requests store pre-aggregated data for
two histograms: `histogram_1` and `histogram_2`.

[source,console]
--------------------------------------------------
PUT my_index/_doc/1
{
  "my_text" : "histogram_1",
  "my_histogram" : {
      "values" : [0.1, 0.2, 0.3, 0.4, 0.5],
      "counts" : [3, 7, 23, 12, 6]
   },
  "_doc_count": 45 <1>
}

PUT my_index/_doc/2
{
  "my_text" : "histogram_2",
  "my_histogram" : {
      "values" : [0.1, 0.25, 0.35, 0.4, 0.45, 0.5],
      "counts" : [8, 17, 8, 7, 6, 2]
   },
  "_doc_count": 62 <1>
}
--------------------------------------------------
<1> Field `_doc_count` must be a positive integer storing the number of documents aggregated to produce each histogram.

If we run the following <<search-aggregations-bucket-terms-aggregation, terms aggregation>> on `my_index`:

[source,console]
--------------------------------------------------
GET /_search
{
    "aggs" : {
        "histogram_titles" : {
            "terms" : { "field" : "my_text" }
        }
    }
}
--------------------------------------------------

We will get the following response:

[source,console-result]
--------------------------------------------------
{
    ...
    "aggregations" : {
        "histogram_titles" : {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets" : [
                {
                    "key" : "histogram_2",
                    "doc_count" : 62
                },
                {
                    "key" : "histogram_1",
                    "doc_count" : 45
                }
            ]
        }
    }
}
--------------------------------------------------
// TESTRESPONSE[skip:test not setup]
@@ -0,0 +1,150 @@
setup:
  - do:
      indices.create:
        index: test_1
        body:
          settings:
            number_of_replicas: 0
          mappings:
            properties:
              str:
                type: keyword
              number:
                type: integer

  - do:
      bulk:
        index: test_1
        refresh: true
        body:
          - '{"index": {}}'
          - '{"_doc_count": 10, "str": "abc", "number" : 500, "unmapped": "abc" }'
          - '{"index": {}}'
          - '{"_doc_count": 5, "str": "xyz", "number" : 100, "unmapped": "xyz" }'
          - '{"index": {}}'
          - '{"_doc_count": 7, "str": "foo", "number" : 100, "unmapped": "foo" }'
          - '{"index": {}}'
          - '{"_doc_count": 1, "str": "foo", "number" : 200, "unmapped": "foo" }'
          - '{"index": {}}'
          - '{"str": "abc", "number" : 500, "unmapped": "abc" }'

---
"Test numeric terms agg with doc_count":
  - skip:
      version: " - 7.99.99"
      reason: "Doc count fields are only implemented in 8.0"

  - do:
      search:
        rest_total_hits_as_int: true
        body: { "size" : 0, "aggs" : { "num_terms" : { "terms" : { "field" : "number" } } } }

  - match: { hits.total: 5 }
  - length: { aggregations.num_terms.buckets: 3 }
  - match: { aggregations.num_terms.buckets.0.key: 100 }
  - match: { aggregations.num_terms.buckets.0.doc_count: 12 }
  - match: { aggregations.num_terms.buckets.1.key: 500 }
  - match: { aggregations.num_terms.buckets.1.doc_count: 11 }
  - match: { aggregations.num_terms.buckets.2.key: 200 }
  - match: { aggregations.num_terms.buckets.2.doc_count: 1 }


---
"Test keyword terms agg with doc_count":
  - skip:
      version: " - 7.99.99"
      reason: "Doc count fields are only implemented in 8.0"
  - do:
      search:
        rest_total_hits_as_int: true
        body: { "size" : 0, "aggs" : { "str_terms" : { "terms" : { "field" : "str" } } } }

  - match: { hits.total: 5 }
  - length: { aggregations.str_terms.buckets: 3 }
  - match: { aggregations.str_terms.buckets.0.key: "abc" }
  - match: { aggregations.str_terms.buckets.0.doc_count: 11 }
  - match: { aggregations.str_terms.buckets.1.key: "foo" }
  - match: { aggregations.str_terms.buckets.1.doc_count: 8 }
  - match: { aggregations.str_terms.buckets.2.key: "xyz" }
  - match: { aggregations.str_terms.buckets.2.doc_count: 5 }

---
"Test unmapped string terms agg with doc_count":
  - skip:
      version: " - 7.99.99"
      reason: "Doc count fields are only implemented in 8.0"
  - do:
      bulk:
        index: test_2
        refresh: true
        body:
          - '{"index": {}}'
          - '{"_doc_count": 10, "str": "abc" }'
          - '{"index": {}}'
          - '{"str": "abc" }'
  - do:
      search:
        index: test_2
        rest_total_hits_as_int: true
        body: { "size" : 0, "aggs" : { "str_terms" : { "terms" : { "field" : "str.keyword" } } } }

  - match: { hits.total: 2 }
  - length: { aggregations.str_terms.buckets: 1 }
  - match: { aggregations.str_terms.buckets.0.key: "abc" }
  - match: { aggregations.str_terms.buckets.0.doc_count: 11 }

---
"Test composite str_terms agg with doc_count":
  - skip:
      version: " - 7.99.99"
      reason: "Doc count fields are only implemented in 8.0"
  - do:
      search:
        rest_total_hits_as_int: true
        body: { "size" : 0, "aggs" :
                  { "composite_agg" : { "composite" :
                    {
                      "sources": ["str_terms": { "terms": { "field": "str" } }]
                    }
                  }
                  }
              }

  - match: { hits.total: 5 }
  - length: { aggregations.composite_agg.buckets: 3 }
  - match: { aggregations.composite_agg.buckets.0.key.str_terms: "abc" }
  - match: { aggregations.composite_agg.buckets.0.doc_count: 11 }
  - match: { aggregations.composite_agg.buckets.1.key.str_terms: "foo" }
  - match: { aggregations.composite_agg.buckets.1.doc_count: 8 }
  - match: { aggregations.composite_agg.buckets.2.key.str_terms: "xyz" }
  - match: { aggregations.composite_agg.buckets.2.doc_count: 5 }


---
"Test composite num_terms agg with doc_count":
  - skip:
      version: " - 7.99.99"
      reason: "Doc count fields are only implemented in 8.0"
  - do:
      search:
        rest_total_hits_as_int: true
        body: { "size" : 0, "aggs" :
                  { "composite_agg" :
                    { "composite" :
                      {
                        "sources": ["num_terms" : { "terms" : { "field" : "number" } }]
                      }
                    }
                  }
              }

  - match: { hits.total: 5 }
  - length: { aggregations.composite_agg.buckets: 3 }
  - match: { aggregations.composite_agg.buckets.0.key.num_terms: 100 }
  - match: { aggregations.composite_agg.buckets.0.doc_count: 12 }
  - match: { aggregations.composite_agg.buckets.1.key.num_terms: 200 }
  - match: { aggregations.composite_agg.buckets.1.doc_count: 1 }
  - match: { aggregations.composite_agg.buckets.2.key.num_terms: 500 }
  - match: { aggregations.composite_agg.buckets.2.doc_count: 11 }
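
The expected values asserted in these tests can be reproduced by hand from the five documents indexed in `setup`. The Python sketch below is only a worked check of that arithmetic, not part of the PR: the helper names are hypothetical, and it assumes the default orderings (terms buckets by descending `doc_count`, composite buckets by ascending key).

```python
from collections import defaultdict

# The five documents indexed into test_1 by the setup block.
FIXTURE = [
    {"_doc_count": 10, "str": "abc", "number": 500},
    {"_doc_count": 5, "str": "xyz", "number": 100},
    {"_doc_count": 7, "str": "foo", "number": 100},
    {"_doc_count": 1, "str": "foo", "number": 200},
    {"str": "abc", "number": 500},  # no _doc_count, counts as 1
]


def bucket_counts(docs, field):
    """Sum _doc_count (default 1) per distinct value of field."""
    totals = defaultdict(int)
    for doc in docs:
        totals[doc[field]] += doc.get("_doc_count", 1)
    return totals


def terms_order(docs, field):
    # Assumed terms ordering: doc_count descending, ties broken by key ascending.
    return sorted(bucket_counts(docs, field).items(), key=lambda kv: (-kv[1], kv[0]))


def composite_order(docs, field):
    # Assumed composite ordering: key ascending.
    return sorted(bucket_counts(docs, field).items())


print(terms_order(FIXTURE, "number"))      # [(100, 12), (500, 11), (200, 1)]
print(terms_order(FIXTURE, "str"))         # [('abc', 11), ('foo', 8), ('xyz', 5)]
print(composite_order(FIXTURE, "number"))  # [(100, 12), (200, 1), (500, 11)]
print(composite_order(FIXTURE, "str"))     # [('abc', 11), ('foo', 8), ('xyz', 5)]
```

Note that `hits.total` stays at 5 in every test: `_doc_count` changes bucket counts only, not the number of matching documents.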
