Track when fields contain multi-value arrays and expose in field_caps or mappings #64077

Closed
wylieconlon opened this issue Oct 22, 2020 · 9 comments
Labels
:Analytics/Aggregations, >enhancement, :Search Foundations/Mapping, stalled, Team:Analytics, Team:Search Foundations

Comments

@wylieconlon

Multi-value arrays cause problems in Kibana because we assume fields are single-valued by default, while a multi-value field can produce bucket doc_counts that add up to more than the total number of documents in the index. So instead of showing numbers like "100% of documents" we might end up showing "200% of documents", which looks completely wrong.

The request here is for Elasticsearch to make it easier to know that a field is expected to contain multiple values, in the mapping and field_caps responses. This would help us generate the right queries from Kibana.

On field_caps, we might get a response like this:

    "products.product_name.keyword" : {
      "keyword" : {
        "type" : "keyword",
        "searchable" : true,
        "aggregatable" : true,
        "multi_value": true
      }
    },
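
A client could read such a flag roughly as follows. This is only a sketch: `multi_value` is the key proposed in this issue and is not part of the field_caps response today, the helper name `is_multi_value` is made up for illustration, and the sample response is the snippet above wrapped in the usual top-level `fields` object.

```python
def is_multi_value(field_caps: dict, field: str) -> bool:
    """Return True if any mapped type of `field` carries the proposed flag."""
    per_type = field_caps.get("fields", {}).get(field, {})
    # `multi_value` is hypothetical (proposed in this issue). When it is
    # absent, fall back to the single-value assumption Kibana makes today.
    return any(caps.get("multi_value", False) for caps in per_type.values())


# Sample response built from the proposed snippet above.
response = {
    "fields": {
        "products.product_name.keyword": {
            "keyword": {
                "type": "keyword",
                "searchable": True,
                "aggregatable": True,
                "multi_value": True,
            }
        }
    }
}

print(is_multi_value(response, "products.product_name.keyword"))
```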

Example of the problem

In this example we have the Terms aggregation reporting 7,409 as the overall doc_count, while there are only 4,675 total hits in the query. This is confusing and needs to be handled by the client code:

POST kibana_sample_data_ecommerce/_search
{
  "size": 0,
  "aggs": {
    "names": {
      "terms": {
        "field": "products.category.keyword"
      }
    },
    "overall": {
      "sum_bucket": {
        "buckets_path": "names>_count"
      }
    }
  }
}

I would expect that the overall value in the response would equal the total number of documents. But this is not the case:

{
  "took" : 9,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 4675,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "names" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "Men's Clothing",
          "doc_count" : 2024
        },
        {
          "key" : "Women's Clothing",
          "doc_count" : 1903
        },
        {
          "key" : "Women's Shoes",
          "doc_count" : 1136
        },
        {
          "key" : "Men's Shoes",
          "doc_count" : 944
        },
        {
          "key" : "Women's Accessories",
          "doc_count" : 830
        },
        {
          "key" : "Men's Accessories",
          "doc_count" : 572
        }
      ]
    },
    "overall" : {
      "value" : 7409.0
    }
  }
}
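
The mismatch can be reproduced from the numbers in this response alone. A minimal sketch of the arithmetic a client ends up doing when it treats every bucket count as a share of the total hits (the function name `share_of_documents` is made up for illustration):

```python
# Values copied from the response above.
total_hits = 4675
bucket_counts = {
    "Men's Clothing": 2024,
    "Women's Clothing": 1903,
    "Women's Shoes": 1136,
    "Men's Shoes": 944,
    "Women's Accessories": 830,
    "Men's Accessories": 572,
}


def share_of_documents(count: int, total: int) -> float:
    """Percentage of `total` documents represented by `count`."""
    return 100.0 * count / total


overall = sum(bucket_counts.values())
print(overall)  # 7409, matching the sum_bucket value
print(round(share_of_documents(overall, total_hits), 1))  # 158.5
```

Each bucket on its own is a valid share of documents (Men's Clothing is about 43.3%), but the shares add up to well over 100% because a document with several categories is counted once per category.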
@wylieconlon added the >enhancement, :Search Foundations/Mapping, Team:Search, and needs:triage labels on Oct 22, 2020
@elasticmachine
Collaborator

Pinging @elastic/es-search (:Search/Mapping)

@elasticmachine
Collaborator

Pinging @elastic/es-analytics-geo (:Analytics/Aggregations)

@elasticmachine added the Team:Analytics label on Oct 22, 2020
@jtibshirani removed the needs:triage label on Oct 22, 2020
@nik9000
Member

nik9000 commented Oct 26, 2020

I just noticed this one! I've needed to test whether a field has more than one value deep inside a query implementation I'm working on. I'm fairly sure this information is available in Lucene for numbers, but I haven't checked whether it is available for keyword-style indexes. What would this mean for text fields? Are you interested in the number of actual fields or in the number of tokens the field produces?

@romseygeek
Contributor

If this is something to be exposed in field caps or mappings then we don't want to look at values in the index itself for performance reasons. We've previously discussed adding flags to mappers to forbid multiple values in #58523

@wylieconlon
Author

wylieconlon commented Oct 26, 2020

@nik9000 It's not relevant for the use cases I was considering where fielddata is disabled, but I imagine that if a user has enabled fielddata then we should treat it as a multi-value field. In the default case it would be single-value.

The reasoning behind this is shown in the example above: it's important to know the denominator of the terms aggregation compared to the total number of values.

@nik9000
Member

nik9000 commented Oct 26, 2020

> If this is something to be exposed in field caps or mappings then we don't want to look at values in the index itself for performance reasons. We've previously discussed adding flags to mappers to forbid multiple values in #58523

You can ask Points how many documents and how many points it contains - if they are the same then all docs are single valued. Is that the kind of metadata stuff we do with field caps?
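
The comparison described here can be modelled in a few lines. This is a toy sketch, not Lucene code: `PointStats` is a made-up type standing in for the per-field statistics, on the assumption that they correspond to Lucene's `PointValues.getDocCount()` (documents with at least one point) and `PointValues.size()` (total indexed points). The numbers are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PointStats:
    """Hypothetical container for per-field point statistics."""
    doc_count: int    # documents that have at least one point for the field
    point_count: int  # total number of points indexed for the field


def all_single_valued(stats: PointStats) -> bool:
    """True when every document with the field holds exactly one point."""
    # Each document contributes at least one point, so equality means
    # no document contributed more than one.
    return stats.point_count == stats.doc_count


print(all_single_valued(PointStats(doc_count=1000, point_count=1000)))
print(all_single_valued(PointStats(doc_count=1000, point_count=1500)))
```

The check costs no per-document work, but it only covers fields indexed as points, which is why keyword and text fields come up as open questions in this thread.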

@jimczi
Contributor

jimczi commented Oct 28, 2020

Don't you have this information in the response already? The sum is greater than the doc count, so some documents must have multiple values. Can you clarify what you'd like to do with this information at the field caps level? It's hard to see why it would be helpful to know globally and what actions you can trigger from this info.

@wylieconlon
Author

@jimczi We don't always have this information, for example when the top terms add up to less than the total count of documents, or when we aren't tracking total hits. The important part of the example above is that it's often ambiguous in the UI whether we are showing a "count of documents" or a "count of values in an array". It would help us provide a less ambiguous UI if we knew for sure that a field is a singleton. Unlike #58523, what I'm asking for in this issue is to assume that fields are singletons until they aren't.
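
To make the gap concrete, here is a minimal sketch of the response-based check and where it gives no answer. The function name is made up for illustration, and the sketch assumes the terms counts are exact (`doc_count_error_upper_bound` of 0, as in the example above).

```python
from typing import Optional, Sequence


def multi_value_evidence(
    total_hits: int,
    relation: str,
    bucket_counts: Sequence[int],
    sum_other_doc_count: int = 0,
) -> Optional[bool]:
    """Return True if the response proves multi-valued docs, else None.

    The check is one-sided: it can confirm multiple values but can never
    confirm that a field is single-valued.
    """
    if relation != "eq":
        # Total hits is only a lower bound, so there is no usable denominator.
        return None
    counted = sum(bucket_counts) + sum_other_doc_count
    if counted > total_hits:
        return True
    # Documents missing the field can hide multi-valued ones: inconclusive.
    return None


# Numbers from the example at the top of the issue.
print(multi_value_evidence(4675, "eq", [2024, 1903, 1136, 944, 830, 572]))
# Same buckets, but total hits not tracked exactly.
print(multi_value_evidence(4675, "gte", [2024, 1903, 1136, 944, 830, 572]))
```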

By having a flag that indicates that a field is expected to contain more than one value, we can guide users away from some bad practices:

  • Correlating two multi-value array fields in the same visualization does not work as expected most of the time. We could suggest using the nested mapping or creating separate documents
  • Converting multi-value array fields to a percentage is probably misleading as shown in the example above
  • Doing searches on multi-value arrays can be confusing, and we could provide better help text when this is happening

We can't guide users in this way unless we have this extra information. We could work around it by analyzing individual documents with the fields API, but what if we didn't need to?
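
For reference, the workaround could look roughly like this. The sketch assumes a search request that uses the `fields` option, which returns every field's values as an array on each hit. The helper name and the two sample hits are invented for illustration and are not taken from the sample data set.

```python
def multi_valued_fields(search_response: dict) -> set:
    """Names of fields holding more than one value in at least one hit."""
    found = set()
    for hit in search_response.get("hits", {}).get("hits", []):
        # With the `fields` option, each field maps to a list of values.
        for name, values in hit.get("fields", {}).items():
            if len(values) > 1:
                found.add(name)
    return found


# Invented sample in the shape of a response to a request with
# "fields": ["products.category.keyword", "currency"].
sample = {
    "hits": {
        "hits": [
            {"fields": {"products.category.keyword": ["Men's Clothing", "Men's Shoes"],
                        "currency": ["EUR"]}},
            {"fields": {"products.category.keyword": ["Women's Shoes"],
                        "currency": ["EUR"]}},
        ]
    }
}

print(multi_valued_fields(sample))
```

Sampling like this can only show that a field is multi-valued in the documents inspected. It cannot show that a field is single-valued everywhere, which is the guarantee a flag would give.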

@javanna
Member

javanna commented May 1, 2023

This issue mentions the same or similar problems as #80825 when it comes to single valued fields vs multi valued fields, as well as the need to possibly enforce single value and expose the info in field_caps. I am closing in favour of #80825.

@javanna closed this as not planned on May 1, 2023
@javanna added the Team:Search Foundations label and removed the Team:Search label on Jul 16, 2024
8 participants