Computers need IDs, people want labels #17877

Closed
Tracked by #179668
elasticmachine opened this issue Feb 10, 2017 · 20 comments
Labels
Feature:Graph Graph application feature impact:low Addressing this issue will have a low level of impact on the quality/strength of our product. Team:Visualizations Visualization editors, elastic-charts and infrastructure

Comments

@elasticmachine
Contributor

Original comment by @markharwood:

This old chestnut is a general concern with Kibana and specifically an issue in Graph.

The unit of our analysis is terms (terms aggs, significant_terms etc) and for this reason they need to be unique:

  • There is more than one movie called "crash" in the movielens data
  • There is more than one John Smith in a bank's records.

Consequently, to avoid confusion, unique IDs are generated to represent these entities and we must index those for analysis BUT - when visualizing data in graph UI or elsewhere people typically don't want to see the ugly IDs and want useful labels instead. This translation service could be a configurable feature of graph ("the label for ID field X can be found in index Y and field Z"). This translation can be implemented as a single multi-get operation when new IDs are loaded into the graph workspace. Equally this could be a general feature as part of Kibana for use in all visualizations.
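
A minimal sketch of the single multi-get lookup described above, assuming a lookup index whose document IDs are the vertex IDs. The index name `users`, the label field `display_name` and both helper functions are hypothetical; the request and response shapes follow the Elasticsearch multi-get (`_mget`) API, and the sample response is hand-written rather than fetched from a cluster.

```python
from __future__ import annotations

from typing import Any


def build_mget_body(vertex_ids: list[str]) -> dict[str, Any]:
    """Body for GET /<lookup_index>/_mget: one request covers all new IDs."""
    # dict.fromkeys de-duplicates while keeping first-seen order.
    return {"ids": list(dict.fromkeys(vertex_ids))}


def labels_from_mget(response: dict[str, Any], label_field: str) -> dict[str, str]:
    """Map each ID to its label, falling back to the raw ID when none is found."""
    labels: dict[str, str] = {}
    for doc in response.get("docs", []):
        doc_id = doc["_id"]
        source = doc.get("_source", {}) if doc.get("found") else {}
        labels[doc_id] = str(source.get(label_field, doc_id))
    return labels


# Hand-written sample in the shape the multi-get API returns (hypothetical data).
sample_response = {
    "docs": [
        {"_index": "users", "_id": "2348787", "found": True,
         "_source": {"display_name": "John Smith"}},
        {"_index": "users", "_id": "9999999", "found": False},
    ]
}

body = build_mget_body(["2348787", "9999999", "2348787"])
labels = labels_from_mget(sample_response, "display_name")
print(body)
print(labels)
```

Falling back to the raw ID keeps a vertex displayable even when the lookup index has no matching document.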

In looking at Panama papers I was forced to index terms that were both an ID and a label - the ID was required to avoid merging multiple "John Smith"s into one but the label was also required to be useful to end users. This made for an ugly UI and added code to the ingest process. The bank client forked the graph UI to trim the ID part of the term from the displayed terms in order to make the UI less ugly.

@elasticmachine
Contributor Author

Original comment by @colings86:

This looks similar to #5009 from @skearns64

@elasticmachine
Contributor Author

Original comment by @markharwood:

@skearns64 Looks like users need a solution here. For Graph we need the unambiguity of a unique ID for correct linking purposes but the readability of a label. Here are the options I see:

  1. Index ID and label together in one term then trim off ID at display time (I hacked this into the deployment at a bank for panama papers investigations)
  2. Index IDs, look up labels at display time.

Option 2 feels like the more robust solution. It would also allow retrieval of properties other than labels e.g. image URLs (think mugshots in a policing system). Before I built the generic Graph UI we have today, I built several bespoke apps on the graph API for the datasets Wikipedia, MovieLens and BestBuy. Each of these used a call-out which took the new vertex IDs being loaded into the workspace and did an mget to load JSON docs that could be attached as metadata to the nodes and used in displays. Clearly this custom code could be replaced with some generic UI settings to define the mapping. The question is where to put this setting - is it
a) part of the general Kibana field formatting definitions or
b) part of the graph field definitions?

Thoughts?

@elasticmachine
Contributor Author

Original comment by @skearns64:

@markharwood - yea, I agree that this would be super useful. I'd love to hear from @epixa about how he sees #5009, and whether he feels that we should solve this (resolve IDs to strings for display) in Kibana generically via that issue (or another), or whether that is far enough out, or this feels like a separate enough use case, that we should consider adding it directly to the Graph UI.

@elasticmachine
Contributor Author

Original comment by @markharwood:

The way I see a basic label lookup being used in Graph is to add the following fields to define the lookup index:

!LINK REDACTED

In the StackOverflow data shown in the example I am graphing documents of the type "Post" (to draw out the connections between users and tags e.g. who might be an expert in #elasticsearch). Tags conveniently serve as both an ID and a label, but unique user IDs are needed for graphing and a user name must be shown for them to be readable. The StackOverflow data has a separate index for users containing their display names, bios, image URLs etc. and we would need to take the IDs we use to identify vertices uniquely and look up the user name from this index.
We could also potentially support use of the image URLs in this case but I propose we start with implementing simple label lookups first using the mget approach I outlined in an earlier comment.
Note that the dropdown for indices would need to list physical indices known to the cluster, not index patterns known to Kibana, because we want the speed and certainty of a direct GET lookup by ID rather than issuing searches with an ID. What we look up will tend to be nouns not events anyway, so they are less likely to appear in the time-based index patterns declared in Kibana for managing event stores.

@elasticmachine
Contributor Author

Original comment by @skearns64:

@markharwood - I think that's a workable approach if we only wanted it for Graph, but I expect that we'll want some sort of control like this for Kibana more generally, perhaps as part of index pattern definition?

cc @epixa

@elasticmachine
Contributor Author

Original comment by @markharwood:

I spoke to @spalger last night and he suggested keeping this to Graph only for now (I thought he was adding a comment to this ticket as we spoke so it may have wound up somewhere else?)

@elasticmachine
Contributor Author

Original comment by @epixa:

I agree that this should just go into graph right now. This is something that I want to tackle in Kibana, but if we wait to do that, it could be months before we get around to it.

@elasticmachine
Contributor Author

Original comment by @mikeh-elastic:

In my customer discussions using the uid for the graph exploration and displaying something like a first+last name and/or an image url for the icon has been requested.

@elasticmachine
Contributor Author

Original comment by @markharwood:

Had a customer discussion and they want to map out their infrastructure of nodes/services and show dependencies from a store of health-check events (e.g. "service A called service B OK").
When a service has a bad time they want a marker icon (perhaps another issue...) but clicking on a service vertex should reveal the email address of the person responsible for this service.

So to expand the scope of this ticket from "computers want IDs, people want labels" this should perhaps be renamed "event stores hold minimal info, people want detail on the entities they reference".

As a foundation it would be useful to assume that each vertex loaded into the graph could optionally have a looked-up JSON structure with many fields that we could use to populate various parts of the graph UI.

@elasticmachine
Contributor Author

Original comment by @markharwood:

I'm going off this idea of attaching labels only at the last minute when IDs need displaying.

It helps avoid a common ETL step (enriching incoming events with reference data looked up by ID) but a normalized event store with only IDs makes basic free-text searching difficult and also prevents the proposed LINK REDACTED where the combination of person, address and company name labels held in docs give relevance ranking algorithms lots of useful data to chew on.

If we adopt the convention I outlined of combining ID and label, e.g. [2348787] John Smith, this can be indexed as:

  • an untokenized field for graphing etc
  • a tokenized form for free-text search

If this is a common enough convention the graph UI could have special treatment for "ID plus label" tokens:

  1. There could be an option to strip ugly IDs from labels used in the UI or only reveal IDs on hover
  2. Terms that share a common ID but have a different label (e.g. OFAC's aliases) could automatically be linked or grouped in the UI.
  3. Following hard links (hitting the expand button in the graph UI) would search by the ID part of the node term only.
  4. Following LINK REDACTED would search using the label part, not the ID (makes sense because soft linking is essentially for discovering things like ID:x but not ID:x). The ID part is irrelevant in this case and occasionally problematic e.g. when soft-linking from PanamaPapers IDs+Labels to OFAC sanctions data the panama IDs would occasionally match a zipcode in OFAC.

We would need to add a configuration option in graph to declare a field's terms as "containing IDs and labels" and then all of the above functionality could be unlocked.
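
The special treatment listed above can be sketched as follows, assuming tokens are encoded as `[<id>] <label>`. All function names here are hypothetical and nothing like this exists in the Graph UI; the sketch only illustrates stripping IDs for display, grouping aliases that share an ID, and choosing the search text for hard versus soft links.

```python
import re
from collections import defaultdict
from typing import NamedTuple, Optional

# Assumed encoding: "[<id>] <label>", as in "[2348787] John Smith".
TOKEN_RE = re.compile(r"^\[([^\]]+)\]\s*(.*)$")


class IdLabel(NamedTuple):
    id: str
    label: str


def parse_token(term: str) -> Optional[IdLabel]:
    """Split an ID-plus-label term; return None for plain terms."""
    match = TOKEN_RE.match(term)
    return IdLabel(match.group(1), match.group(2)) if match else None


def display_label(term: str) -> str:
    """Feature 1: show only the label, keeping plain terms unchanged."""
    parsed = parse_token(term)
    return parsed.label if parsed and parsed.label else term


def group_aliases(terms: list) -> dict:
    """Feature 2: collect the labels that share one ID."""
    groups = defaultdict(list)
    for term in terms:
        parsed = parse_token(term)
        if parsed:
            groups[parsed.id].append(parsed.label)
    return dict(groups)


def link_query_text(term: str, soft: bool) -> str:
    """Features 3 and 4: hard links search by ID, soft links by label."""
    parsed = parse_token(term)
    if parsed is None:
        return term
    return parsed.label if soft else parsed.id


terms = ["[12] John Smith", "[12] J. Smith", "[13] John Smith"]
print(display_label(terms[0]))
print(group_aliases(terms))
print(link_query_text(terms[0], soft=False), link_query_text(terms[0], soft=True))
```

Terms that do not match the encoding pass through untouched, so fields such as tags that already serve as both ID and label need no special handling.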

@elasticmachine
Contributor Author

Original comment by @markharwood:

Following a discussion with @colings86 we outlined the following possible options for generally associating labels with IDs:

  1. ID-plus-label untokenized terms (e.g. [12342] John Smith)
  2. Nested docs (nested doc contains ID plus arbitrary associated info)
  3. Lookup-index (the original proposal here to look up arbitrary info from a "noun-store").

A fourth option is to try to associate a label for a given field from the same (non-nested) Lucene doc but this is not practical for a variety of reasons.

For my money option 1 is the least-worst scenario and so we could phase this in as follows:

  1. Advocate ID-plus-label as the best-practice indexing approach for analytics - labels are generally not unique (more than one "John Smith") so not useful for analytics and unique IDs on their own are not readable. Write blogs etc describing the practice and people can work with elastic stack as it stands today.
  2. Build support into mappings that help define single ID-plus-label tokens using an encoding that later allows IDs to be separated from labels e.g. square brackets around [ID] label
  3. Build support into Kibana/graph analytics that has special treatment for ID-plus-label tokens e.g. the four features listed in my previous comment.

We continually butt heads with the need for hard IDs and softer, human-understandable labels so we need to find a way through this.

@elasticmachine
Contributor Author

Original comment by @clintongormley:

@markharwood an extension of option 1 would be to set up an analyzer like the following (requires elastic/elasticsearch#18064 to be added to work properly):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "names": {
          "tokenizer": "standard",
          "char_filter": [
            "exclude_id"
          ],
          "filter": [
            "lowercase"
          ]
        },
        "id": {
          "tokenizer": "keyword",
          "char_filter": [
            "extract_id"
          ]
        }
      },
      "char_filter": {
        "extract_id": {
          "type": "pattern_replace",
          "pattern": "^\\[(\\d+)\\].*$",
          "replacement": "$1"
        },
        "exclude_id": {
          "type": "pattern_replace",
          "pattern": "^\\[\\d+\\]\\s*(.*)$",
          "replacement": "$1"
        }
      }
    }
  },
  "mappings": {
    "entity": {
      "properties": {
        "entity": {
          "type": "text",
          "analyzer": "names",
          "fields": {
            "id": {
              "type": "keyword",
              "analyzer": "id"
            }
          }
        }
      }
    }
  }
}

Then you can use the entity.id field for entity linking (and it uses doc values), entity for name search, and _source.entity for display.
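
To see what the two `pattern_replace` char filters would produce, they can be approximated outside Elasticsearch with Python's `re` module. This is only an approximation: Elasticsearch uses Java regular expressions (with `$1` as the group reference where Python uses `\1`), and the lowercase-and-split step below is a rough stand-in for the standard tokenizer plus lowercase filter, not a faithful reimplementation.

```python
import re

# Same patterns as the "extract_id" and "exclude_id" char filters above.
EXTRACT_ID = (re.compile(r"^\[(\d+)\].*$"), r"\1")
EXCLUDE_ID = (re.compile(r"^\[\d+\]\s*(.*)$"), r"\1")


def apply_char_filter(char_filter, text):
    """Replace the matched text, leaving non-matching input unchanged."""
    pattern, replacement = char_filter
    return pattern.sub(replacement, text)


raw = "[12342] John Smith"

# What the keyword-style "id" sub-field would hold for linking.
id_term = apply_char_filter(EXTRACT_ID, raw)

# What the "names" analyzer would see before tokenizing.
name_text = apply_char_filter(EXCLUDE_ID, raw)
name_tokens = name_text.lower().split()  # rough stand-in for standard + lowercase

print(id_term)
print(name_tokens)
```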

Obviously this doesn't just work out of the box, which is a downside.

@elasticmachine
Contributor Author

Original comment by @markharwood:

Thanks for the mapping, @clintongormley !

> and _source.entity for display.

The problem is most of our analytics (Kibana bar charts, Graph UI...) is on agg results from fielddata/docvalues so accessing _source of individual docs for display purposes is out of the question.
I was thinking more of an "indexed-for-analytics" encoding convention that combines ID and label where the consuming app (Kibana/Graph) can have a standard way to split ID from label if it knows from the mapping that the terms are encoded that way.

@elasticmachine
Contributor Author

Original comment by @clintongormley:

Actually, you could store the [id] name as a keyword field and be done with it (and keep the text field for search). That way it would work with aggs as well

@elasticmachine
Contributor Author

Original comment by @markharwood:

> Actually, you could store the [id] name as a keyword field and be done with it

I think the basis of my proposal is we go one step further - we have a type called id/label that is like keyword but acts as a marker to tell any consuming analytics client tool that it can treat the contents as an ID and a label which can be split.
If we don't have this then any analytic tool is always going to present ugly keyword terms with non-detachable IDs.
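
A rough sketch of how a consuming client might use such a marker when rendering terms aggregation buckets. The `id_label` type name, the `FIELD_TYPES` table and the `format_buckets` helper are all hypothetical (no such field type exists in Elasticsearch); only the bucket shape (`key`, `doc_count`) follows the terms aggregation response.

```python
import re

# Hypothetical per-field markers a client could read from the mapping.
FIELD_TYPES = {"user": "id_label", "tag": "keyword"}

# Assumed encoding: "[<id>] <label>".
TOKEN_RE = re.compile(r"^\[([^\]]+)\]\s*(.*)$")


def format_buckets(field, buckets, field_types):
    """Split bucket keys into ID and label only for fields marked id_label."""
    splittable = field_types.get(field) == "id_label"
    rows = []
    for bucket in buckets:
        key = bucket["key"]
        match = TOKEN_RE.match(key) if splittable else None
        rows.append({
            "id": match.group(1) if match else key,
            "label": match.group(2) if match else key,
            "doc_count": bucket["doc_count"],
        })
    return rows


user_rows = format_buckets(
    "user", [{"key": "[42] John Smith", "doc_count": 7}], FIELD_TYPES)
tag_rows = format_buckets(
    "tag", [{"key": "[42] literal", "doc_count": 3}], FIELD_TYPES)
print(user_rows)
print(tag_rows)
```

Without the marker the client cannot tell an encoded token from a plain keyword that happens to start with a bracket, which is why the second call leaves its key intact.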

@elasticmachine
Contributor Author

Original comment by @skearns64:

Do we have field-level metadata in the mappings? I wonder if a middle-road here would be to support metadata on the field level in the mappings. This metadata could be used by default in native ES to explain "magic" like how dynamically detected string field foo has a foo.keyword. The metadata could, in theory hold useful things like "aggregatable" (for Kibana to use) as well as describing relationships between fields (ID->display, etc). Copy-to could also do similar annotations. If all that were available, then we could have a "concat" in addition to copy-to, which would meet the need here?

Maybe this is crazy though :)

@elasticmachine
Contributor Author

Original comment by @markharwood:

> Do we have field-level metadata in the mappings?

I don't believe so but we have doctype-level metadata which can be arbitrarily complex JSON used to describe the doc as a whole. We could use that to refer to fields by name e.g.

PUT test
{
  "mappings": {
    "doc": {
      "_meta": {
        "AggregatableFields": [
          {
            "SKUAndName": {
              "ID": "product.sku",
              "Name": "product.name"
            }
          }
        ]
      },
      "properties": {
        ...

Obviously we'd need to work on what convention we might want to adopt for use in there.

It's important to remember arrays of things e.g. products in an order cannot easily keep the relationship between the various product IDs and associated product names without resorting to nested docs or complex script logic about same-array-positions. This is why I advocate a convention of combined ID+label tokens in the source docs and mapping logic to support splitting them.
If Kibana is to build support only on a mapping convention the one we have outlined here (my mapping metadata and Clinton's analyzer example) feels like a pretty long-winded way of declaring things and is also prone to mis-configuration.
If we introduce a specialized field type rather than doing things by-convention we can tackle the metadata and analysis sides of this problem in a simple way that ensures validity.
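
The point about arrays can be illustrated with a small simulation, assuming the usual behaviour where a non-nested array of objects is flattened into independent multi-valued fields whose values are held sorted and de-duplicated for aggregation. The helper names and sample order are hypothetical, and the flattening here is a simplified model rather than Elasticsearch's actual indexing code.

```python
def flatten_object_array(field, items):
    """Model a non-nested array of objects as one value set per sub-field."""
    columns = {}
    for item in items:
        for key, value in item.items():
            columns.setdefault(f"{field}.{key}", set()).add(value)
    # Values end up sorted and de-duplicated, so array position is lost.
    return {name: sorted(values) for name, values in columns.items()}


def combined_tokens(items):
    """ID-plus-label tokens keep each SKU attached to its own name."""
    return sorted(f"[{item['sku']}] {item['name']}" for item in items)


# Hypothetical order containing two products.
order = [
    {"sku": "200", "name": "Anvil"},
    {"sku": "100", "name": "Zipper"},
]

flat = flatten_object_array("product", order)
tokens = combined_tokens(order)
print(flat)
print(tokens)
```

In the flattened form, pairing values by position would wrongly match SKU 100 with "Anvil", while the combined tokens carry the association inside each term.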

@elasticmachine elasticmachine added the Feature:Graph Graph application feature label Apr 24, 2018
@timroes timroes added the Feature:Visualizations Generic visualization features (in case no more specific feature label is available) label Aug 8, 2018
@timroes timroes added Team:Visualizations Visualization editors, elastic-charts and infrastructure and removed Feature:Visualizations Generic visualization features (in case no more specific feature label is available) labels Sep 16, 2018
@timroes timroes added Team:DataDiscovery Discover, search (e.g. data plugin and KQL), data views, saved searches. For ES|QL, use Team:ES|QL. and removed Team:Visualizations Visualization editors, elastic-charts and infrastructure labels Sep 3, 2021
@elasticmachine
Contributor Author

Pinging @elastic/kibana-data-discovery (Team:DataDiscovery)

@stratoula stratoula removed the Team:DataDiscovery Discover, search (e.g. data plugin and KQL), data views, saved searches. For ES|QL, use Team:ES|QL. label Nov 4, 2022
@stratoula stratoula added the Team:Visualizations Visualization editors, elastic-charts and infrastructure label Nov 4, 2022
@elasticmachine
Contributor Author

Pinging @elastic/kibana-visualizations @elastic/kibana-visualizations-external (Team:Visualizations)

@stratoula stratoula added the impact:low Addressing this issue will have a low level of impact on the quality/strength of our product. label Jun 2, 2023
@timductive
Member

Closing this because it's not planned to be resolved in the foreseeable future. It will be tracked in our Icebox and will be re-opened if our priorities change. Feel free to re-open if you think it should be melted sooner.

@timductive timductive closed this as not planned Won't fix, can't repro, duplicate, stale Mar 28, 2024