Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add FieldCapabilities (_field_caps) API #23007

Merged
merged 4 commits into from
Mar 31, 2017
Merged

Add FieldCapabilities (_field_caps) API #23007

merged 4 commits into from
Mar 31, 2017

Conversation

jimczi
Copy link
Contributor

@jimczi jimczi commented Feb 6, 2017

This change introduces a new API called _field_caps that allows to retrieve the capabilities of specific fields.
This field centric API relies solely on the mapping of the requested indices to extract the following infos:

  • types: one or many field types if the type is not the same across the requested indices
  • searchable: Whether this field is indexed for search on at least one requested indices.
  • aggregatable: Whether this field can be aggregated on at least one requested indices.

Example:

GET t,v/_field_caps?fields=field1,field2,field3

returns:

{
   "fields": {
      "field1": {
         "_all": {
            "types": "text",
            "searchable": true,
            "aggregatable": false
         }
      },
      "field3": {
         "_all": {
            "types": "text",
            "searchable": true,
            "aggregatable": false
         }
      },
      "field2": {
         "_all": {
            "types": [
               "keyword",
               "long"
            ],
            "searchable": true,
            "aggregatable": true
         }
      }
   }
}

In this example field1 and field3 have the same type text across the requested indices t and v.
Conversely field2 is defined with two conflicting types keyword and long.
Note that _field_caps does not treat this case as an error but rather return the list of unique types seen for this field.

It is also possible to get a view of each index field capabilities with the level parameter:

GET t,v/_field_caps?fields=field2&level=indices
{
   "fields": {
      "field2": {
         "t": {
            "types": "keyword",
            "searchable": true,
            "aggregatable": true
         },
         "v": {
            "types": "long",
            "searchable": true,
            "aggregatable": true
         }
      }
   }
}

Closes #22438 (comment)

@jimczi jimczi added :Data Management/Indices APIs APIs to create and manage indices and templates discuss >feature v6.0.0-alpha1 labels Feb 6, 2017
@clintongormley
Copy link
Contributor

Thanks @jimczi

I have a couple of questions:

  • Should types always be an array, otherwise clients have to check the datatype before using the value
  • I wonder if the group (eg _all or <index name>) should be at the top level instead of under the field name. It'd produce far fewer objects.

@spalger any thought about the above?

@jimczi
Copy link
Contributor Author

jimczi commented Feb 7, 2017

Should types always be an array, otherwise clients have to check the datatype before using the value

For consistency yes. I'll fix this.

I wonder if the group (eg _all or ) should be at the top level instead of under the field name. It'd produce far fewer objects.

I thought that putting fields on top would simplify the navigation. I can switch back to a map keyed by index, not sure that it'd produce far fewer objects. It all depends on the number of indices vs number of fields.

@clintongormley
Copy link
Contributor

not sure that it'd produce far fewer objects. It all depends on the number of indices vs number of fields.

The typical case is requesting _all, where you'll have just a single _all object instead of one per field.

But before you change this back, I'd like to hear from @spalger about which access pattern makes more sense to him

@spalger
Copy link
Contributor

spalger commented Feb 7, 2017

Looks nice @jimczi

In regards to the types array, Kibana will need to make subsequent requests to determine the indices that have each type so that it can describe the "type conflict" in a way that can help the user find and resolve the issue: "index t uses 'keyword' and index v is using 'long'". If the api already has this information available but just isn't sending it in the response, perhaps we could find a way to include it:

{
  "fields": {
    "field1": {
      "type": "string",
      "searchable": true,
      "aggregatable": true
    },
    "field2": {
      "type": "conflict",
      "type_indices": {
        "keyword": ["t"],
        "long": ["v"]
      },
      // should these still be true? Is a field that is both keyword and long really aggregatable?
      "searchable": true,
      "aggregatable": true
    }
  }
}

re _all: I feel like the index-level grouping is unnecessary unless you specify level=index. Could we do away with the _all level of the response?

@spalger
Copy link
Contributor

spalger commented Feb 7, 2017

Also, I'm pretty sure we've decided not to go this route in the past, because consistency, but I'd like to reiterate my desire for for lists of objects as arrays rather than maps, like this:

{
  "fields": [
    {
      "name": "field1",
      // ...
    }
  ]
}

@jimczi
Copy link
Contributor Author

jimczi commented Feb 8, 2017

should these still be true? Is a field that is both keyword and long really aggregatable?

I am on the fence regarding this. In most cases it's not but what if you use a script or whatever agg that can handle this ? For searching capability it's the same so I think it's more important to have good error messages when an agg or a query is conflicting rather than forbidding the capability completely.
Currently aggregatable means that at least one index has the ability to aggregate on this field, doc_values or field_data are available for the field. Same for searchable. Maybe we could just flip this logic and make a field aggregatable iff all indices are able to aggregate on it.

In regards to the types array, Kibana will need to make subsequent requests to determine the indices that have each type so that it can describe the "type conflict" in a way that can help the user find and resolve the issue: "index t uses 'keyword' and index v is using 'long'". If the api already has this information available but just isn't sending it in the response, perhaps we could find a way to include it:

The information is available so it could look like this:

{
   "fields": {
      "field1": {
         "string": {
            "searchable": true,
            "aggregatable": true
         }
      },
      "field2": {
         "keyword": {
            "searchable": true,
            "aggregatable": true,
            "indices": ["t"]
         },
         "long": {
            "searchable": true,
            "aggregatable": true,
            "indices": ["v"]
         }
      }
   }
}

Also, I'm pretty sure we've decided not to go this route in the past, because consistency, but I'd like to reiterate my desire for for lists of objects as arrays rather than maps, like this:

What about map with consistent order ;) ?

@clintongormley
Copy link
Contributor

What about map with consistent order ;) ?
No no no no no :)

Also, I'm pretty sure we've decided not to go this route in the past, because consistency, but I'd like to reiterate my desire for for lists of objects as arrays rather than maps, like this:

Yes consistency :) Can you explain why you would prefer an array?

@spalger
Copy link
Contributor

spalger commented Feb 9, 2017

Can you explain why you would prefer an array?

In this context it's mostly because the objects in the map are incomplete without the value they are keyed by, so our iteration logic of responses like this usually goes something like:

  1. for each key
    1. get the value
    2. assign the key to the value as "name", "id", "type", or whatever is appropriate
    3. use/pass around the updated value

More generally though, my primary motivator is that the meaning of the keys in these maps is not described by the text alone. Only by knowing the API can know really know what the keys actually are. In the search/aggregation requests/responses for instance, some of the keys are for the type of query/aggregation, some are field names, some are user-generated id's, it's quite ambiguous.

@spalger
Copy link
Contributor

spalger commented Feb 9, 2017

Maybe we could just flip this logic and make a field aggregatable iff all indices are able to aggregate on it.

That's my preference. The request is executed with a specific list of indices and I think the merged value should represent the aggregatable-ness of that specific list (not a subset of that list).

what if you use a script or whatever agg that can handle this?

A script agg isn't necessarily field dependent. In my opinion it's script dependent, and knowing whether a script is aggregatable is obviously a very different problem

In regards to the types array

The information is available so it could look like this:

However we present it, if we have the information I would prefer that it was included in the response

@clintongormley
Copy link
Contributor

That's my preference. The request is executed with a specific list of indices and I think the merged value should represent the aggregatable-ness of that specific list (not a subset of that list).

If the above comment refers to the general case where there is no field conflict, then:
No, this is a mistake. For instance, I have logs-2017-01 where field foo isn't aggregatable, and logs-2017-02 where it is. Now I can't perform aggregation on logs-* until logs-2017-01 has been deleted.

If the comment refers only to the conflicting case, then I'm on the fence. Searches may still work but aggregations are much trickier and much more likely to fail. I like the syntax that @jimczi suggests in #23007 (comment) as it gives you all the information you need. The only downside is that there is no explicit conflict field - instead you have to check for the existence of more than one key.

@spalger
Copy link
Contributor

spalger commented Feb 10, 2017

I was talking about the conflict case: if I asked for the field capabilities of foo in logs-2017-01 I should see "aggregatable": false, and if I ask for the field capabilities of foo in logs-* I think I should also see "aggregatable": false

Searches may still work but aggregations are much trickier and much more likely to fail

Kibana is just going to treat conflicting fields with multiple types as not-aggregatable, regardless of what the API says, but I don't think the API should say that fields are aggregatable or searchable when that's the meaning behind it.

@jimczi
Copy link
Contributor Author

jimczi commented Feb 10, 2017

Kibana is just going to treat conflicting fields with multiple types as not-aggregatable, regardless of what the API says, but I don't think the API should say that fields are aggregatable or searchable when that's the meaning behind it.

I think this is problematic since multiple types are not necessary not-aggregatable. This is the problem with field_stats which arbitrary says that a long and a double are not compatible. We don't have such information globally, some aggs mays work and others don't. The type is just a name, for instance murmur3 is a custom type but after all it's implemented as a long so it should be aggregatable with any other number types. Though the field_caps does not have this information and as I said it would be arbitrary to say that this field is not aggregatable just because another index has the same field name with a different type.

I was talking about the conflict case: if I asked for the field capabilities of foo in logs-2017-01 I should see "aggregatable": false, and if I ask for the field capabilities of foo in logs-* I think I should also see "aggregatable": false

Ok I can flip the logic to do that. For searchable I think we can keep the current behavior since it should be possible to search on all indices even though some indices do not index the field.

@clintongormley
Copy link
Contributor

@spalger after some discussion in FixItFriday today, we've decided to go with the format suggested by @jimczi in #23007 (comment) as it is the most usable way we can present the required info.

For aggregatable, it's a bit more complex. In the typical case, (a field exists in one index and not in the other), we would mark the field as aggregatable, as you can aggregate any data that exists.

The question arises when you have a field that has doc_values: true (the default) in one index and false in the other index. If a user aggregates over these two indices then you'll get back results from the one index, but shard failures from the other. If you were to search JUST the second index, then you'd get an exception instead of shard failures.

So the question is: should this field be marked as aggregatable or not?

@clintongormley
Copy link
Contributor

@sophiec20 Pinging you for input on this API as your team will be using it too

@sophiec20
Copy link
Contributor

@dimitris-athanasiou please review.

@spalger
Copy link
Contributor

spalger commented Feb 14, 2017

The question arises when you have a field that has doc_values: true (the default) in one index and false in the other index. If a user aggregates over these two indices then you'll get back results from the one index, but shard failures from the other. If you were to search JUST the second index, then you'd get an exception instead of shard failures.

So the question is: should this field be marked as aggregatable or not?

I'm guessing that we all agree that the API should be correct as possible, but we don't seem to agree if we should treat maybe as true or false...

I'm still leaning toward false with as much supporting info as reasonable, because I feel like this is the ideal workflow for Kibana:

  • admin-type user adds the index pattern to kibana
  • fields can not be used in aggregations unless they are always aggregatable
  • non-admin-type user can consume all of the fields that are safe to use, that we know will always work and that they don't need to test over a bunch of different time-ranges to be confident about "publishing" (assuming time correlates to the indexes queried)
  • admin-type can check the index pattern page to see why certain fields are not aggregatable:
    • "Field X is not aggregatable because it is set to index: false in indexes Y and Z"
    • "Field A is not searchable because it is of type long in 6 indexes but type string elsewhere"

@clintongormley
Copy link
Contributor

I'm still leaning toward false

OK, let's do that

@dimitris-athanasiou
Copy link
Contributor

Looks good from ML side of things.

@spalger
Copy link
Contributor

spalger commented Feb 14, 2017

@jimczi do you think it would be reasonable to include details like index: false to provide the context for why aggregatable is false?

@jimczi
Copy link
Contributor Author

jimczi commented Feb 14, 2017

@jimczi do you think it would be reasonable to include details like index: false to provide the context for why aggregatable is false?

IMO no, the way we decide if the field is aggregatable or searchable depends on the field type. Some fields are searchable even though they have index:false for instance. I think it is enough if we're able to return which index is causing this value to be false.
Starting from the last proposal I don't have anything better to propose than:

{
   "fields": {
      "field1": {
         "string": {
            "searchable": true,
            "aggregatable": true
         }
      },
      "field2": {
         "keyword": {
            "searchable": false,
            "aggregatable": true,
            "non_searchable_indices": ["t"]
            "indices": ["t", "s"]
         },
         "long": {
            "searchable": true,
            "aggregatable": false,
            "non_aggregatable_indices": ["v"]
            "indices": ["v", "w"]
         }
      }
   }
}

... when aggregatable is false we add an entry listing the non-aggregatable indices and same with non-searchable indices when searchable is false.
@spalger WDYT ?

@spalger
Copy link
Contributor

spalger commented Feb 14, 2017

@jimczi this looks great, do you think the response is summarized correctly below?

  • field1
    • string
    • searchable
    • aggregatable
  • field2
    • type conflict, "keyword" in "t" and "s", but "long" in "v" and "v"
    • not searchable because of the type conflict and field configuration in index "t"
    • not aggregatable because of the type conflict and field configuration in index "v"

@jimczi
Copy link
Contributor Author

jimczi commented Feb 14, 2017

not searchable because of the type conflict and field configuration in index "t"

Only because of the field configuration in index "t". We have searchable and aggregatable for each type so the conflict is not taken into account. For Kibana you can consider every field with multiple types as conflicting but that should not be reflected in fieldcaps IMO

@spalger
Copy link
Contributor

spalger commented Feb 14, 2017

Right, I'm suggesting that as how we could represent it to users in Kibana. Thanks!

@Bargs
Copy link

Bargs commented Feb 14, 2017

Another "capability" that we currently look at the mappings for is "sortable". Is aggregatable == sortable, or would these two attributes ever diverge in the future? If so would it be possible to also add a sortable attribute to this API response?

@spalger
Copy link
Contributor

spalger commented Feb 14, 2017

Oops, good catch @Bargs

This change introduces a new API called `_field_caps` that allows to retrieve the capabilities of specific fields.
This field centric API relies solely on the mapping of the requested indices to extract the following infos:
* `types`: one or many field types if the type is not the same across the requested indices
* `searchable`: Whether this field is indexed for search on at least one requested indices.
* `aggregatable`:  Whether this field can be aggregated on at least one requested indices.

Example:

````
GET t,v/_field_caps?fields=field1,field2,field3
````

returns:

````
{
   "fields": {
      "field1": {
         "_all": {
            "types": "text",
            "searchable": true,
            "aggregatable": false
         }
      },
      "field3": {
         "_all": {
            "types": "text",
            "searchable": true,
            "aggregatable": false
         }
      },
      "field2": {
         "_all": {
            "types": [
               "keyword",
               "long"
            ],
            "searchable": true,
            "aggregatable": true
         }
      }
   }
}
````

In this example `field1` and `field3` have the same type `text` across the requested indices `t` and `v`.
Conversely `field2` is defined with two conflicting types `keyword` and `long`.
Note that `_field_caps` does not treat this case as an error but rather return the list of unique types seen for this field.

It is also possible to get a view of each index field capabilities with the `level` parameter:

````
GET t,v/_field_caps?fields=field2&level=indices
````

````
{
   "fields": {
      "field2": {
         "t": {
            "types": "keyword",
            "searchable": true,
            "aggregatable": true
         },
         "v": {
            "types": "long",
            "searchable": true,
            "aggregatable": true
         }
      }
   }
}
````
@jimczi
Copy link
Contributor Author

jimczi commented Mar 29, 2017

I pushed another iteration that implements the format described in #23007 (comment).
@jpountz can you take a look ?

@jimczi jimczi requested a review from jpountz March 29, 2017 07:50
Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks great and clean. I left some questions about places where we use arrays that are logical sets, so maybe we should use sets directly? Also the documentation states clearly what null values mean for the indices arrays, but I think it would help to reiterate it in the code to have this information in context.


private final String[] indices;
private final String[] nonSearchableIndices;
private final String[] nonAggregatableIndices;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should those be sets?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The builder uses sets that are transformed in arrays when the final object is built.

this.isAggregatable = isAggregatable;
this.indices = indices;
this.nonSearchableIndices = nonSearchableIndices;
this.nonAggregatableIndices = nonAggregatableIndices;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see from the serialization code that thiese arrays can be null sometimes, is there any validation we should do, eg. I suspect that if one array is not null then other arrays should not be null either?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not necessarily. indices is null if all indices have the same type for the field, nonSearchableIndices is null only if all indices are either searchable or non-searchable and nonAggregatableIndices is null if all indices are either aggregatable or non-aggregatable.

}

/**
* The types of the field.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/types/type/

FieldCapabilitiesIndexRequest(FieldCapabilitiesRequest request, String index) {
super(index);
Set<String> fields = new HashSet<>();
fields.addAll(Arrays.asList(request.fields()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's pass the list as a constructor arg of the HashSet?

public class FieldCapabilitiesIndexRequest
extends SingleShardRequest<FieldCapabilitiesIndexRequest> {

private String[] fields;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we store it as a set?

this.responseMap = responseMap;
}

FieldCapabilitiesResponse() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you leave a comment that this ctor should only be used for serialization?

String[] concreteIndices =
indexNameExpressionResolver.concreteIndexNames(clusterState, request);
final AtomicInteger indexCounter = new AtomicInteger();
final AtomicInteger completionCounter = new AtomicInteger(concreteIndices.length);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we just use a single atomic int for both purposes? eg. we could consider things are done when indexCounter.getAndIncrement returns concreteIndices.length-1?

for (int i = 0; i < indexResponses.length(); i++) {
Object element = indexResponses.get(i);
if (element instanceof FieldCapabilitiesIndexResponse == false) {
continue;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you assert it is an exception?


@Override
protected FieldCapabilitiesIndexResponse
shardOperation(final FieldCapabilitiesIndexRequest request,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weird to go to a new line before the method name?

shardOperation(final FieldCapabilitiesIndexRequest request,
ShardId shardId) {
MapperService mapperService =
indicesService.indexService(shardId.getIndex()).mapperService();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should it use indexServiceSafe?

@jimczi jimczi merged commit a8250b2 into elastic:master Mar 31, 2017
@jimczi jimczi deleted the field_caps branch March 31, 2017 13:34
@jimczi
Copy link
Contributor Author

jimczi commented Mar 31, 2017

Thanks @jpountz
I am back porting this to 5.x (5.4) now.

If so would it be possible to also add a sortable attribute to this API response?

@Bargs @spalger if the field is aggregatable then sorting should work. I don't think we need another attribute.
The documentation is here:
https://github.com/elastic/elasticsearch/blob/master/docs/reference/search/field-caps.asciidoc

@clintongormley next step is to deprecate the FieldStats API in 5.x and remove it in 6?

@clintongormley
Copy link
Contributor

@jimczi deprecate it, but let's not remove it from 6.0 just yet

jimczi added a commit that referenced this pull request Mar 31, 2017
Cherry picked from a8250b2

This change introduces a new API called `_field_caps` that allows to retrieve the capabilities of specific fields.

Example:

````
GET t,s,v,w/_field_caps?fields=field1,field2
````
... returns:
````
{
   "fields": {
      "field1": {
         "string": {
            "searchable": true,
            "aggregatable": true
         }
      },
      "field2": {
         "keyword": {
            "searchable": false,
            "aggregatable": true,
            "non_searchable_indices": ["t"]
            "indices": ["t", "s"]
         },
         "long": {
            "searchable": true,
            "aggregatable": false,
            "non_aggregatable_indices": ["v"]
            "indices": ["v", "w"]
         }
      }
   }
}
````

In this example `field1` have the same type `text` across the requested indices `t`, `s`, `v`, `w`.
Conversely `field2` is defined with two conflicting types `keyword` and `long`.
Note that `_field_caps` does not treat this case as an error but rather return the list of unique types seen for this field.
@spalger
Copy link
Contributor

spalger commented Mar 31, 2017

🎉 🎉 💃 🎉 🎉

@karmi
Copy link
Contributor

karmi commented Apr 6, 2017

I like the API feature-wise, but can we reconsider the name? What does "caps" in "field_caps" stands for, "capabilities"? I understand the need to make the name short, but this is rather opaque?

If the API will replace the "Field Stats" API, what if we just use _fields?

}
},
"params": {
"fields": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jimczi, I think this parameter should be set as required? Getting illegal_argument_exception, "specified fields can't be null or empty" without it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not required because you can use the body of the request to set the fields to retrieve.
See the body section at the end of the spec.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's right, but the question then is how to document this in the spec, when either the fields attribute or the body is required? /cc @clintongormley

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's OK - elasticsearch will complain

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Data Management/Indices APIs APIs to create and manage indices and templates >feature v5.4.0 v6.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants