-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Machine Learning APIs to 5.x #2856
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What a beast of a PR! So happy to see ML landing fully typed in the client! 🎂 🎈 🎆 💯
32 new API's is mind boggling!
All the wonderful xmldocs hopefully sets a new standard going forward.
Most of my review comments are pedantic but would love to see them addressed before pulling.
As an additional follow up, can you think of good candidates to introduce observable helpers for and create issues for these?
It'd be very cool to also support the streaming capabilities of ML at some point :)
@@ -0,0 +1,60 @@ | |||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why was this needed? Did we submit a PR back with these local changes? did this go beyond a simple .patch.json
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Raised issue with the ML team - https://github.com/elastic/x-pack-elasticsearch/issues/2515
The .patch.json uses an additive merge, so to do a full replacement I have to take an alternate approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The underlying problem will be fixed by elastic/x-pack-elasticsearch#2640. I will backport this to 5.6.3, but I guess you'll have to keep this patch in your 5.x client for the benefit of anyone using an older version of 5.x. However, once elastic/x-pack-elasticsearch#2640 is merged you should be able to remove this patch from your 6.x client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @droberts195. This will need to remain in 5.x client for older ES 5.x versions as you say, but can be removed in 6.x when forward ported.
{ | ||
public override IEnumerable<string> SkipQueryStringParams => new string[] | ||
{ | ||
// TODO: figure out if needed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pending TODO?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like they are part of the request body: https://www.elastic.co/guide/en/elasticsearch/reference/current/ml-post-data.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@russcam - I'm going to uncomment and include these in the SkipQueryStringParams
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@droberts195 Could you clarify how reset_start
and reset_end
would be set within the body, according to the documentation? Lookiing at the implementation, these can be part of the body, which can also be passed a data description.
Looking at the REST API spec, these are query string parameters.
Could you provide some guidance please? Our thinking is to serialize the data to the request body and allow reset_start
and reset_end
to be set as query string parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Our thinking is to serialize the data to the request body and allow
reset_start
andreset_end
to be set as query string parameters.
Yes, this is the only way to do it. With this endpoint the body is an arbitrary sequence of JSON documents that make up the data to be processed. The only limit is Elasticsearch's HTTP receive buffer size. (If we had to specify a content type it would be application/x-ndjson
, although actually the endpoint will tolerate any whitespace separating the JSON documents or even no separator at all, i.e. the opening {
of one document immediately after the closing }
of the previous one.) So there is no way to specify parameters in the body. If you try then they'll end up being considered part of the data. The closest Elasticsearch API is the bulk API, where multiple JSON documents are sent newline delimited in the same body (although in the bulk API they really do need to be newline delimited).
The documentation that says reset_start
and reset_end
are body parameters is wrong. I will fix that.
There's also the question of what would be useful to an end user with this API. For it to be efficient it needs to support sending many data documents per request. If only one document is sent in each request it's really inefficient. I'm not sure how you let users specify what to upload via the bulk API, but maybe use the same or similar mechanism here.
Looking at the implementation, these can be part of the body, which can also be passed a data description.
On the transport layer the request object carries around all the bits of information slightly differently to the REST layer, and has more flexibility. This is because ML datafeeds use the post data API as part of their implementation, but always do this via the internal node client because we know the datafeed is running inside the same cluster as the job. But I think the clients should only try to do what the REST API officially supports. So please don't try to offer an option to set a different data description as part of a single post data request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great info, thanks @droberts195. This API looks like a good candidate for an Observable helper implementation, similar to what we have for bulk, scroll, snapshot and restore (and for reindexing, before there was a reindex API)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
elastic/x-pack-elasticsearch#2689 is now open to fix the ML docs.
public Time Frequency { get; internal set; } | ||
|
||
///<summary>A list of index names to search within. Wildcards are supported</summary> | ||
[JsonProperty("indices")] // TODO: Check property name. Looks like indices in code, indexes in docs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pending TODO should be addressed before pulling this in
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davidkyle confirmed that indices
is correct and documentation has been updated: https://www.elastic.co/guide/en/elasticsearch/reference/current/ml-datafeed-resource.html#ml-datafeed-resource. Will remove TODO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed comment.
[JsonObject] | ||
public class MachineLearningFilter | ||
{ | ||
// TODO: Looks like type is only included when INCLUDE_TYPE_KEY set. Determine if this needs to be exposed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pending TODO should be addressed, if only to add a comment that explains the INCLUDE_TYPE_KEY
's purpose. Sounds like its useful for parsing? Should this always be set in 6.0 if so? Lots of questions :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@russcam ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd err on not exposing this right now. It's an internal implementation detail (/cc @droberts195 could you confirm?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ML filters are currently an undocumented feature of ML that hasn't been properly tested and might change before it's officially released. So it shouldn't be documented for the NEST client either.
Even when ML filters is eventually documented, INCLUDE_TYPE_KEY
shouldn't be present in any client code. You're correct that it's an internal implementation detail. It's a toXContent
parameter, i.e. a quirk of the Java code, not an API parameter.
(For interest, the reason it exists is that ML stores the filters in an Elasticsearch index, and when filters are serialised to this index they need an extra field so that we can distinguish them from other documents stored in the same index. But when they're returned via the ML REST endpoints we take out this extra field so the user sees just the filter they originally created. This is all implementation details though. The end user doesn't need to know how filters are stored.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@codebrain Based on this, I think we should remove the GetFilters
API altogether
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@russcam - Get/Put/Delete filters, all REST spec definitions and all associated tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deleted!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking we should pull them out into a separate branch; when this branch is merged, the commits will be squashed so whilst the commits will be reachable in the reflog for some time, I think it'll be easier to get at these if we park them in a separate branch
namespace Nest | ||
{ | ||
// TODO: Detector.java has detection rules, but these are not listed in the docs: | ||
// TODO: Investigate of these can be set wherever a detector can be passed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pending TODO sounds important to get a full handle on to see if we expose the ML domain properly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Asked ML team about these and @droberts195 recommended leaving them out of the client for now. Will remove TODO.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will remove TODOs
public override void Run(NodeConfiguration config, NodeFileSystem fileSystem) | ||
{ | ||
var to = Path.Combine(config.FileSystem.RoamingFolder, "server_metrics.tar.gz"); | ||
if (!File.Exists(to)){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new line {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
|
||
var directoryTarget = Path.Combine(config.FileSystem.RoamingFolder, "server_metrics"); | ||
Console.WriteLine("Checking for: " + directoryTarget); | ||
if (!Directory.Exists(directoryTarget)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
invert if to reduce nesting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -56,19 +55,19 @@ public NodeTaskRunner(NodeConfiguration nodeConfiguration) | |||
}; | |||
|
|||
public void Install(InstallationTaskBase[] additionalInstallationTasks)=> | |||
Itterate( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 /me slaps forehead and hides in shame
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All good :)
public class DetectorSerializationTests : SerializationTestBase | ||
{ | ||
[U] | ||
public void CanSerializeAndDeserializeAllDetectors() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 love this test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
return startDatafeedResponse; | ||
} | ||
|
||
protected void IndexSnapshot(IElasticClient client, string jobId, string snapshotId) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mind commenting these, there is a lot of assumed knowledge here for the uninitiated (me!) on why we need to manually insert these documents.
"url": { | ||
"parts": { | ||
"snapshot_id": { | ||
"required": true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened elastic/x-pack-elasticsearch#2641 so that you can avoid this patch. Like the other one, I guess you'll need to keep the patch for compatibility with 5.x versions prior to 5.6.3, but hopefully you can simplify things for the 6.x client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @droberts195. We'll be able to remove the patch when the 5.x client bumps to a Rest API spec version that includes https://github.com/elastic/x-pack-elasticsearch/pull/2641
/// <summary> | ||
/// Deletes expired data for Machine Learning. | ||
/// </summary> | ||
IDeleteExpiredDataResponse DeleteExpiredData(Func<DeleteExpiredDataDescriptor, IDeleteExpiredDataRequest> selector = null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@droberts195 I don't see this API as publicly documented; should we be exposing it in the client? Unlike ml filters whose implementation may change, this endpoint looks much simpler so it looks OK to me to include.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think it's fine to include this.
It gets called automatically once per day per cluster, and we didn't document it because that should be enough. But if people really want to call it more often it doesn't hurt.
public partial class FlushJobRequest {} | ||
|
||
[DescriptorFor("XpackMlFlushJob")] | ||
public partial class FlushJobDescriptor {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should have an IDescriptorOverrides
so that query parameters are specified in the body
/// If true, the results are sorted in descending order. | ||
/// </summary> | ||
[JsonProperty("desc")] | ||
bool? Desc { get; set; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Descending
double? IGetAnomalyRecordsRequest.RecordScore { get; set; } | ||
|
||
/// <inheritdoc /> | ||
public GetAnomalyRecordsDescriptor Desc(bool desc = true) => Assign(a => a.Desc = desc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Descending
DateTimeOffset? IGetModelSnapshotsRequest.Start { get; set; } | ||
|
||
/// <inheritdoc /> | ||
public GetModelSnapshotsDescriptor Descending(bool desc = true) => Assign(a => a.Descending = desc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
descending
69c4cf8
to
a929172
Compare
- Unzip ML integration test data tar archives using SharpZipLib. Add net45 compatible version to Tests - Derive XPackMachineLearningCluster from XPackCluster that seeds ML data and includes additional methods for testing ML APIs - Skip ML integration tests for versions < 5.4.0 - Introduce IntegrationTeardown method to close jobs that may be opened as part of tests - Increase concurrent ML job allocations and max open jobs when starting ML cluster - Explain why certain URL parts are not replaced during code generation Remove special treatment for task_id. This no longer seems to be needed. - Specify node startup timeout on ClusterBase starting up an ML cluster with Fiddler attached could take some time so bump the start timeout for it. - Increase the FAKE process tooling timeout
a929172
to
f395ffb
Compare
Commits have been squashed into one commit and rebased against 5.x. Merging now |
Will port to master now |
Additional API calls, unit test and integration tests for Machine Learning.