Add Machine Learning APIs to 5.x #2856

codebrain · 2017-09-15T05:31:58Z

Additional API calls, unit test and integration tests for Machine Learning.

Mpdreamz

What a beast of a PR! So happy to see ML landing fully typed in the client! 🎂 🎈 🎆 💯

32 new API's is mind boggling!

All the wonderful xmldocs hopefully sets a new standard going forward.

Most of my review comments are pedantic but would love to see them addressed before pulling.

As an additional follow up, can you think of good candidates to introduce observable helpers for and create issues for these?

It'd be very cool to also support the streaming capabilities of ML at some point :)

Mpdreamz · 2017-09-15T07:52:53Z

...ation/ApiGenerator/RestSpecification/XPack/MachineLearning/xpack.ml.get_buckets.replace.json

@@ -0,0 +1,60 @@
+{


Why was this needed? Did we submit a PR back with these local changes? did this go beyond a simple .patch.json?

Raised issue with the ML team - https://github.com/elastic/x-pack-elasticsearch/issues/2515

The .patch.json uses an additive merge, so to do a full replacement I have to take an alternate approach.

The underlying problem will be fixed by elastic/x-pack-elasticsearch#2640. I will backport this to 5.6.3, but I guess you'll have to keep this patch in your 5.x client for the benefit of anyone using an older version of 5.x. However, once elastic/x-pack-elasticsearch#2640 is merged you should be able to remove this patch from your 6.x client.

Thanks @droberts195. This will need to remain in 5.x client for older ES 5.x versions as you say, but can be removed in 6.x when forward ported.

Mpdreamz · 2017-09-15T07:53:24Z

src/CodeGeneration/ApiGenerator/Overrides/Descriptors/PostJobDataDescriptorOverrides.cs

+	{
+		public override IEnumerable<string> SkipQueryStringParams => new string[]
+		{
+			// TODO: figure out if needed


Pending TODO?

Looks like they are part of the request body: https://www.elastic.co/guide/en/elasticsearch/reference/current/ml-post-data.html

@russcam - I'm going to uncomment and include these in the SkipQueryStringParams

@droberts195 Could you clarify how reset_start and reset_end would be set within the body, according to the documentation? Lookiing at the implementation, these can be part of the body, which can also be passed a data description.

Looking at the REST API spec, these are query string parameters.

Could you provide some guidance please? Our thinking is to serialize the data to the request body and allow reset_start and reset_end to be set as query string parameters.

Our thinking is to serialize the data to the request body and allow reset_start and reset_end to be set as query string parameters.

Yes, this is the only way to do it. With this endpoint the body is an arbitrary sequence of JSON documents that make up the data to be processed. The only limit is Elasticsearch's HTTP receive buffer size. (If we had to specify a content type it would be application/x-ndjson, although actually the endpoint will tolerate any whitespace separating the JSON documents or even no separator at all, i.e. the opening { of one document immediately after the closing } of the previous one.) So there is no way to specify parameters in the body. If you try then they'll end up being considered part of the data. The closest Elasticsearch API is the bulk API, where multiple JSON documents are sent newline delimited in the same body (although in the bulk API they really do need to be newline delimited).

The documentation that says reset_start and reset_end are body parameters is wrong. I will fix that.

There's also the question of what would be useful to an end user with this API. For it to be efficient it needs to support sending many data documents per request. If only one document is sent in each request it's really inefficient. I'm not sure how you let users specify what to upload via the bulk API, but maybe use the same or similar mechanism here.

Looking at the implementation, these can be part of the body, which can also be passed a data description.

On the transport layer the request object carries around all the bits of information slightly differently to the REST layer, and has more flexibility. This is because ML datafeeds use the post data API as part of their implementation, but always do this via the internal node client because we know the datafeed is running inside the same cluster as the job. But I think the clients should only try to do what the REST API officially supports. So please don't try to offer an option to set a different data description as part of a single post data request.

Great info, thanks @droberts195. This API looks like a good candidate for an Observable helper implementation, similar to what we have for bulk, scroll, snapshot and restore (and for reindexing, before there was a reindex API)

elastic/x-pack-elasticsearch#2689 is now open to fix the ML docs.

Mpdreamz · 2017-09-15T07:54:10Z

src/Nest/XPack/MachineLearning/GetDatafeeds/GetDatafeedsResponse.cs

+		public Time Frequency { get; internal set; }
+
+		///<summary>A list of index names to search within. Wildcards are supported</summary>
+		[JsonProperty("indices")] // TODO: Check property name. Looks like indices in code, indexes in docs


Pending TODO should be addressed before pulling this in

@davidkyle confirmed that indices is correct and documentation has been updated: https://www.elastic.co/guide/en/elasticsearch/reference/current/ml-datafeed-resource.html#ml-datafeed-resource. Will remove TODO

Removed comment.

Mpdreamz · 2017-09-15T07:55:40Z

src/Nest/XPack/MachineLearning/Job/Config/MachineLearningFilter.cs

+	[JsonObject]
+	public class MachineLearningFilter
+	{
+		// TODO: Looks like type is only included when INCLUDE_TYPE_KEY set. Determine if this needs to be exposed


Pending TODO should be addressed, if only to add a comment that explains the INCLUDE_TYPE_KEY's purpose. Sounds like its useful for parsing? Should this always be set in 6.0 if so? Lots of questions :)

I'd err on not exposing this right now. It's an internal implementation detail (/cc @droberts195 could you confirm?)

ML filters are currently an undocumented feature of ML that hasn't been properly tested and might change before it's officially released. So it shouldn't be documented for the NEST client either.

Even when ML filters is eventually documented, INCLUDE_TYPE_KEY shouldn't be present in any client code. You're correct that it's an internal implementation detail. It's a toXContent parameter, i.e. a quirk of the Java code, not an API parameter.

(For interest, the reason it exists is that ML stores the filters in an Elasticsearch index, and when filters are serialised to this index they need an extra field so that we can distinguish them from other documents stored in the same index. But when they're returned via the ML REST endpoints we take out this extra field so the user sees just the filter they originally created. This is all implementation details though. The end user doesn't need to know how filters are stored.)

@codebrain Based on this, I think we should remove the GetFilters API altogether

@russcam - Get/Put/Delete filters, all REST spec definitions and all associated tests?

Thinking we should pull them out into a separate branch; when this branch is merged, the commits will be squashed so whilst the commits will be reachable in the reflog for some time, I think it'll be easier to get at these if we park them in a separate branch

Mpdreamz · 2017-09-15T07:56:37Z

src/Nest/XPack/MachineLearning/Job/Detectors/Detector.cs

+namespace Nest
+{
+	// TODO: Detector.java has detection rules, but these are not listed in the docs:
+	// TODO: Investigate of these can be set wherever a detector can be passed


Pending TODO sounds important to get a full handle on to see if we expose the ML domain properly.

Asked ML team about these and @droberts195 recommended leaving them out of the client for now. Will remove TODO.

Will remove TODOs

Mpdreamz · 2017-09-15T09:24:04Z

...anagedElasticsearch/Tasks/InstallationTasks/DownloadMachineLearningSampleDataDistribution.cs

+		public override void Run(NodeConfiguration config, NodeFileSystem fileSystem)
+		{
+			var to = Path.Combine(config.FileSystem.RoamingFolder, "server_metrics.tar.gz");
+			if (!File.Exists(to)){


new line {

Mpdreamz · 2017-09-15T09:24:40Z

...anagedElasticsearch/Tasks/InstallationTasks/DownloadMachineLearningSampleDataDistribution.cs

+
+			var directoryTarget = Path.Combine(config.FileSystem.RoamingFolder, "server_metrics");
+			Console.WriteLine("Checking for: " + directoryTarget);
+			if (!Directory.Exists(directoryTarget))


invert if to reduce nesting.

Mpdreamz · 2017-09-15T09:25:24Z

src/Tests/Framework/ManagedElasticsearch/Tasks/NodeTaskRunner.cs

@@ -56,19 +55,19 @@ public NodeTaskRunner(NodeConfiguration nodeConfiguration)
 		};

 		public void Install(InstallationTaskBase[] additionalInstallationTasks)=>
-			Itterate(


👍 /me slaps forehead and hides in shame

All good :)

Mpdreamz · 2017-09-15T09:28:49Z

src/Tests/XPack/MachineLearning/DetectorSerializationTests.cs

+	public class DetectorSerializationTests : SerializationTestBase
+	{
+		[U]
+		public void CanSerializeAndDeserializeAllDetectors()


👍 love this test

Mpdreamz · 2017-09-15T09:31:44Z

src/Tests/XPack/MachineLearning/MachineLearningIntegrationTestBase.cs

+			return startDatafeedResponse;
+		}
+
+		protected void IndexSnapshot(IElasticClient client, string jobId, string snapshotId)


Mind commenting these, there is a lot of assumed knowledge here for the uninitiated (me!) on why we need to manually insert these documents.

droberts195 · 2017-09-27T14:18:52Z

...iGenerator/RestSpecification/XPack/MachineLearning/xpack.ml.revert_model_snapshot.patch.json

+    "url": {
+      "parts": {
+        "snapshot_id": {
+          "required": true


I opened elastic/x-pack-elasticsearch#2641 so that you can avoid this patch. Like the other one, I guess you'll need to keep the patch for compatibility with 5.x versions prior to 5.6.3, but hopefully you can simplify things for the 6.x client.

Thanks @droberts195. We'll be able to remove the patch when the 5.x client bumps to a Rest API spec version that includes https://github.com/elastic/x-pack-elasticsearch/pull/2641

russcam · 2017-10-12T00:02:04Z

src/Nest/XPack/MachineLearning/DeleteExpiredData/ElasticClient-DeleteExpiredData.cs

+		/// <summary>
+		/// Deletes expired data for Machine Learning.
+		/// </summary>
+		IDeleteExpiredDataResponse DeleteExpiredData(Func<DeleteExpiredDataDescriptor, IDeleteExpiredDataRequest> selector = null);


@droberts195 I don't see this API as publicly documented; should we be exposing it in the client? Unlike ml filters whose implementation may change, this endpoint looks much simpler so it looks OK to me to include.

Yes, I think it's fine to include this.

It gets called automatically once per day per cluster, and we didn't document it because that should be enough. But if people really want to call it more often it doesn't hurt.

russcam · 2017-10-12T00:06:47Z

src/Nest/XPack/MachineLearning/FlushJob/FlushJobRequest.cs

+	public partial class FlushJobRequest {}
+
+	[DescriptorFor("XpackMlFlushJob")]
+	public partial class FlushJobDescriptor {}


Should have an IDescriptorOverrides so that query parameters are specified in the body

russcam · 2017-10-12T00:08:30Z

src/Nest/XPack/MachineLearning/GetAnomalyRecords/GetAnomalyRecordsRequest.cs

+		/// If true, the results are sorted in descending order.
+		/// </summary>
+		[JsonProperty("desc")]
+		bool? Desc { get; set; }


Descending

russcam · 2017-10-12T00:09:47Z

src/Nest/XPack/MachineLearning/GetAnomalyRecords/GetAnomalyRecordsRequest.cs

+		double? IGetAnomalyRecordsRequest.RecordScore { get; set; }
+
+		/// <inheritdoc />
+		public GetAnomalyRecordsDescriptor Desc(bool desc = true) => Assign(a => a.Desc = desc);


Descending

russcam · 2017-10-12T00:16:00Z

src/Nest/XPack/MachineLearning/GetModelSnapshots/GetModelSnapshotsRequest.cs

+		DateTimeOffset? IGetModelSnapshotsRequest.Start { get; set; }
+
+		/// <inheritdoc />
+		public GetModelSnapshotsDescriptor Descending(bool desc = true) => Assign(a => a.Descending = desc);


descending

- Unzip ML integration test data tar archives using SharpZipLib. Add net45 compatible version to Tests - Derive XPackMachineLearningCluster from XPackCluster that seeds ML data and includes additional methods for testing ML APIs - Skip ML integration tests for versions < 5.4.0 - Introduce IntegrationTeardown method to close jobs that may be opened as part of tests - Increase concurrent ML job allocations and max open jobs when starting ML cluster - Explain why certain URL parts are not replaced during code generation Remove special treatment for task_id. This no longer seems to be needed. - Specify node startup timeout on ClusterBase starting up an ML cluster with Fiddler attached could take some time so bump the start timeout for it. - Increase the FAKE process tooling timeout

russcam · 2017-10-13T06:11:16Z

Commits have been squashed into one commit and rebased against 5.x. Merging now

russcam · 2017-10-13T06:12:28Z

Will port to master now

codebrain added Feature labels Sep 15, 2017

codebrain requested review from russcam and Mpdreamz September 15, 2017 05:31

Mpdreamz requested changes Sep 15, 2017

View reviewed changes

droberts195 reviewed Sep 27, 2017

View reviewed changes

Mpdreamz mentioned this pull request Sep 28, 2017

6.0 GA readiness #2866

Closed

37 tasks

russcam reviewed Oct 12, 2017

View reviewed changes

russcam force-pushed the feature/5.x-ml-apis branch 3 times, most recently from 69c4cf8 to a929172 Compare October 13, 2017 04:55

russcam force-pushed the feature/5.x-ml-apis branch from a929172 to f395ffb Compare October 13, 2017 06:09

russcam merged commit f395ffb into 5.x Oct 13, 2017

Mpdreamz mentioned this pull request Oct 21, 2017

WIP Feature/elasticsearch 6 #2870

Merged

27 tasks

russcam added the v5.6.0 label Dec 22, 2017

Mpdreamz deleted the feature/5.x-ml-apis branch January 30, 2018 15:08

russcam restored the feature/5.x-ml-apis branch March 13, 2020 00:08

russcam deleted the feature/5.x-ml-apis branch March 13, 2020 00:16

Add Machine Learning APIs to 5.x #2856

Add Machine Learning APIs to 5.x #2856

Conversation

codebrain commented Sep 15, 2017

Mpdreamz left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

russcam commented Oct 13, 2017

russcam commented Oct 13, 2017