Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Machine Learning APIs to 5.x #2856

Merged
merged 1 commit into from
Oct 13, 2017
Merged

Add Machine Learning APIs to 5.x #2856

merged 1 commit into from
Oct 13, 2017

Conversation

codebrain
Copy link
Contributor

Additional API calls, unit test and integration tests for Machine Learning.

Copy link
Member

@Mpdreamz Mpdreamz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What a beast of a PR! So happy to see ML landing fully typed in the client! 🎂 🎈 🎆 💯

32 new API's is mind boggling!

All the wonderful xmldocs hopefully sets a new standard going forward.

Most of my review comments are pedantic but would love to see them addressed before pulling.

As an additional follow up, can you think of good candidates to introduce observable helpers for and create issues for these?

It'd be very cool to also support the streaming capabilities of ML at some point :)

@@ -0,0 +1,60 @@
{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why was this needed? Did we submit a PR back with these local changes? did this go beyond a simple .patch.json?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raised issue with the ML team - https://github.com/elastic/x-pack-elasticsearch/issues/2515

The .patch.json uses an additive merge, so to do a full replacement I have to take an alternate approach.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The underlying problem will be fixed by elastic/x-pack-elasticsearch#2640. I will backport this to 5.6.3, but I guess you'll have to keep this patch in your 5.x client for the benefit of anyone using an older version of 5.x. However, once elastic/x-pack-elasticsearch#2640 is merged you should be able to remove this patch from your 6.x client.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @droberts195. This will need to remain in 5.x client for older ES 5.x versions as you say, but can be removed in 6.x when forward ported.

{
public override IEnumerable<string> SkipQueryStringParams => new string[]
{
// TODO: figure out if needed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pending TODO?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@russcam - I'm going to uncomment and include these in the SkipQueryStringParams

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@droberts195 Could you clarify how reset_start and reset_end would be set within the body, according to the documentation? Lookiing at the implementation, these can be part of the body, which can also be passed a data description.

Looking at the REST API spec, these are query string parameters.

Could you provide some guidance please? Our thinking is to serialize the data to the request body and allow reset_start and reset_end to be set as query string parameters.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our thinking is to serialize the data to the request body and allow reset_start and reset_end to be set as query string parameters.

Yes, this is the only way to do it. With this endpoint the body is an arbitrary sequence of JSON documents that make up the data to be processed. The only limit is Elasticsearch's HTTP receive buffer size. (If we had to specify a content type it would be application/x-ndjson, although actually the endpoint will tolerate any whitespace separating the JSON documents or even no separator at all, i.e. the opening { of one document immediately after the closing } of the previous one.) So there is no way to specify parameters in the body. If you try then they'll end up being considered part of the data. The closest Elasticsearch API is the bulk API, where multiple JSON documents are sent newline delimited in the same body (although in the bulk API they really do need to be newline delimited).

The documentation that says reset_start and reset_end are body parameters is wrong. I will fix that.

There's also the question of what would be useful to an end user with this API. For it to be efficient it needs to support sending many data documents per request. If only one document is sent in each request it's really inefficient. I'm not sure how you let users specify what to upload via the bulk API, but maybe use the same or similar mechanism here.

Looking at the implementation, these can be part of the body, which can also be passed a data description.

On the transport layer the request object carries around all the bits of information slightly differently to the REST layer, and has more flexibility. This is because ML datafeeds use the post data API as part of their implementation, but always do this via the internal node client because we know the datafeed is running inside the same cluster as the job. But I think the clients should only try to do what the REST API officially supports. So please don't try to offer an option to set a different data description as part of a single post data request.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great info, thanks @droberts195. This API looks like a good candidate for an Observable helper implementation, similar to what we have for bulk, scroll, snapshot and restore (and for reindexing, before there was a reindex API)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

elastic/x-pack-elasticsearch#2689 is now open to fix the ML docs.

public Time Frequency { get; internal set; }

///<summary>A list of index names to search within. Wildcards are supported</summary>
[JsonProperty("indices")] // TODO: Check property name. Looks like indices in code, indexes in docs
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pending TODO should be addressed before pulling this in

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@davidkyle confirmed that indices is correct and documentation has been updated: https://www.elastic.co/guide/en/elasticsearch/reference/current/ml-datafeed-resource.html#ml-datafeed-resource. Will remove TODO

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed comment.

[JsonObject]
public class MachineLearningFilter
{
// TODO: Looks like type is only included when INCLUDE_TYPE_KEY set. Determine if this needs to be exposed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pending TODO should be addressed, if only to add a comment that explains the INCLUDE_TYPE_KEY's purpose. Sounds like its useful for parsing? Should this always be set in 6.0 if so? Lots of questions :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd err on not exposing this right now. It's an internal implementation detail (/cc @droberts195 could you confirm?)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ML filters are currently an undocumented feature of ML that hasn't been properly tested and might change before it's officially released. So it shouldn't be documented for the NEST client either.

Even when ML filters is eventually documented, INCLUDE_TYPE_KEY shouldn't be present in any client code. You're correct that it's an internal implementation detail. It's a toXContent parameter, i.e. a quirk of the Java code, not an API parameter.

(For interest, the reason it exists is that ML stores the filters in an Elasticsearch index, and when filters are serialised to this index they need an extra field so that we can distinguish them from other documents stored in the same index. But when they're returned via the ML REST endpoints we take out this extra field so the user sees just the filter they originally created. This is all implementation details though. The end user doesn't need to know how filters are stored.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@codebrain Based on this, I think we should remove the GetFilters API altogether

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@russcam - Get/Put/Delete filters, all REST spec definitions and all associated tests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking we should pull them out into a separate branch; when this branch is merged, the commits will be squashed so whilst the commits will be reachable in the reflog for some time, I think it'll be easier to get at these if we park them in a separate branch

namespace Nest
{
// TODO: Detector.java has detection rules, but these are not listed in the docs:
// TODO: Investigate of these can be set wherever a detector can be passed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pending TODO sounds important to get a full handle on to see if we expose the ML domain properly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asked ML team about these and @droberts195 recommended leaving them out of the client for now. Will remove TODO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove TODOs

public override void Run(NodeConfiguration config, NodeFileSystem fileSystem)
{
var to = Path.Combine(config.FileSystem.RoamingFolder, "server_metrics.tar.gz");
if (!File.Exists(to)){
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new line {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


var directoryTarget = Path.Combine(config.FileSystem.RoamingFolder, "server_metrics");
Console.WriteLine("Checking for: " + directoryTarget);
if (!Directory.Exists(directoryTarget))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

invert if to reduce nesting.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -56,19 +55,19 @@ public NodeTaskRunner(NodeConfiguration nodeConfiguration)
};

public void Install(InstallationTaskBase[] additionalInstallationTasks)=>
Itterate(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 /me slaps forehead and hides in shame

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good :)

public class DetectorSerializationTests : SerializationTestBase
{
[U]
public void CanSerializeAndDeserializeAllDetectors()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 love this test

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

return startDatafeedResponse;
}

protected void IndexSnapshot(IElasticClient client, string jobId, string snapshotId)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mind commenting these, there is a lot of assumed knowledge here for the uninitiated (me!) on why we need to manually insert these documents.

"url": {
"parts": {
"snapshot_id": {
"required": true

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened elastic/x-pack-elasticsearch#2641 so that you can avoid this patch. Like the other one, I guess you'll need to keep the patch for compatibility with 5.x versions prior to 5.6.3, but hopefully you can simplify things for the 6.x client.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @droberts195. We'll be able to remove the patch when the 5.x client bumps to a Rest API spec version that includes https://github.com/elastic/x-pack-elasticsearch/pull/2641

@Mpdreamz Mpdreamz mentioned this pull request Sep 28, 2017
37 tasks
/// <summary>
/// Deletes expired data for Machine Learning.
/// </summary>
IDeleteExpiredDataResponse DeleteExpiredData(Func<DeleteExpiredDataDescriptor, IDeleteExpiredDataRequest> selector = null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@droberts195 I don't see this API as publicly documented; should we be exposing it in the client? Unlike ml filters whose implementation may change, this endpoint looks much simpler so it looks OK to me to include.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think it's fine to include this.

It gets called automatically once per day per cluster, and we didn't document it because that should be enough. But if people really want to call it more often it doesn't hurt.

public partial class FlushJobRequest {}

[DescriptorFor("XpackMlFlushJob")]
public partial class FlushJobDescriptor {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should have an IDescriptorOverrides so that query parameters are specified in the body

/// If true, the results are sorted in descending order.
/// </summary>
[JsonProperty("desc")]
bool? Desc { get; set; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Descending

double? IGetAnomalyRecordsRequest.RecordScore { get; set; }

/// <inheritdoc />
public GetAnomalyRecordsDescriptor Desc(bool desc = true) => Assign(a => a.Desc = desc);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Descending

DateTimeOffset? IGetModelSnapshotsRequest.Start { get; set; }

/// <inheritdoc />
public GetModelSnapshotsDescriptor Descending(bool desc = true) => Assign(a => a.Descending = desc);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

descending

@russcam russcam force-pushed the feature/5.x-ml-apis branch 3 times, most recently from 69c4cf8 to a929172 Compare October 13, 2017 04:55
- Unzip ML integration test data tar archives using SharpZipLib. Add net45 compatible version to Tests
- Derive XPackMachineLearningCluster from XPackCluster that seeds ML data and includes additional methods for testing ML APIs
- Skip ML integration tests for versions < 5.4.0
- Introduce IntegrationTeardown method to close jobs that may be opened as part of tests
- Increase concurrent ML job allocations and max open jobs when starting ML cluster
- Explain why certain URL parts are not replaced during code generation
    Remove special treatment for task_id. This no longer seems to be needed.

- Specify node startup timeout on ClusterBase
    starting up an ML cluster with Fiddler attached could take some time so bump the start timeout for it.
- Increase the FAKE process tooling timeout
@russcam russcam force-pushed the feature/5.x-ml-apis branch from a929172 to f395ffb Compare October 13, 2017 06:09
@russcam
Copy link
Contributor

russcam commented Oct 13, 2017

Commits have been squashed into one commit and rebased against 5.x. Merging now

@russcam russcam merged commit f395ffb into 5.x Oct 13, 2017
@russcam
Copy link
Contributor

russcam commented Oct 13, 2017

Will port to master now

@Mpdreamz Mpdreamz mentioned this pull request Oct 21, 2017
27 tasks
@russcam russcam added the v5.6.0 label Dec 22, 2017
@Mpdreamz Mpdreamz deleted the feature/5.x-ml-apis branch January 30, 2018 15:08
@russcam russcam restored the feature/5.x-ml-apis branch March 13, 2020 00:08
@russcam russcam deleted the feature/5.x-ml-apis branch March 13, 2020 00:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants