
[ML] adds new trained model alias API to simplify trained model updates and deployments #68922

Merged
benwtrent merged 15 commits into elastic:master from feature/ml-trained-model-alias on Feb 18, 2021

Conversation

benwtrent (Member)

A `model_alias` allows trained models to be referred to by a user-defined moniker.

This not only improves the readability and simplicity of numerous API calls, it also allows for simpler deployment and upgrade procedures for trained models.

Previously, if a model ID was referenced directly within an ingest pipeline and a new, better-performing model became available, the pipeline itself had to be updated. If the model was used in numerous pipelines, ALL of those pipelines had to be updated.

When a `model_alias` is used in an ingest pipeline, only the alias assignment needs to be updated. The underlying referenced model then changes in place for all ingest pipelines automatically.

An additional benefit is that the referenced model is not switched until the new model is fully loaded into cache, so throughput is not hampered while models change.
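
To make the workflow concrete, here is an illustrative sketch (the exact alias endpoint shape was still under discussion in review, so the alias URLs, pipeline name, and model IDs below are assumptions rather than the final API; the `inference` processor's `model_id` field is the existing ingest API):

PUT _ml/trained_models/regression_v1/model_aliases/my_regression

PUT _ingest/pipeline/my-pipeline
{
  "processors": [
    {
      "inference": {
        "model_id": "my_regression"
      }
    }
  ]
}

PUT _ml/trained_models/regression_v2/model_aliases/my_regression?reassign=true

After the reassignment, every pipeline referencing my_regression starts using regression_v2 once it is fully loaded into cache; none of the pipelines themselves change.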

@benwtrent added the >feature, :ml (Machine learning), and v8.0.0 labels Feb 11, 2021
@elasticmachine added the Team:ML (Meta label for the ML team) label Feb 11, 2021
elasticmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

davidkyle (Member) left a comment

Phew, that was a long review.

Code looks good. My main comment is about URLs. Apologies if this has already been discussed and I'm chiming in late, but I think the URLs are inconsistent: when you create an alias, ml/trained_models/model_aliases is used, but the only way to view the aliases is in the actual models via GET _ml/trained_models/{MODEL_ID}. One is an alias-centric URL; the other is model-centric.

I suggest:

POST _ml/trained_models/{MODEL_ID}/_model_alias
{
  "alias": "foo",
  "optional_old_model_id": "bar"
}

And possibly

DELETE _ml/trained_models/{MODEL_ID}/_model_alias
{
  "alias": "foo"
}

GET _ml/trained_models/{MODEL_ID}/_model_alias

And/or use query parameters for alias & optional_old_model_id
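
For instance, the query-parameter variants could look like this (a hypothetical rendering of the suggestion above, not an existing endpoint):

POST _ml/trained_models/{MODEL_ID}/_model_alias?alias=foo&optional_old_model_id=bar

DELETE _ml/trained_models/{MODEL_ID}/_model_alias?alias=foo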

@benwtrent requested a review from davidkyle February 16, 2021 16:04
benwtrent (Member, Author) commented Feb 16, 2021

Recent test failure:

[2021-02-16T16:06:26,302][TRACE][o.e.x.m.i.l.ModelLoadingService] [javaRestTest-0] adding new models via model_aliases and ids: {}
[2021-02-16T16:06:26,396][TRACE][o.e.x.m.i.l.ModelLoadingService] [javaRestTest-0] cluster state event changed referenced models: before [regression_second, regression_first] after [regression_second]
[2021-02-16T16:06:26,396][TRACE][o.e.x.m.i.l.ModelLoadingService] [javaRestTest-0] adding new models via model_aliases and ids: {}
[2021-02-16T16:06:26,455][INFO ][o.e.x.i.IndexLifecycleTransition] [javaRestTest-0] moving index [.ml-stats-000001] from [{"phase":"hot","action":"unfollow","name":"branch-check-unfollow-prerequisites"}] to [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}] in policy [ml-size-based-ilm-policy]
[2021-02-16T16:06:26,495][TRACE][o.e.x.m.i.l.ModelLoadingService] [javaRestTest-0] [a-perfect-regression-model] (model_alias [regression_second]) loaded from cache
[2021-02-16T16:06:26,524][TRACE][o.e.x.m.i.l.ModelLoadingService] [javaRestTest-0] adding new models via model_aliases and ids: {}
[2021-02-16T16:06:26,599][TRACE][o.e.x.m.i.l.ModelLoadingService] [javaRestTest-0] Persisting stats for evicted model [a-perfect-regression-model] (model_aliases [regression_second, regression_first])
[2021-02-16T16:06:26,607][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [javaRestTest-0] fatal error in thread [elasticsearch[javaRestTest-0][clusterApplierService#updateTask][T#1]], exiting
java.lang.AssertionError: null
	at org.elasticsearch.xpack.ml.inference.loadingservice.LocalModel.release(LocalModel.java:222) ~[?:?]
	at org.elasticsearch.xpack.ml.inference.loadingservice.ModelLoadingService.cacheEvictionListener(ModelLoadingService.java:494) ~[?:?]
	at org.elasticsearch.xpack.ml.inference.loadingservice.ModelLoadingService.lambda$new$1(ModelLoadingService.java:148) ~[?:?]
	at org.elasticsearch.common.cache.Cache.delete(Cache.java:788) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.cache.Cache.lambda$new$6(Cache.java:488) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.cache.Cache$CacheSegment.remove(Cache.java:272) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.cache.Cache.invalidate(Cache.java:505) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.xpack.ml.inference.loadingservice.ModelLoadingService.clusterChanged(ModelLoadingService.java:556) ~[?:?]
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListener(ClusterApplierService.java:509) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListeners(ClusterApplierService.java:499) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:467) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:407) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:151) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:669) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:241) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:204) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]
[2021-02-16T16:07:01.861625747Z] [BUILD] Stopping node

This change has introduced some race condition around cache eviction and model_alias changes. 🔍 🐛

EDIT: This has been corrected in later commits.

Comment on lines +410 to 421
if (referencedModels.contains(modelId)
    || Sets.haveNonEmptyIntersection(modelIdToModelAliases.getOrDefault(modelId, new HashSet<>()), referencedModels)
    || consumer.equals(Consumer.SEARCH)) {
    try {
        // The local model may already be in cache. If it is, we don't bother adding it to cache.
        // If it isn't, we flip an `isLoaded` flag, and increment the model counter to make sure if it is evicted
        // between now and when the listeners access it, the circuit breaker reflects actual usage.
        localModelCache.computeIfAbsent(modelId, modelAndConsumerLoader);
    } catch (ExecutionException ee) {
        logger.warn(() -> new ParameterizedMessage("[{}] threw when attempting add to cache", modelId), ee);
    }
    shouldNotAudit.remove(modelId);
benwtrent (Member, Author) left a comment

@davidkyle This was a bit tricky. It is possible that we don't have any listeners but still load something into cache. This usually happens when a new model is loaded because of an alias change rather than a new reference, and no callers are requesting that model in the meantime.

Also, I noticed in testing that we were unnecessarily evicting models that were accidentally cached twice (the second cache entry evicts the first). This caused weird logging and is avoided entirely by using the computeIfAbsent logic in the model cache.

I ran this a bunch locally and it all checks out OK. CI should, in time, agree.
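
The difference matters because of the eviction listener. A minimal standalone sketch of the idea, using a plain ConcurrentHashMap rather than the actual org.elasticsearch.common.cache.Cache API (class and field names here are hypothetical):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class ModelCacheSketch {
    private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
    private final AtomicInteger loads = new AtomicInteger();

    String getOrLoad(String modelId) {
        // computeIfAbsent loads at most once per key: if another path has already
        // cached the model, the existing entry is returned and nothing is evicted.
        // A plain put() here would replace (and, in an evicting cache, evict) the
        // earlier entry, firing the eviction listener for a model still in use.
        return cache.computeIfAbsent(modelId, id -> {
            loads.incrementAndGet(); // runs only when the model is not yet cached
            return "model-bytes-for-" + id;
        });
    }
}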

davidkyle (Member) left a comment

ModelLoadingService has good test coverage; CI will find any problems.

benwtrent (Member, Author)

@elasticmachine update branch

benwtrent (Member, Author)

run elasticsearch-ci/2

2 similar comments
benwtrent (Member, Author)

run elasticsearch-ci/2

benwtrent (Member, Author)

run elasticsearch-ci/2

davidkyle (Member) left a comment

LGTM

ExceptionsHelper.badRequestException(
    "cannot reassign model_alias [{}] to model [{}] "
        + "with inference config type [{}] from model [{}] with type [{}]",
    newModel.getModelId(),
davidkyle (Member) left a comment

Missing an arg: request.getModelAlias(). The message has five placeholders, but the model_alias value (which should be the first argument) is never passed.

davidkyle (Member) left a comment

Thanks for the changes, LGTM2

@benwtrent merged commit 26eef89 into elastic:master Feb 18, 2021
@benwtrent deleted the feature/ml-trained-model-alias branch February 18, 2021 14:41
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Feb 18, 2021
…es and deployments (elastic#68922)

benwtrent added a commit that referenced this pull request Feb 18, 2021
… updates and deployments (#68922) (#69208)

* [ML] adds new trained model alias API to simplify trained model updates and deployments (#68922)

Labels: >feature, :ml (Machine learning), Team:ML (Meta label for the ML team), v7.13.0, v8.0.0-alpha1