
[ML] adds new trained model alias API to simplify trained model updates and deployments #68922

Merged
benwtrent merged 15 commits into elastic:master from feature/ml-trained-model-alias on Feb 18, 2021

Conversation

benwtrent (Member)

A `model_alias` allows trained models to be referred to by a user-defined moniker.

This not only improves the readability and simplicity of numerous API calls, it also allows for simpler deployment and upgrade procedures for trained models.

Previously, if a model ID was referenced directly within an ingest pipeline and a new, better-performing model became available, the pipeline itself had to be updated. If the model was used in numerous pipelines, ALL of those pipelines had to be updated.

When a `model_alias` is used in an ingest pipeline, only the alias assignment needs to be updated. The underlying referenced model then changes in place for all ingest pipelines automatically.

An additional benefit is that the referenced model is not switched until the new model is fully loaded into cache, so throughput is not hampered while models change.
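
To make the workflow concrete, here is an illustrative sketch (the exact alias endpoint shape was still under discussion in review, so the alias URLs, pipeline name, and model IDs below are assumptions rather than the final API; the `inference` processor's `model_id` field is the existing ingest API):

PUT _ml/trained_models/regression_v1/model_aliases/my_regression

PUT _ingest/pipeline/my-pipeline
{
  "processors": [
    {
      "inference": {
        "model_id": "my_regression"
      }
    }
  ]
}

PUT _ml/trained_models/regression_v2/model_aliases/my_regression?reassign=true

After the reassignment, every pipeline referencing my_regression starts using regression_v2 once it is fully loaded into cache; none of the pipelines themselves change.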

@benwtrent added the >feature, :ml (Machine learning), and v8.0.0 labels Feb 11, 2021
@elasticmachine added the Team:ML (Meta label for the ML team) label Feb 11, 2021
elasticmachine (Collaborator)

Pinging @elastic/ml-core (Team:ML)

davidkyle (Member) left a comment

Phew, that was a long review.

Code looks good. My main comment is about URLs. Apologies if this has already been discussed and I'm chiming in late, but I think the URLs are inconsistent: when you create an alias, ml/trained_models/model_aliases is used, but the only way to view the aliases is in the actual models via GET _ml/trained_models/{MODEL_ID}. One is an alias-centric URL; the other is model-centric.

I suggest:

POST _ml/trained_models/{MODEL_ID}/_model_alias
{
  "alias": "foo",
  "optional_old_model_id": "bar"
}

And possibly

DELETE _ml/trained_models/{MODEL_ID}/_model_alias
{
  "alias": "foo"
}

GET _ml/trained_models/{MODEL_ID}/_model_alias

And/or use query parameters for alias & optional_old_model_id
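
For instance, the query-parameter variants could look like this (a hypothetical rendering of the suggestion above, not an existing endpoint):

POST _ml/trained_models/{MODEL_ID}/_model_alias?alias=foo&optional_old_model_id=bar

DELETE _ml/trained_models/{MODEL_ID}/_model_alias?alias=foo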

@benwtrent requested a review from davidkyle February 16, 2021 16:04
benwtrent (Member, Author) commented Feb 16, 2021

Recent test failure:

[2021-02-16T16:06:26,302][TRACE][o.e.x.m.i.l.ModelLoadingService] [javaRestTest-0] adding new models via model_aliases and ids: {}
[2021-02-16T16:06:26,396][TRACE][o.e.x.m.i.l.ModelLoadingService] [javaRestTest-0] cluster state event changed referenced models: before [regression_second, regression_first] after [regression_second]
[2021-02-16T16:06:26,396][TRACE][o.e.x.m.i.l.ModelLoadingService] [javaRestTest-0] adding new models via model_aliases and ids: {}
[2021-02-16T16:06:26,455][INFO ][o.e.x.i.IndexLifecycleTransition] [javaRestTest-0] moving index [.ml-stats-000001] from [{"phase":"hot","action":"unfollow","name":"branch-check-unfollow-prerequisites"}] to [{"phase":"hot","action":"rollover","name":"check-rollover-ready"}] in policy [ml-size-based-ilm-policy]
[2021-02-16T16:06:26,495][TRACE][o.e.x.m.i.l.ModelLoadingService] [javaRestTest-0] [a-perfect-regression-model] (model_alias [regression_second]) loaded from cache
[2021-02-16T16:06:26,524][TRACE][o.e.x.m.i.l.ModelLoadingService] [javaRestTest-0] adding new models via model_aliases and ids: {}
[2021-02-16T16:06:26,599][TRACE][o.e.x.m.i.l.ModelLoadingService] [javaRestTest-0] Persisting stats for evicted model [a-perfect-regression-model] (model_aliases [regression_second, regression_first])
[2021-02-16T16:06:26,607][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [javaRestTest-0] fatal error in thread [elasticsearch[javaRestTest-0][clusterApplierService#updateTask][T#1]], exiting
java.lang.AssertionError: null
	at org.elasticsearch.xpack.ml.inference.loadingservice.LocalModel.release(LocalModel.java:222) ~[?:?]
	at org.elasticsearch.xpack.ml.inference.loadingservice.ModelLoadingService.cacheEvictionListener(ModelLoadingService.java:494) ~[?:?]
	at org.elasticsearch.xpack.ml.inference.loadingservice.ModelLoadingService.lambda$new$1(ModelLoadingService.java:148) ~[?:?]
	at org.elasticsearch.common.cache.Cache.delete(Cache.java:788) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.cache.Cache.lambda$new$6(Cache.java:488) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.cache.Cache$CacheSegment.remove(Cache.java:272) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.cache.Cache.invalidate(Cache.java:505) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.xpack.ml.inference.loadingservice.ModelLoadingService.clusterChanged(ModelLoadingService.java:556) ~[?:?]
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListener(ClusterApplierService.java:509) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.cluster.service.ClusterApplierService.callClusterStateListeners(ClusterApplierService.java:499) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:467) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:407) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:151) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:669) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:241) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:204) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
	at java.lang.Thread.run(Thread.java:834) [?:?]
[2021-02-16T16:07:01.861625747Z] [BUILD] Stopping node

This change has introduced some race condition around cache eviction and model_alias changes. 🔍 🐛

EDIT: This has been corrected in later commits.

Comment on lines +410 to 421
if (referencedModels.contains(modelId)
    || Sets.haveNonEmptyIntersection(modelIdToModelAliases.getOrDefault(modelId, new HashSet<>()), referencedModels)
    || consumer.equals(Consumer.SEARCH)) {
    try {
        // The local model may already be in cache. If it is, we don't bother adding it to cache.
        // If it isn't, we flip an `isLoaded` flag, and increment the model counter to make sure if it is evicted
        // between now and when the listeners access it, the circuit breaker reflects actual usage.
        localModelCache.computeIfAbsent(modelId, modelAndConsumerLoader);
    } catch (ExecutionException ee) {
        logger.warn(() -> new ParameterizedMessage("[{}] threw when attempting add to cache", modelId), ee);
    }
    shouldNotAudit.remove(modelId);
benwtrent (Member, Author) left a comment

@davidkyle This was a bit tricky. It is possible that we don't have any listeners but still load something into cache. This usually happens when a new model is loaded because of an alias change rather than a new reference, and no callers are requesting that model in the meantime.

Also, I noticed in testing that we were unnecessarily evicting models that were accidentally cached twice (the second cache entry evicts the first). This caused weird logging and is avoided entirely by using the computeIfAbsent logic in the model cache.

I ran this a bunch locally and it all checks out OK. CI should, in time, agree.
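
The difference matters because of the eviction listener. A minimal standalone sketch of the idea, using a plain ConcurrentHashMap rather than the actual org.elasticsearch.common.cache.Cache API (class and field names here are hypothetical):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class ModelCacheSketch {
    private final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();
    private final AtomicInteger loads = new AtomicInteger();

    String getOrLoad(String modelId) {
        // computeIfAbsent loads at most once per key: if another path has already
        // cached the model, the existing entry is returned and nothing is evicted.
        // A plain put() here would replace (and, in an evicting cache, evict) the
        // earlier entry, firing the eviction listener for a model still in use.
        return cache.computeIfAbsent(modelId, id -> {
            loads.incrementAndGet(); // runs only when the model is not yet cached
            return "model-bytes-for-" + id;
        });
    }
}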

davidkyle (Member) left a comment

ModelLoadingService has good test coverage; CI will find any problems.

benwtrent (Member, Author)

@elasticmachine update branch

benwtrent (Member, Author)

run elasticsearch-ci/2

2 similar comments
benwtrent (Member, Author)

run elasticsearch-ci/2

benwtrent (Member, Author)

run elasticsearch-ci/2

davidkyle (Member) left a comment

LGTM

ExceptionsHelper.badRequestException(
    "cannot reassign model_alias [{}] to model [{}] "
        + "with inference config type [{}] from model [{}] with type [{}]",
    newModel.getModelId(),
davidkyle (Member) left a comment

Missing an arg: request.getModelAlias(). The message has five placeholders, but the model_alias value (which should be the first argument) is never passed.

davidkyle (Member) left a comment

Thanks for the changes, LGTM2

@benwtrent merged commit 26eef89 into elastic:master Feb 18, 2021
@benwtrent deleted the feature/ml-trained-model-alias branch February 18, 2021 14:41
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Feb 18, 2021
…es and deployments (elastic#68922)

benwtrent added a commit that referenced this pull request Feb 18, 2021
… updates and deployments (#68922) (#69208)

* [ML] adds new trained model alias API to simplify trained model updates and deployments (#68922)

Labels: >feature, :ml (Machine learning), Team:ML (Meta label for the ML team), v7.13.0, v8.0.0-alpha1