[FEATURE] Unsynchronize the deployment for ml-commons remote models in an OpenSearch domain
Is your feature request related to a problem?
#2970. We keep getting bug reports about models stuck in the deploying or partially_deployed status, among other issues.
Deploying remote models across the entire cluster and running regular sync jobs to update the deployment status on each node incurs significant overhead. This approach also breaks down in edge cases such as version upgrades, cluster scaling, and node changes.
What solution would you like?
Remote model deployment is quick (approximately 10 ms), so it requires neither pre-deployment nor maintaining a domain-level deployment status before use.
Remote models should be deployed locally on a node only when it receives a prediction request, and the deployment should be cached with a TTL. Additionally, the "Model Status" field for remote models should be removed or hidden, since there is no need to synchronize remote models across the entire cluster's memory.
When customers send a high volume of requests that reaches all nodes, each node caches the remote model on its first prediction request, minimizing latency. With lighter traffic, only a subset of nodes may receive requests and cache the model. This is acceptable, as the added latency of a cold invocation is small.
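For illustration, below is a minimal sketch of the proposed lazy-deploy-and-cache flow. The class, method names, and the 10-minute TTL are hypothetical placeholders, not the existing ml-commons API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: deploy a remote model on a node only when the first
// prediction request arrives, and treat the cached deployment as expired
// after a TTL. No cluster-wide deployment or status sync is involved.
public class LocalRemoteModelCache {

    private static final long TTL_MILLIS = 10 * 60 * 1000L; // assumed 10-minute TTL

    private record CachedModel(Object deployedModel, long deployedAtMillis) {}

    private final Map<String, CachedModel> cache = new ConcurrentHashMap<>();

    // Returns the locally deployed model, deploying it on demand if it is
    // missing or its TTL has expired.
    public Object getOrDeploy(String modelId) {
        CachedModel entry = cache.compute(modelId, (id, cached) -> {
            long now = System.currentTimeMillis();
            if (cached != null && now - cached.deployedAtMillis() < TTL_MILLIS) {
                return cached; // still fresh: reuse the local deployment
            }
            // Cold invocation: deploy locally (~10 ms for a remote model),
            // with no cluster-wide sync job or status update required.
            return new CachedModel(deployLocally(id), now);
        });
        return entry.deployedModel();
    }

    // Placeholder for the fast local setup of a remote model (e.g. building
    // the connector client); not the real ml-commons implementation.
    private Object deployLocally(String modelId) {
        return new Object();
    }
}
```

Because each node manages its own cache independently, there is no deploying or partially_deployed state to track at the domain level; a node either has a fresh local deployment or creates one on the next request.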
What alternatives have you considered?
Auto-deploy for remote models has already mitigated many model deployment issues, but it does not cover all edge cases. This proposal is an additional effort to further enhance our model deployment strategy.
Do you have any additional context?
#2050
#2376
#2382