[FEATURE] Unsynchronize the deployment for ml-commons remote models in an OpenSearch domain #3222

Open
Zhangxunmt opened this issue Nov 15, 2024 · 0 comments
Assignees
Labels
enhancement New feature or request

Comments

@Zhangxunmt
Collaborator

Zhangxunmt commented Nov 15, 2024

Is your feature request related to a problem?
#2970. We keep getting bug reports about models stuck in statuses such as deploying or partially_deployed.

For remote models, deploying them across the entire cluster and running regular sync jobs to update the deployment status on each node incurs significant overhead. This approach also causes issues in edge cases such as version upgrades, cluster scaling, and node changes.

What solution would you like?
Remote model deployment is quick (approximately 10 ms), so it requires neither pre-deployment nor a domain-level deployment status that must be maintained before the model is used.

Remote models should be deployed locally on a specific node only upon receiving a prediction request and cached with a TTL. Additionally, the "Model Status" field for remote models should be removed or hidden, as there is no need to synchronize remote models across the entire cluster's memory.
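To make the proposal concrete, here is a minimal sketch of a node-local cache that deploys a remote model on the first prediction request and lets the entry expire after a TTL. It is written in Python for brevity and is not ml-commons code: `NodeLocalModelCache`, `predict`, the `deploy` callback, and the choice to count the TTL from deployment time (rather than from last access) are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Tuple

# Hypothetical type: a deployed remote model is just a callable client here.
ModelClient = Callable[[str], str]


class NodeLocalModelCache:
    """Hypothetical per-node cache: deploy on first use, expire after a TTL."""

    def __init__(
        self,
        deploy: Callable[[str], ModelClient],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._deploy = deploy      # assumed cheap (about 10 ms per the issue)
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[ModelClient, float]] = {}

    def get(self, model_id: str) -> ModelClient:
        now = self._clock()
        entry = self._entries.get(model_id)
        if entry is not None and entry[1] > now:
            return entry[0]        # warm path: no deployment needed
        # Cold path: deploy locally on this node only, then cache with a TTL.
        client = self._deploy(model_id)
        self._entries[model_id] = (client, now + self._ttl)
        return client


def predict(cache: NodeLocalModelCache, model_id: str, payload: str) -> str:
    """Serve a prediction without consulting any cluster-wide model status."""
    return cache.get(model_id)(payload)
```

Nothing in this sketch reads or writes a cluster-level status, which is the point of the proposal: each node decides locally whether it needs to deploy. A real implementation would also need thread safety and eviction of expired entries, both of which are omitted here.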

When customers send a high volume of requests that reaches all nodes, each node caches the remote model on its first prediction request, so only that first request per node pays the deployment latency. With lower traffic, only a subset of nodes may receive requests and cache the model. This is reasonable, as the added latency for cold invocations is acceptable.
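The traffic argument can be illustrated with a small simulation. This is a sketch under simplifying assumptions: node names are made up, routing is given as a plain list of node IDs, and TTL expiry is ignored so that a node stays warm once it has served a request. `simulate_lazy_caching` is a hypothetical helper, not part of ml-commons.

```python
from typing import Iterable, Set, Tuple


def simulate_lazy_caching(routed_nodes: Iterable[str]) -> Tuple[Set[str], int]:
    """Return (nodes that ended up caching the model, number of cold invocations)."""
    warm: Set[str] = set()
    cold_invocations = 0
    for node in routed_nodes:
        if node not in warm:
            # First request on this node: it deploys and caches the model.
            cold_invocations += 1
            warm.add(node)
    return warm, cold_invocations


# High traffic spread over three nodes: every node warms up, at a cost of
# one cold invocation per node.
high_traffic = ["node-a", "node-b", "node-c"] * 3
# Low traffic that only ever reaches one node: the other nodes never deploy.
low_traffic = ["node-a", "node-a"]
```

Under these assumptions, the number of cold invocations is bounded by the number of nodes that actually receive traffic, not by the size of the cluster.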

What alternatives have you considered?
Auto-deploy for remote models has already mitigated many model deployment issues, but it does not cover all edge cases. This proposal can be seen as an additional step to further improve our model deployment strategy.

Do you have any additional context?
#2050
#2376
#2382

@Zhangxunmt Zhangxunmt added enhancement New feature or request untriaged labels Nov 15, 2024
@Zhangxunmt Zhangxunmt changed the title [FEATURE] Remove unnecessary Deployment for Remote Models [FEATURE] Unsynchronize the deployment for ml-commons remote models in an OpenSearch domain Nov 15, 2024
@Zhangxunmt Zhangxunmt assigned Zhangxunmt and unassigned Zhangxunmt Nov 15, 2024
@dhrubo-os dhrubo-os moved this to In Progress in ml-commons projects Dec 3, 2024