Introduce inference reserve config for standby instance #265
Labels
- important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
- needs-kind: Indicates a PR lacks a label and requires one.
- needs-triage: Indicates an issue or PR lacks a label and requires one.
What would you like to be added:
In serverless scenarios, once a service is scaled to 0, it takes time to recover when traffic arrives. A standby instance is the usual mitigation, but a standard LLM service typically runs on high-end GPUs, so keeping a full standby is expensive.
We therefore propose a new config, `inferenceReserveConfig`, to define a standby instance on lower-end GPUs. When traffic arrives again, the lower-configured instance serves it immediately (albeit with slower responses) and triggers the rollout of the standard-configured instances.
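A rough sketch of what such a config might look like. This is purely illustrative: the field names (`inferenceReserveConfig`, `flavors`, `promoteOnTraffic`) and the GPU types are assumptions, not a committed API.

```yaml
# Hypothetical shape of the proposed inferenceReserveConfig.
spec:
  # Standard serving instances: high-end GPUs, may scale to zero when idle.
  replicas: 0
  flavors:
    - name: a100        # illustrative high-end GPU flavor

  # Proposed: a cheap standby that stays up while the standard
  # instances are scaled to zero.
  inferenceReserveConfig:
    replicas: 1
    flavors:
      - name: t4        # illustrative lower-end GPU flavor
    # When traffic hits the standby, serve it immediately (responses
    # will be slower) and trigger the standard instances to roll out.
    promoteOnTraffic: true
```

Once the standard instances are ready, traffic would shift back to them and the standby returns to its idle role.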
Why is this needed:
Completion requirements:
This enhancement requires the following artifacts:
The artifacts should be linked in subsequent comments.