Introduce inference reserve config for standby instance #265
Labels
- important-longterm: Important over the long term, but may not be staffed and/or may need multiple releases to complete.
- needs-kind: Indicates a PR lacks a label and requires one.
- needs-triage: Indicates an issue or PR lacks a label and requires one.
What would you like to be added:
In serverless scenarios, once a service is scaled to 0, it takes time to recover when traffic arrives. A standby instance is the usual mitigation, but a standard LLM service typically runs on high-end GPUs, so keeping a full standby is expensive.
We therefore propose a new config, `inferenceReserveConfig`, to define a standby instance on lower-end GPUs. When traffic arrives again, the lower-configured instance serves it immediately (albeit with slower responses) and triggers the rollout of the standard-configured instances.
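A rough sketch of what such a config might look like. This is purely illustrative: the field names (`inferenceReserveConfig`, `flavors`, `promoteOnTraffic`) and the GPU types are assumptions, not a committed API.

```yaml
# Hypothetical shape of the proposed inferenceReserveConfig.
spec:
  # Standard serving instances: high-end GPUs, may scale to zero when idle.
  replicas: 0
  flavors:
    - name: a100        # illustrative high-end GPU flavor

  # Proposed: a cheap standby that stays up while the standard
  # instances are scaled to zero.
  inferenceReserveConfig:
    replicas: 1
    flavors:
      - name: t4        # illustrative lower-end GPU flavor
    # When traffic hits the standby, serve it immediately (responses
    # will be slower) and trigger the standard instances to roll out.
    promoteOnTraffic: true
```

Once the standard instances are ready, traffic would shift back to them and the standby returns to its idle role.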
Why is this needed:
Completion requirements:
This enhancement requires the following artifacts:
The artifacts should be linked in subsequent comments.