-
-
Notifications
You must be signed in to change notification settings - Fork 5.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Usage]: How can i user scheduler with OpenAI Compatible Server #8282
Comments
I think this isn't a primary concern for vLLM. It may be better to implement a separate load-balancing / scheduling layer on top of the vLLM server. |
I am using FastAPI wrapped on a vLLM server. I have tried searching for documents to create a priority queue for FastAPI but haven't found any. Do you have any suggestions for me? |
I haven't done this myself, so I wouldn't consider myself qualified to provide concrete suggestions. You can check out #4873 for more details. |
Update: I found #5958 which may serve your purpose but currently the PR only considers offline inference. |
thanks a lot |
#5958 has been merged. It should be quite straightforward to expose this argument to the OpenAI-compatible API. |
Closing as completed by #8965. |
Your current environment
How would you like to use vllm
I am currently serving an LLM via the vLLM as an OpenAI Compatible Server. I make API calls as shown below:
I would like to introduce a scheduler or prioritize requests in my API calls. Is there a way to specify a priority or use a scheduler directly in the API call, similar to the following?
Could you provide guidance on implementing this feature or suggest an alternative approach?
Before submitting a new issue...
The text was updated successfully, but these errors were encountered: