
Fix/async chat serving #2727

Merged: 13 commits into vllm-project:main on May 3, 2024

Conversation

@schoennenbeck (Contributor) commented on Feb 2, 2024:

This fixes issue #2683.

Changes:

  • Added OpenAIServingChat._load_chat_template_async, which waits for the tokenizer to become available (a rough sketch follows after this list).
  • Corresponding changes to OpenAIServingChat.__init__ to accommodate this (analogous to the implementation in ServingEngine.__init__).
  • Added a test to check the correct behaviour.
  • Changed the scope of the server fixture in an existing test to make sure GPU memory is freed once the server is no longer needed.
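
A rough sketch of the waiting behaviour described in the first bullet (illustrative only, not the PR's actual code: the class name, the sleep interval, and the scheduling in __init__ are assumptions, and the engine setup that eventually assigns self.tokenizer is omitted):

```python
import asyncio
from typing import Optional


class ChatServingSketch:
    """Illustrative stand-in for OpenAIServingChat, not the actual vLLM code."""

    def __init__(self, chat_template: Optional[str] = None):
        self.tokenizer = None  # set later, once the async engine setup finishes
        # Schedule template loading instead of blocking __init__, mirroring
        # how ServingEngine defers its own async setup.
        asyncio.get_running_loop().create_task(
            self._load_chat_template_async(chat_template))

    async def _load_chat_template_async(self, chat_template: Optional[str]):
        # Wait for the tokenizer to become available before touching it.
        while self.tokenizer is None:
            await asyncio.sleep(0.1)
        if chat_template is not None:
            self.tokenizer.chat_template = chat_template
```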

@simon-mo self-assigned this on Feb 2, 2024

@simon-mo (Collaborator) commented:

Sorry about the delay in review. A more sensible fix seems to be changing the __init__ arguments of the superclass OpenAIServing to accept any asyncio task to be run during its _post_init function, so we can be sure that the chat template is loaded after the tokenizer.

@schoennenbeck (Contributor, Author) commented:

@simon-mo No worries about the delay, I very much appreciate the fact that you maintain this OSS project in your free time.

Thanks for the feedback, I'll have a look into your suggestion and see what I can do.

@schoennenbeck (Contributor, Author) commented:

@simon-mo
New implementation:

  • OpenAIServing's __init__ now accepts an optional await_post_init, which can be any awaitable that will be awaited at the end of the _post_init method (a rough sketch follows below).
  • Some corresponding changes to related tests.

Open questions:

  • Naming of the argument: I feel await_post_init is not a fantastic name.
  • Interface: One could also make await_post_init a list of awaitables that are awaited in order at the end of _post_init, to be even more flexible. I'm open to suggestions.
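
A minimal sketch of this interface, assuming a much-simplified base class (the loop handling, the placeholder tokenizer, and the class names are illustrative, not the real vLLM implementation):

```python
import asyncio
from types import SimpleNamespace
from typing import Awaitable, Optional


class ServingBaseSketch:
    def __init__(self, await_post_init: Optional[Awaitable] = None):
        self.tokenizer = None
        try:
            # Inside a running event loop (e.g. under Ray Serve), schedule the
            # async setup as a task instead of blocking on it.
            asyncio.get_running_loop().create_task(
                self._post_init(await_post_init))
        except RuntimeError:
            # No running loop: run the setup to completion right away.
            asyncio.run(self._post_init(await_post_init))

    async def _post_init(self, await_post_init: Optional[Awaitable]):
        # ... fetch the model config from the async engine, build the tokenizer ...
        self.tokenizer = SimpleNamespace()  # placeholder for the real tokenizer
        if await_post_init is not None:
            await await_post_init


class ServingChatSketch(ServingBaseSketch):
    def __init__(self, chat_template: Optional[str] = None):
        # Pass the template loader as the awaitable to run after post-init.
        super().__init__(
            await_post_init=self._load_chat_template(chat_template))

    async def _load_chat_template(self, chat_template: Optional[str]):
        # Runs only after _post_init has set up the tokenizer.
        if chat_template is not None:
            self.tokenizer.chat_template = chat_template
```

Because the awaitable runs at the very end of _post_init, the chat template is only applied once the tokenizer exists, regardless of whether the class is constructed inside or outside a running event loop.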

@abin-tiger commented:

We faced the same issue when using vLLM with Ray Serve. This fix worked for us!

@schoennenbeck (Contributor, Author) commented:

@simon-mo Would it be possible to get some feedback on this PR? The code changes are pretty minimal, and seeing as there was another PR/bug report last week, it seems that multiple people are facing this problem.

@simon-mo (Collaborator) commented on May 2, 2024:

This looks good. If you fix the conflict, I will click merge.

@schoennenbeck (Contributor, Author) commented:

@simon-mo Thanks a lot. The merge conflicts are resolved and the remaining issues with the tests are fixed.

@simon-mo merged commit f8e7add into vllm-project:main on May 3, 2024. 59 checks passed.
robertgshaw2-redhat pushed a commit to neuralmagic/nm-vllm that referenced this pull request May 6, 2024
The review thread below refers to this hunk from the diff, where _load_chat_template becomes async:

```diff
@@ -356,7 +359,10 @@ async def chat_completion_full_generator(

         return response

-    def _load_chat_template(self, chat_template: Optional[str]):
+    async def _load_chat_template(self, chat_template: Optional[str]):
+        while self.tokenizer is None:
```
@DarkLight1337 (Member) commented:
I think this line is faulty. There is no code that actually initializes self.tokenizer = None before self._load_chat_template is called. You might want to instead use hasattr to check the existence of the attribute.

@schoennenbeck (Contributor, Author) commented on May 7, 2024:

@DarkLight1337 You are right. Until two weeks ago, ServingEngine set self.tokenizer = None in its __init__, but that changed in this commit.
The tests still pass because, by the time _load_chat_template is awaited, the tokenizer is already there (which was the idea behind this in the first place). How do you want to handle this?

@schoennenbeck (Contributor, Author) commented:

I could open another PR that simply replaces while self.tokenizer is None with while getattr(self, "tokenizer", None) is None.

@DarkLight1337 (Member) commented:

I guess inside tests/async_engine/test_chat_template.py, you can use another MockServingChat that doesn't have the chat_template attribute. However, I am doubtful about the value of such a test since it does not depend on OpenAIServing.__init__, making it useless if another commit changes the logic.

IMO, it would be more useful to test a situation where the tokenizer takes considerably longer to load, making it more likely that the chat template will be accessed before the engine has fully loaded.
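
A possible shape for such a test, assuming pytest-asyncio and a hypothetical mock whose tokenizer only appears after a delay (none of these names come from the vLLM test suite):

```python
import asyncio

import pytest


class SlowMockServingChat:
    """Hypothetical mock: the tokenizer only becomes available after a delay."""

    def __init__(self):
        self.tokenizer = None
        self.chat_template = None

    async def finish_loading(self, delay: float):
        await asyncio.sleep(delay)
        self.tokenizer = object()

    async def load_chat_template(self, chat_template: str):
        # Same waiting pattern as the PR: poll until the tokenizer exists.
        while self.tokenizer is None:
            await asyncio.sleep(0.01)
        self.chat_template = chat_template


@pytest.mark.asyncio
async def test_chat_template_waits_for_slow_tokenizer():
    serving = SlowMockServingChat()
    # Kick off template loading before the tokenizer is ready.
    load_task = asyncio.create_task(
        serving.load_chat_template("{{ messages }}"))
    await serving.finish_loading(delay=0.5)
    await load_task
    assert serving.chat_template == "{{ messages }}"
```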

@DarkLight1337 (Member) commented on May 7, 2024:

Honestly, I think the current design is not that great as it puts async logic into __init__. It would be better if __init__ requires tokenizer and chat_template upfront so that developers are encouraged to place the async logic outside of the constructor.

To maintain the same functionality as the current __init__, we can have a separate async staticmethod factory that does both the async work and the initialization.
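
A sketch of what such a factory could look like (illustrative; the engine interface and the tokenizer construction are simplified placeholders, and it is written as a classmethod rather than a staticmethod):

```python
from types import SimpleNamespace
from typing import Optional


class ChatServingFactorySketch:
    def __init__(self, tokenizer, chat_template: Optional[str]):
        # Plain synchronous construction: dependencies arrive ready-made.
        self.tokenizer = tokenizer
        self.chat_template = chat_template

    @classmethod
    async def create(cls, engine, chat_template: Optional[str]):
        # All async work happens here, outside the constructor.
        model_config = await engine.get_model_config()
        tokenizer = SimpleNamespace(model_config=model_config)  # placeholder
        return cls(tokenizer, chat_template)


# Usage from async startup code:
#   serving_chat = await ChatServingFactorySketch.create(engine, chat_template)
```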

@schoennenbeck (Contributor, Author) commented:

I agree with that sentiment. The only reason _post_init is async in the first place is that engine.get_model_config is async, and that in turn is only async in order to let the engine workers use Ray. So 95% of the code is already synchronous, and the remaining 5% is only artificially asynchronous to enable Ray workers.

@schoennenbeck (Contributor, Author) commented:

Loading the chat_template used to be synchronous (before this PR), but that didn't mesh with the async code in ServingEngine's __init__.

@DarkLight1337 (Member) commented:

> Honestly, I think the current design is not that great as it puts async logic into __init__. It would be better if __init__ requires tokenizer and chat_template upfront so that developers are encouraged to place the async logic outside of the constructor.
>
> To maintain the same functionality as the current __init__, we can have a separate async staticmethod factory that does both async and initialization.

If you don't mind, I'll work on a PR regarding this issue. This should make it no longer necessary to test the async behaviour of _load_chat_template, so the relevant tests will be removed.

@schoennenbeck (Contributor, Author) commented:

Feel free to open a PR, but I'm not sure what you mean regarding the tests. As long as OpenAIServingChat can still be instantiated in an async function, I'm fine with everything ;)

@DarkLight1337 (Member) commented on May 8, 2024:

I mean that the chat template tests will be reverted to sync functions as opposed to async.

Edit: I have opened #4674, feel free to take a look.

z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request May 7, 2024
dtrifiro pushed a commit to opendatahub-io/vllm that referenced this pull request May 7, 2024

Successfully merging this pull request may close this issue: OpenAIServingChat cannot be instantiated within a running event loop (#2683).

4 participants