
Copilot Chat: Dotnet AzureTextEmbeddingGeneration is extremely slow #1952

Closed

hooyao opened this issue Jul 11, 2023 · 9 comments
Labels: kernel (Issues or pull requests impacting the core kernel), python (Pull requests for the Python Semantic Kernel)

hooyao commented Jul 11, 2023

Describe the bug
When experimenting with the sample copilot-chat-app, using MemoriesStore:Type:volatile and AIService:Type:AzureOpenAI with the embedding model text-embedding-ada-002, a single "hello" chat can take 100s to complete. When switched to Azure Cognitive Search, the same chat completes in 6s.

I was using an Azure OpenAI deployment for testing purposes; the model is clearly not being throttled.

To Reproduce
Following the notebook sample 06-memory-and-embeddings.ipynb, manually adding an index takes 45 seconds to embed and save 5 simple sentences to the memory store. Each embedding API call takes around 10s. I also tried the same embedding model from the same Azure OpenAI endpoint with LangChain; there, each request takes only 0.2s.

Expected behavior
Generating a single embedding should take less than 1s, not 10s.

Screenshots
[Screenshot: copilot-chat-app is super slow with Azure OpenAI embedding]

Platform

  • OS: Windows and devcontainer
  • IDE: VS Code
  • Language: C#
  • Source: NuGet package version 0.17.230704.3-preview


@shawncal shawncal added the python and triage labels Jul 11, 2023
@shawncal shawncal changed the title Dotnet AzureTextEmbeddingGeneration is extremely slow Python: Dotnet AzureTextEmbeddingGeneration is extremely slow Jul 11, 2023
@hooyao hooyao changed the title Python: Dotnet AzureTextEmbeddingGeneration is extremely slow C#: Dotnet AzureTextEmbeddingGeneration is extremely slow Jul 11, 2023
@shawncal shawncal changed the title C#: Dotnet AzureTextEmbeddingGeneration is extremely slow Copilot Chat: Dotnet AzureTextEmbeddingGeneration is extremely slow Jul 17, 2023
TaoChenOSU (Contributor) commented

Hello @hooyao,

Thank you for reporting the issue!

The Copilot Chat app can potentially make a few calls to the embedding model to generate multiple embeddings for each message. We are actively working on improving the performance.

However, 100s still sounds abnormal. Could you let me know how long the conversation had been going when you saw the increase in time?

@TaoChenOSU TaoChenOSU self-assigned this Jul 20, 2023
@TaoChenOSU TaoChenOSU removed the triage label Jul 20, 2023
@TaoChenOSU TaoChenOSU added the kernel and copilot chat labels Jul 20, 2023
glahaye (Contributor) commented Jul 26, 2023

@hooyao I was able to reproduce the problem. Indexing simple text files containing only a handful of sentences takes 10 to 15 seconds.

Investigating what could be the root cause.

glahaye (Contributor) commented Jul 26, 2023

So I eliminated other calls and network factors, and it seems the bulk of the time spent saving embeddings is spent within the call to kernel.Memory.SaveInformationAsync() (in this case using AzureTextEmbeddingGeneration + VolatileMemoryStore).

I then used Fiddler to confirm my hunch that most of that time we are busy establishing a TLS connection to the Azure OpenAI server (7.6 seconds in my case). Once the TLS tunnel is set up, the actual call itself takes minimal time (0.2 seconds in my case).

I believe we create a connection from scratch every time we connect to the service to create embeddings. That would cost us a lot of time, as opposed to maintaining the same connection to Azure OpenAI.

@awharrison-28 Any thoughts on this? Looks like we aren't re-using an HttpClient in SK when maybe we should!
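
[Editor's note] A minimal sketch of the effect described above (not SK code; the URL is a placeholder): a fresh HttpClient per request pays the TLS handshake every time, while a shared instance reuses the pooled connection.

using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;

class TlsReuseDemo
{
    // Shared instance: the underlying handler pools connections, so the
    // TLS handshake is paid only once per host.
    private static readonly HttpClient Shared = new HttpClient();

    static async Task Main()
    {
        var url = "https://example.com/"; // stand-in for the Azure OpenAI endpoint

        for (int i = 0; i < 3; i++)
        {
            // Fresh client each time: new connection, new TLS handshake.
            var sw = Stopwatch.StartNew();
            using (var fresh = new HttpClient())
                await fresh.GetAsync(url);
            Console.WriteLine($"fresh client:  {sw.ElapsedMilliseconds} ms");

            // Shared client: after the first call, the pooled connection is reused.
            sw.Restart();
            await Shared.GetAsync(url);
            Console.WriteLine($"shared client: {sw.ElapsedMilliseconds} ms");
        }
    }
}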

vgpro54321 commented

I can confirm the slowness when using 06-memory-and-embeddings.ipynb from the semantic-kernel samples against Azure OpenAI.

const string MemoryCollectionName = "aboutMe";

await kernel.Memory.SaveInformationAsync(MemoryCollectionName, id: "info1", text: "My name is Andrea");
await kernel.Memory.SaveInformationAsync(MemoryCollectionName, id: "info2", text: "I currently work as a tourist operator");
await kernel.Memory.SaveInformationAsync(MemoryCollectionName, id: "info3", text: "I currently live in Seattle and have been living there since 2005");
await kernel.Memory.SaveInformationAsync(MemoryCollectionName, id: "info4", text: "I visited France and Italy five times since 2015");
await kernel.Memory.SaveInformationAsync(MemoryCollectionName, id: "info5", text: "My family is from New York");

The five calls above ran for 45s.

await Chat("Hello, I think we've met before, remember? my name is...");
This ran for 2m 25s.

I assume that the volatile memory store should be pretty much instant at saving embeddings, and that the time is most likely spent on embedding generation.
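
[Editor's note] A per-call timing harness (a sketch; it assumes a kernel already configured with AzureTextEmbeddingGeneration and VolatileMemoryStore, as in the notebook) makes it easy to see whether the ~10s cost is paid on every request or only on the first:

// Sketch: time each SaveInformationAsync call from the notebook individually.
// `kernel` is assumed to be set up as in 06-memory-and-embeddings.ipynb.
var sw = System.Diagnostics.Stopwatch.StartNew();
await kernel.Memory.SaveInformationAsync("aboutMe", id: "info1", text: "My name is Andrea");
Console.WriteLine($"info1: {sw.ElapsedMilliseconds} ms");

sw.Restart();
await kernel.Memory.SaveInformationAsync("aboutMe", id: "info2", text: "I currently work as a tourist operator");
Console.WriteLine($"info2: {sw.ElapsedMilliseconds} ms");

// A roughly constant ~10 s per call points to a per-request cost
// (connection setup or throttling back-off), not one-time initialization.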

glahaye (Contributor) commented Jul 31, 2023

I'm looking into how we could re-use HttpClient instances in order to minimize HTTPS tunnel creation:
https://learn.microsoft.com/en-us/aspnet/core/fundamentals/http-requests?view=aspnetcore-7.0#consumption-patterns
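
[Editor's note] For reference, a minimal sketch of the named-client consumption pattern from that page (generic Microsoft.Extensions.DependencyInjection setup using the Microsoft.Extensions.Http package; the client name and endpoint are illustrative):

using System;
using System.Net.Http;
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();

// The factory pools and recycles the underlying HttpMessageHandler,
// so connections (and their TLS sessions) are reused across clients.
services.AddHttpClient("azure-openai", client =>
{
    client.BaseAddress = new Uri("https://my-resource.openai.azure.com/"); // placeholder
});

var provider = services.BuildServiceProvider();
var factory = provider.GetRequiredService<IHttpClientFactory>();

// HttpClient instances from the factory are cheap to create; the pooled
// handler behind them is what avoids repeated TLS setup.
HttpClient client = factory.CreateClient("azure-openai");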

hooyao (Author) commented Jul 31, 2023

Thank you @glahaye @vgpro54321 for helping reproduce the issue. My concern is that even if the HttpClient is not reused, the observed behavior is still not expected. I tried injecting a dedicated HttpClient into AzureTextEmbeddingGeneration; no luck.

And to be honest, the HttpClient parts of SK are full of hand-crafted code instead of following standard practice, for example DefaultHttpRetryHandler instead of an industry-grade retry policy built with Polly. Hand-rolled code like that wastes time, needs more test coverage, and is sometimes buggy. As an HttpClient best practice, I prefer injecting an IHttpClientFactory instead of an HttpClient instance.
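
[Editor's note] A sketch of the Polly approach mentioned above, using the Microsoft.Extensions.Http.Polly package (the client name and policy values are illustrative, not SK's actual configuration):

using System;
using System.Net.Http;
using Microsoft.Extensions.DependencyInjection;
using Polly;
using Polly.Extensions.Http;

var services = new ServiceCollection();

// Named client with a Polly retry policy instead of a hand-rolled handler.
services.AddHttpClient("azure-openai-embeddings")
    // HandleTransientHttpError covers 5xx, 408, and HttpRequestException;
    // additionally retry on 429 (throttling) with exponential backoff.
    .AddPolicyHandler(HttpPolicyExtensions
        .HandleTransientHttpError()
        .OrResult(r => (int)r.StatusCode == 429)
        .WaitAndRetryAsync(3, attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt))));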

glahaye (Contributor) commented Jul 31, 2023

@hooyao I was just about to try what you did. Are you saying that even with a dedicated (singleton?) HttpClient given to AzureTextEmbeddingGeneration, you experienced the TLS setup delay?

Indeed, injecting an HttpClient is not the ideal solution. I'll see if I can get some traction on IHttpClientFactory (which is usually the way to go)...

vgpro54321 commented

> I assume that the volatile memory store should be pretty much instant at saving embeddings, and that the time is most likely spent on embedding generation.

I noticed that the quota for the embedding model was very limited. Bumping the quota limits increased the speed, so now it takes 2.5s to get 10 embeddings in the notebooks, which is all right.

The framework retries accessing the endpoint after waiting a few seconds because it has a retry handler on the HttpClient; this also contributes to the slowness.
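
[Editor's note] One way to check whether throttling (rather than connection setup) is the bottleneck is to call the Azure OpenAI embeddings REST endpoint directly and watch the status codes; 429 responses mean the SDK's retry handler would be sleeping between attempts. The resource name, deployment, key variable, and api-version below are placeholders.

using System;
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

class ThrottleCheck
{
    static async Task Main()
    {
        // Placeholders: substitute your resource, deployment, and API version.
        var url = "https://my-resource.openai.azure.com/openai/deployments/" +
                  "text-embedding-ada-002/embeddings?api-version=2023-05-15";

        using var client = new HttpClient();
        client.DefaultRequestHeaders.Add(
            "api-key", Environment.GetEnvironmentVariable("AZURE_OPENAI_KEY"));

        for (int i = 0; i < 10; i++)
        {
            var body = new StringContent(
                "{\"input\": \"hello\"}", Encoding.UTF8, "application/json");
            var response = await client.PostAsync(url, body);
            // 200 = served immediately; 429 = throttled by the quota.
            Console.WriteLine($"call {i}: {(int)response.StatusCode}");
        }
    }
}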

@github-project-automation github-project-automation bot moved this from In progress to Done in Apps & Services Semantic Kernel Sep 5, 2023
glahaye (Contributor) commented Sep 5, 2023

I noticed this was closed, so I opened a new item in the new chat-copilot repo to pursue this further:
microsoft/chat-copilot#302
