New grpc requests going to grpc service pods in terminating state. #13575
Related to this is your discussion topic: https://github.com/knative/serving/discussions/13516 |
Can you make a sample repo with instructions/Dockerfile to create a container with the app? /triage needs-user-input |
Discussion topic duplicating this issue: https://github.com/knative/serving/discussions/13572 |
Thank you. I will create a sample repo and share the details. |
Hi,
Thank you. |
Hi Dave, Thank you. |
hey @msgurikar, thanks for putting together this example. I am going through it. I started with the gRPC service repo but I am having problems creating the docker image.
it seems |
Hi, Once the service image is created, deploy it as a Knative service using knative_service.yaml; it will run and wait for incoming gRPC requests. Thank you. |
@msgurikar
|
Sorry about that. I have fixed it and pushed the changes to https://github.com/msgurikar/grpc_bidirectional_server Thank you. |
So, under grpc_bidirectional_server I created the sample-grpc-service docker image. This container exposes port 40056.
On the other hand we have grpc_bidirecional_client/SampleGrpcBiDirection. I created the client-samplegrpc-bidirection docker image based on the Dockerfile in that folder. This exposes port 40081.
Then we have grpc_bidirecional_client/TestGrpcClient (which did not have a Dockerfile), so I went ahead and created a Dockerfile based on the one present in SampleGrpcBiDirection. This is the Dockerfile I ended up with:
#See https://aka.ms/containerfastmode to understand how Visual Studio uses this Dockerfile to build your images for faster debugging.
FROM mcr.microsoft.com/dotnet/runtime:6.0 AS base
WORKDIR /app
FROM mcr.microsoft.com/dotnet/sdk:6.0 AS build
WORKDIR /src
COPY ["TestGrpcClient/TestGrpcClient.csproj", "TestGrpcClient/"]
RUN dotnet restore "TestGrpcClient/TestGrpcClient.csproj"
COPY . .
WORKDIR "/src/TestGrpcClient"
RUN dotnet build "TestGrpcClient.csproj" -c Release -o /app/build
FROM build AS publish
RUN dotnet publish "TestGrpcClient.csproj" -c Release -o /app/publish
FROM base AS final
WORKDIR /app
COPY --from=publish /app/publish .
# ENV QI_ALGO_SERVICE_TARGET="127.0.0.1:40056"
# EXPOSE 40081
ENTRYPOINT ["dotnet", "TestGrpcClient.dll"]
With this I created the test-grpc-client docker image, which, by the code in https://github.com/msgurikar/grpc_bidirecional_client/blob/main/TestGrpcClient/TestGrpcClient/Program.cs#L50, I concluded will call the server in SampleGrpcBiDirection.
I started by running them on my local laptop, like this:
⚡ docker run --rm -p 40081:40081 --network host -it client-samplegrpc-bidirection
WARNING: Published ports are discarded when using host network mode
Current director is /app/Logs
server listening on port 40081
info: GrpcHostedService[0]
server listening on port 40081
info: Microsoft.Hosting.Lifetime[0]
Application started. Press Ctrl+C to shut down.
info: Microsoft.Hosting.Lifetime[0]
Hosting environment: Production
info: Microsoft.Hosting.Lifetime[0]
Content root path: /app/
And then I ran the client like so:
☁ TestGrpcClient [main] ⚡ docker run --rm --network host -it test-grpc-client
Hello, World!
Press 9 to close, any other keys to start computation
o
Eneter number of requests to send to KNative grpc service
10
Exception throw Status(StatusCode="Unknown", Detail="Exception was thrown by handler.", DebugException="Grpc.Core.Internal.CoreErrorDetailException: ***@***.***","description":"Error received from peer ipv4:127.0.0.1:40081","file":"/var/local/git/grpc/src/core/lib/surface/call.cc","file_line":953,"grpc_message":"Exception was thrown by handler.","grpc_status":2}")
Press 9 to close, any other keys to start computation
The output on the server console is:
Compute request recieved with number of requests is 10
SampleService compute failed Status(StatusCode="Internal", Detail="Failed to create secure client channel", DebugException="Grpc.Core.Internal.CoreErrorDetailException: ***@***.***","description":"lame client channel","file":"/var/local/git/grpc/src/core/lib/surface/lame_client.cc","file_line":184,"grpc_message":"Failed to create secure client channel","grpc_status":13}")
SampleService compute failed Status(StatusCode="Internal", Detail="Failed to create secure client channel", DebugException="Grpc.Core.Internal.CoreErrorDetailException: ***@***.***","description":"lame client channel","file":"/var/local/git/grpc/src/core/lib/surface/lame_client.cc","file_line":184,"grpc_message":"Failed to create secure client channel","grpc_status":13}")
SampleService compute failed Status(StatusCode="Internal", Detail="Failed to create secure client channel", DebugException="Grpc.Core.Internal.CoreErrorDetailException: ***@***.***","description":"lame client channel","file":"/var/local/git/grpc/src/core/lib/surface/lame_client.cc","file_line":184,"grpc_message":"Failed to create secure client channel","grpc_status":13}")
SampleService compute failed Status(StatusCode="Internal", Detail="Failed to create secure client channel", DebugException="Grpc.Core.Internal.CoreErrorDetailException: ***@***.***","description":"lame client channel","file":"/var/local/git/grpc/src/core/lib/surface/lame_client.cc","file_line":184,"grpc_message":"Failed to create secure client channel","grpc_status":13}")
Error occured while computing 1
Error occured while computing 2
Error occured while computing 1
Error occured while computing 2
Error occured while computing 0
Error occured while computing 1
Error occured while computing 2
Error occured while computing 0
Error occured while computing 9
Error occured while computing 1
Error occured while computing 2
Error occured while computing 0
Error occured while computing 9
SampleService compute failed Status(StatusCode="Internal", Detail="Failed to create secure client channel", DebugException="Grpc.Core.Internal.CoreErrorDetailException: ***@***.***","description":"lame client channel","file":"/var/local/git/grpc/src/core/lib/surface/lame_client.cc","file_line":184,"grpc_message":"Failed to create secure client channel","grpc_status":13}")
Error occured while computing 1
Error occured while computing 2
Error occured while computing 0
Error occured while computing 9
Error occured while computing 7
SampleService compute failed Status(StatusCode="Internal", Detail="Failed to create secure client channel", DebugException="Grpc.Core.Internal.CoreErrorDetailException: ***@***.***","description":"lame client channel","file":"/var/local/git/grpc/src/core/lib/surface/lame_client.cc","file_line":184,"grpc_message":"Failed to create secure client channel","grpc_status":13}")
Error occured while computing 1
Error occured while computing 2
Error occured while computing 0
Error occured while computing 9
Error occured while computing 7
Error occured while computing 5
SampleService compute failed Status(StatusCode="Internal", Detail="Failed to create secure client channel", DebugException="Grpc.Core.Internal.CoreErrorDetailException: ***@***.***","description":"lame client channel","file":"/var/local/git/grpc/src/core/lib/surface/lame_client.cc","file_line":184,"grpc_message":"Failed to create secure client channel","grpc_status":13}")
Error occured while computing 1
Error occured while computing 2
Error occured while computing 0
Error occured while computing 9
Error occured while computing 7
Error occured while computing 5
Error occured while computing 4
SampleService compute failed Status(StatusCode="Internal", Detail="Failed to create secure client channel", DebugException="Grpc.Core.Internal.CoreErrorDetailException: ***@***.***","description":"lame client channel","file":"/var/local/git/grpc/src/core/lib/surface/lame_client.cc","file_line":184,"grpc_message":"Failed to create secure client channel","grpc_status":13}")
Error occured while computing 1
Error occured while computing 2
Error occured while computing 0
Error occured while computing 9
Error occured while computing 7
Error occured while computing 5
Error occured while computing 4
Error occured while computing 3
SampleService compute failed Status(StatusCode="Internal", Detail="Failed to create secure client channel", DebugException="Grpc.Core.Internal.CoreErrorDetailException: ***@***.***","description":"lame client channel","file":"/var/local/git/grpc/src/core/lib/surface/lame_client.cc","file_line":184,"grpc_message":"Failed to create secure client channel","grpc_status":13}")
SampleService compute failed Status(StatusCode="Internal", Detail="Failed to create secure client channel", DebugException="Grpc.Core.Internal.CoreErrorDetailException: ***@***.***","description":"lame client channel","file":"/var/local/git/grpc/src/core/lib/surface/lame_client.cc","file_line":184,"grpc_message":"Failed to create secure client channel","grpc_status":13}")
Error occured while computing 1
Error occured while computing 2
Error occured while computing 0
Error occured while computing 9
Error occured while computing 7
Error occured while computing 5
Error occured while computing 4
Error occured while computing 3
Error occured while computing 6
Error occured while computing 1
Error occured while computing 2
Error occured while computing 0
Error occured while computing 9
Error occured while computing 7
Error occured while computing 5
Error occured while computing 4
Error occured while computing 3
Error occured while computing 6
Error occured while computing 8
One or more Compute tasks have been failed due to Status(StatusCode="Internal", Detail="Failed to create secure client channel", DebugException="Grpc.Core.Internal.CoreErrorDetailException: ***@***.***","description":"lame client channel","file":"/var/local/git/grpc/src/core/lib/surface/lame_client.cc","file_line":184,"grpc_message":"Failed to create secure client channel","grpc_status":13}")
Questions:
1. Did I miss something? How is all this tied together?
2. Is there a client for testing the server running on 40056?
|
Hi,
Thank you for your email.
Yes. We need to start the service first => grpc_bidirection_server @ 40056
and next => grpc_bidirection_client @ 40081.
Then we run TestGrpcClient, giving it a number of requests, e.g. 10. This request
goes to grpc_bidirection_client and, based on the number of requests,
grpc_bidirection_client::Compute creates that many gRPC requests and
sends them to the grpc_bidirection_server service.
If you run grpc_bidirection_server in Knative, it will create 10 pods, one
pod per request. The problem comes when we request 30: some of the pods
start terminating, and if we then send a few more requests, e.g. 20,
some of the new requests go to pods that are in the terminating state.
Hope this clears things up.
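To make the load pattern concrete, here is a minimal C++ sketch of the fan-out described above. It is only an illustration under stated assumptions: the actual repos are .NET and define their own proto, so the generated header, stub, message types, and host name used here (sample.grpc.pb.h, SampleService::NewStub, ComputeRequest/ComputeReply, grpc-bidirection-server.default.example.com) are hypothetical stand-ins, and unary calls are used for brevity.

// Hedged sketch: fan out N concurrent gRPC calls over one channel, mirroring
// what grpc_bidirection_client::Compute is described to do above.
#include <grpcpp/grpcpp.h>

#include <future>
#include <iostream>
#include <string>
#include <vector>

#include "sample.grpc.pb.h"  // hypothetical generated code for the sample proto

int main(int argc, char** argv) {
  const int num_requests = (argc > 1) ? std::stoi(argv[1]) : 10;

  // One shared channel to the Knative route (placeholder host name).
  auto channel = grpc::CreateChannel(
      "grpc-bidirection-server.default.example.com:80",
      grpc::InsecureChannelCredentials());
  auto stub = sample::SampleService::NewStub(channel);

  // Launch all calls concurrently; with containerConcurrency: 1 on the server,
  // each in-flight call should be served by its own pod.
  std::vector<std::future<grpc::Status>> calls;
  for (int i = 0; i < num_requests; ++i) {
    calls.emplace_back(std::async(std::launch::async, [&stub, i] {
      grpc::ClientContext context;
      sample::ComputeRequest request;
      request.set_id(i);
      sample::ComputeReply reply;
      return stub->Compute(&context, request, &reply);
    }));
  }

  for (int i = 0; i < num_requests; ++i) {
    const grpc::Status status = calls[i].get();
    if (!status.ok()) {
      std::cout << "request " << i << " failed: " << status.error_message() << "\n";
    }
  }
  return 0;
}

Because every call is outstanding at the same time, sending 30 requests and then another 20 exercises the scale-up, scale-down, and "request lands on a terminating pod" window described above.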
|
I deployed the sample server as a knative service. I have been executing the test over and over again and I have not gotten any failures. I am running this in a 3-node GKE cluster with 4 vCPU and 16 GB each. Questions:
|
For instance, this is how the list of pods looks after some minutes without sending any requests:
IMHO, it should not take that long to terminate the process. |
@jsanin-vmw Questions: Could you provide additional info about your environment? Have you tried sending 30 requests, waiting for some pods to start terminating, and then sending another 20 requests? If I repeat this step 4-5 times, I get this issue. Thank you. |
How do you know that pods in I saw that once they go to
It only allows me to send 20 req max. |
I was checking the logs of the terminating pods; since it takes a while for them to get killed, I get errors on the sample grpc bidirection client side. Thank you. |
by checking the status of the pod when it goes to Terminating I see this:
another example is this:
|
I have not received any errors from the
even though there are a bunch of |
OK. You don't see new incoming requests going to Terminating pods. I can see it on my side, and at times, when a terminating pod serving a request gets killed after the grace period, I get the error. |
Will you be able to modify your server so it messages back the pod name and its status? In the meantime, I will give it a try in AKS with Kourier. |
Sure. I will modify the server code and update you once it is updated in git. |
you might want to use this to get the pod name. |
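For illustration only (this is not the actual repo change): if the pod name is exposed to the container as a POD_NAME environment variable, for example via the Kubernetes Downward API (fieldRef to metadata.name), the C++ service can read it and echo it back in every reply, which is enough to tell which pod served a given request. The reply type and the set_served_by field below are assumed names.

// Hedged sketch: read the pod name from an env var assumed to be injected via
// the Kubernetes Downward API and attach it to each reply.
#include <cstdlib>
#include <string>

// Returns the pod name, or "unknown" when the variable is not set
// (e.g. when running outside Kubernetes).
std::string PodName() {
  const char* name = std::getenv("POD_NAME");
  return name ? std::string(name) : "unknown";
}

// Inside the gRPC handler one could then do, for example:
//   reply->set_served_by(PodName());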
I wanted to report that I tested it in AKS with Kourier.
I observed the same behavior reported earlier. The knative service is running on the cluster. I am running the client and the test locally, as local docker containers that connect to the remote ksvc. I went ahead and created another instance of the tests, and that way I was able to run 40 requests simultaneously. I did this several times. |
@jsanin-vmw |
No. I would like to give more details about how I am testing this.
Then build the grpc service image:
Push this image to a registry. Create a kn service with this image:
Patch the ConfigMap
Setup your DNS or your Get your ksvc host name:
Now go to
Create the image
Run the
Now run the
Run the tests with the number of requests you want to create. I will put this down for now. I was not able to reproduce the behavior you reported after testing this in GKE and AKS. Let me know when you modify the server to message back the POD name and its status. |
@jsanin-vmw I have updated the grpc_bidirectional_service and client repos to include the pod name in the message. I tried to include the pod status as well, but I couldn't find a way to get it in C++; I see one for C#. Thank you so much. |
@msgurikar |
I added this label to the
this is the yaml for sample-bidirection-client:
I modified this line https://github.com/msgurikar/grpc_bidirecional_client/blob/main/TestGrpcClient/TestGrpcClient/Program.cs#L50 to be
and created a new image for it. This is what I see on the test client:
On the
On the
The Am I missing something ? |
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: sample-bidirection-client
spec:
  template:
    metadata:
      name: sample-bidirection-client-1
      annotations:
        # Target 10 in-flight-requests per pod.
        #autoscaling.knative.dev/target: "1"
        # container-concurrency-target-percentage: "80"
        autoscaling.knative.dev/targetUtilizationPercentage: "100"
        #autoscaling.knative.dev/metric: "concurrency"
        # autoscaling.knative.dev/initialScale: "0"
        autoscaling.knative.dev/minScale: "1"
        autoscaling.knative.dev/maxScale: "20"
        autoscaling.knative.dev/scaleDownDelay: "10m"
    spec:
      containerConcurrency: 1
      containers:
        - name: sample-bidirection-client
          image: samplebidirectionclient:latest
          imagePullPolicy: Always
          env:
            - name: "GRPC_DNS_RESOLVER"
              value: "native"
            - name: GRPC_TRACE
              value: "all"
            - name: "GRPC_VERBOSITY"
              value: "ERROR"
            - name: MY_SERVICE_TARGET
              value: mygrpcbidirservice.default.svc.cluster.local:80
            - name: MY_SERVICE_TARGET_DEFAULT_AUTHORITY
              value: mygrpcbidirservice.default.example.com
          ports:
            - name: h2c
              containerPort: 40081
|
@jsanin-vmw making grpc-bidirection-service restart, not sure. |
Thanks for the ksvc definition @msgurikar I know why the
this mainly because knative does not want me to use
With this env var in place I was able to run it. The behavior was the same as previously reported. All requests were handled correctly and no error messages were seen. I could not reproduce the issue reported. |
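For reference, here is a minimal C++ sketch (not the actual client code) of how a client could pick up the target and override the default :authority from environment variables, mirroring the MY_SERVICE_TARGET and MY_SERVICE_TARGET_DEFAULT_AUTHORITY settings in the ksvc above; the fallback target string is only a placeholder.

// Hedged sketch: build a channel whose :authority header is overridden so that
// traffic sent to the in-cluster Kubernetes service name still carries the
// Knative host that the routing layer matches on.
#include <grpcpp/grpcpp.h>

#include <cstdlib>
#include <memory>
#include <string>

std::shared_ptr<grpc::Channel> MakeChannel() {
  const char* target_env = std::getenv("MY_SERVICE_TARGET");
  const char* authority_env = std::getenv("MY_SERVICE_TARGET_DEFAULT_AUTHORITY");
  const std::string target =
      target_env ? target_env : "mygrpcbidirservice.default.svc.cluster.local:80";

  grpc::ChannelArguments args;
  if (authority_env != nullptr) {
    // Present the Knative external host as the HTTP/2 :authority.
    args.SetString(GRPC_ARG_DEFAULT_AUTHORITY, authority_env);
  }
  return grpc::CreateCustomChannel(target, grpc::InsecureChannelCredentials(), args);
}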
@jsanin-vmw |
This issue is stale because it has been open for 90 days with no activity. |
Going to close this out - feel free to re-open if you're able to repro. |
Ask your question here:
Hi,
We have a C++ gRPC service running. We are using Knative Serving to autoscale pods based on the number of incoming requests. Scale up and scale down work nicely with Knative,
but some requests get errors when they go to pods in the terminating state: the client gets a stream error when the pod is killed after the termination grace period while the request is still in progress.
This doesn't always happen. We have observed that even when some pods are in the terminating state for about 5 minutes, new requests coming in during this period usually go to other pods or newly created pods,
but at times we see new requests going to terminating pods, which causes errors.
We tried handling SIGTERM to do a server shutdown, but it didn't help much.
We still see new requests going to terminating pods, and the errors happen more frequently.
I wanted to understand how we can make Knative stop sending new requests to a pod once it goes into the terminating state.
We would highly appreciate your suggestions.
Here is my service code:
std::unique_ptr<grpc::Server> server;

//thread function
void doShutdown()
{
    cout << "Entering doShutdown" << endl;
    //getchar(); // press a key to shutdown the thread
    auto deadline = std::chrono::system_clock::now() +
                    std::chrono::milliseconds(300);
    server->Shutdown(deadline);
    //server->Shutdown();
    std::cout << "Server is shutting down. " << std::endl;
}

void signal_handler(int signal_num)
{
    //std::lock_guard<std::recursive_mutex> lock(server_mutex);
    cout << "The interrupt signal is (" << signal_num
         << "). \n";
    LOG_INFO(LogLayer::Application) << "The interrupt signal is " << signal_num;
    switch (signal_num)
    {
    case SIGINT:
        std::puts("It was SIGINT");
        LOG_INFO(LogLayer::Application) << "It was SIGINT called";
        break;
    case SIGTERM:
        std::puts("It was SIGTERM");
        LOG_INFO(LogLayer::Application) << "It was SIGTERM called";
        break;
    default:
        break;
    }
    // It terminates the program
    LOG_INFO(LogLayer::Application) << "Calling Server Shutdown ";
    cout << "Calling Server Shutdown" << endl;
    std::thread t = std::thread(doShutdown);
    LOG_INFO(LogLayer::Application) << "Call exit() ";
    cout << "Calling exit()" << endl;
    t.join();
    //exit(0);
}

int appMain(const variables_map &values)
{
    const auto port = boost::any_cast<std::string>(values[Services_Common_Options::PORT].value());
    MyServiceImpl my_service;
    grpc::EnableDefaultHealthCheckService(true);
    grpc::reflection::InitProtoReflectionServerBuilderPlugin();
    grpc::ServerBuilder builder;
    builder.AddChannelArgument(GRPC_ARG_KEEPALIVE_TIME_MS, 1000 * 60 * 1);
    builder.AddChannelArgument(GRPC_ARG_KEEPALIVE_TIMEOUT_MS, 1000 * 10);
    builder.AddChannelArgument(GRPC_ARG_HTTP2_MIN_SENT_PING_INTERVAL_WITHOUT_DATA_MS, 1000 * 10);
    builder.AddChannelArgument(GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA, 0);
    builder.AddChannelArgument(GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS, 1);
    //TODO: use secure SSL connection
    builder.AddListeningPort(port, grpc::InsecureServerCredentials());
    // Register "service" as the instance through which we'll communicate with
    // clients. In this case it corresponds to a synchronous service.
    builder.RegisterService(&my_service);
    // Finally assemble the server.
    server = builder.BuildAndStart();
    LOG_INFO(LogLayer::Application) << SERVICE_NAME << " listening on " << port;
    /*std::signal(SIGTERM, signal_handler);
    std::signal(SIGSEGV, signal_handler);
    std::signal(SIGINT, signal_handler);
    std::signal(SIGABRT, signal_handler);*/
    // Wait for the server to shutdown. Note that some other thread must be
    // responsible for shutting down the server for this call to ever return.
    cout << "Server waiting " << endl;
    server->Wait();
    LOG_INFO(LogLayer::Application) << "Server Shutdown ";
    cout << "Server exited " << endl;
    return 0;
}
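For comparison, here is a hedged sketch (not the code above, just one possible restructuring) of the SIGTERM path: the handler only sets a flag, a watcher thread flips the built-in gRPC health service to NOT_SERVING so health-checking callers stop sending new RPCs, in-flight work gets a short drain window, and only then is Shutdown called with a deadline so the Wait() in appMain returns on its own. Whether Knative's data path stops routing to the pod during this window is exactly the open question in this issue, so treat it as a safer shutdown structure rather than a fix; the drain and deadline durations are arbitrary.

// Hedged sketch of a SIGTERM-driven graceful shutdown for a C++ gRPC server.
// Assumes `server` is the global std::unique_ptr<grpc::Server> shown above and
// that grpc::EnableDefaultHealthCheckService(true) was called before BuildAndStart().
#include <grpcpp/grpcpp.h>
#include <grpcpp/health_check_service_interface.h>

#include <atomic>
#include <chrono>
#include <csignal>
#include <thread>

std::atomic<bool> g_terminate{false};

void HandleSigterm(int /*signal_num*/) {
  // Only set a flag: signal handlers should not call into gRPC directly.
  g_terminate.store(true);
}

void ShutdownWatcher(grpc::Server* srv) {
  while (!g_terminate.load()) {
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }
  // Report NOT_SERVING so callers that use gRPC health checking stop sending
  // new RPCs to this process, then give in-flight RPCs a drain window.
  if (auto* health = srv->GetHealthCheckService()) {
    health->SetServingStatus(false);
  }
  std::this_thread::sleep_for(std::chrono::seconds(5));
  // Force-cancel anything still running after the deadline.
  srv->Shutdown(std::chrono::system_clock::now() + std::chrono::seconds(25));
}

// In appMain, after BuildAndStart():
//   std::signal(SIGTERM, HandleSigterm);
//   std::thread watcher(ShutdownWatcher, server.get());
//   server->Wait();
//   watcher.join();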
And here is my Knative service yaml:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: MyKnativeService
spec:
  template:
    metadata:
      name: MyKnativeService-rev1
      annotations:
        # Target 10 in-flight-requests per pod.
        #autoscaling.knative.dev/target: "1"
        # container-concurrency-target-percentage: "80"
        autoscaling.knative.dev/targetUtilizationPercentage: "100"
        #autoscaling.knative.dev/metric: "concurrency"
        autoscaling.knative.dev/initialScale: "0"
        autoscaling.knative.dev/minScale: "0"
        autoscaling.knative.dev/maxScale: "100"
        autoscaling.knative.dev/scaleDownDelay: "3m"
    spec:
      containerConcurrency: 1
      containers:
        - image: ppfaservice:latest
          imagePullPolicy: Always