
New grpc requests going to grpc service pods in terminating state. #13575

Closed
msgurikar opened this issue Jan 5, 2023 · 35 comments
Labels
kind/question (Further information is requested) · lifecycle/stale (no activity for an extended period) · triage/needs-user-input (waiting on a response from the reporter)

Comments

@msgurikar

Ask your question here:

Hi,
We have a C++ gRPC service running, and we use Knative Serving to autoscale its pods based on the number of incoming requests. Scale-up and scale-down work nicely with Knative, but some requests fail when they are routed to pods in the Terminating state: the client gets a stream error when such a pod is killed at the end of the termination grace period while the request is still in progress.
This doesn't always happen. Often, even when some pods stay in Terminating for about 5 minutes, new requests arriving during that period go to other pods or trigger new pods. But at times new requests do go to terminating pods, and that is what causes the errors.
We tried handling SIGTERM and shutting the server down there, but it didn't help much; we still see new requests reaching terminating pods, and the errors are happening more frequently.
I want to understand how to make Knative stop sending new requests to a pod once it enters the Terminating state.
I would highly appreciate your suggestions.

Here is my service code:
std::unique_ptr<grpc::Server> server;

// thread function
void doShutdown()
{
    cout << "Entering doShutdown" << endl;

    //getchar(); // press a key to shutdown the thread
    auto deadline = std::chrono::system_clock::now() +
                    std::chrono::milliseconds(300);
    server->Shutdown(deadline);
    //server->Shutdown();
    std::cout << "Server is shutting down. " << std::endl;
}

void signal_handler(int signal_num)
{
    //std::lock_guard<std::recursive_mutex> lock(server_mutex);
    cout << "The interrupt signal is (" << signal_num << "). \n";
    LOG_INFO(LogLayer::Application) << "The interrupt signal is " << signal_num;

    switch (signal_num)
    {
    case SIGINT:
        std::puts("It was SIGINT");
        LOG_INFO(LogLayer::Application) << "It was SIGINT called";
        break;
    case SIGTERM:
        std::puts("It was SIGTERM");
        LOG_INFO(LogLayer::Application) << "It was SIGTERM called";
        break;
    default:
        break;
    }

    // It terminates the program

    LOG_INFO(LogLayer::Application) << "Calling Server Shutdown ";
    cout << "Calling Server Shutdown" << endl;
    std::thread t = std::thread(doShutdown);
    LOG_INFO(LogLayer::Application) << "Call exit() ";
    cout << "Calling exit()" << endl;
    t.join();
    //exit(0);
}

int appMain(const variables_map &values)
{
    const auto port = boost::any_cast<std::string>(values[Services_Common_Options::PORT].value());

    MyServiceImpl my_service;

    grpc::EnableDefaultHealthCheckService(true);
    grpc::reflection::InitProtoReflectionServerBuilderPlugin();
    grpc::ServerBuilder builder;

    builder.AddChannelArgument(GRPC_ARG_KEEPALIVE_TIME_MS, 1000 * 60 * 1);
    builder.AddChannelArgument(GRPC_ARG_KEEPALIVE_TIMEOUT_MS, 1000 * 10);
    builder.AddChannelArgument(GRPC_ARG_HTTP2_MIN_SENT_PING_INTERVAL_WITHOUT_DATA_MS, 1000 * 10);
    builder.AddChannelArgument(GRPC_ARG_HTTP2_MAX_PINGS_WITHOUT_DATA, 0);
    builder.AddChannelArgument(GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS, 1);

    //TODO: use secure SSL connection
    builder.AddListeningPort(port, grpc::InsecureServerCredentials());
    // Register "service" as the instance through which we'll communicate with
    // clients. In this case it corresponds to a synchronous service.
    builder.RegisterService(&my_service);
    // Finally assemble the server.
    server = builder.BuildAndStart();

    LOG_INFO(LogLayer::Application) << SERVICE_NAME << " listening on " << port;
    /*
    std::signal(SIGTERM, signal_handler);
    std::signal(SIGSEGV, signal_handler);
    std::signal(SIGINT, signal_handler);
    std::signal(SIGABRT, signal_handler);
    */

    // Wait for the server to shutdown. Note that some other thread must be
    // responsible for shutting down the server for this call to ever return.
    cout << "Server waiting " << endl;
    server->Wait();
    LOG_INFO(LogLayer::Application) << "Server Shutdown ";
    cout << "Server exited " << endl;

    return 0;
}
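For reference, here is a minimal sketch of an alternative shutdown path. It assumes the same global server pointer as above and replaces the raw signal handler with a dedicated thread blocked in sigwait, so no logging or thread creation happens inside an async-signal context (which the handler above technically does). The 25-second drain deadline is an illustrative value and should stay below the pod's termination grace period.

#include <signal.h>

#include <chrono>
#include <iostream>
#include <memory>
#include <thread>

#include <grpcpp/grpcpp.h>

extern std::unique_ptr<grpc::Server> server;  // the same global as above

// Runs in its own thread: waits for SIGTERM/SIGINT, then drains the server.
void waitForSignalAndShutdown()
{
    sigset_t set;
    sigemptyset(&set);
    sigaddset(&set, SIGTERM);
    sigaddset(&set, SIGINT);

    int sig = 0;
    sigwait(&set, &sig);  // blocks until one of the signals is delivered

    std::cout << "Received signal " << sig << ", draining server" << std::endl;
    // Refuse new RPCs; give in-flight RPCs up to 25s to finish.
    server->Shutdown(std::chrono::system_clock::now() + std::chrono::seconds(25));
}

// In appMain(), before BuildAndStart():
//   sigset_t set;
//   sigemptyset(&set);
//   sigaddset(&set, SIGTERM);
//   sigaddset(&set, SIGINT);
//   pthread_sigmask(SIG_BLOCK, &set, nullptr);   // inherited by threads created later
//   std::thread shutdownThread(waitForSignalAndShutdown);
// and after server->Wait():
//   shutdownThread.join();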

And here is my Knative service YAML:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: MyKnativeService
spec:
  template:
    metadata:
      name: MyKnativeService-rev1
      annotations:
        # Target 10 in-flight-requests per pod.
        #autoscaling.knative.dev/target: "1"
        # container-concurrency-target-percentage: "80"
        autoscaling.knative.dev/targetUtilizationPercentage: "100"
        #autoscaling.knative.dev/metric: "concurrency"
        autoscaling.knative.dev/initialScale: "0"
        autoscaling.knative.dev/minScale: "0"
        autoscaling.knative.dev/maxScale: "100"
        autoscaling.knative.dev/scaleDownDelay: "3m"
    spec:
      containerConcurrency: 1
      containers:
        - name: MyKnativeService_container
          image: ppfaservice:latest
          imagePullPolicy: Always
msgurikar added the kind/question (Further information is requested) label on Jan 5, 2023
@dprotaso
Member

dprotaso commented Jan 5, 2023

Related is your discussion topic: https://github.com/knative/serving/discussions/13516

@dprotaso
Member

dprotaso commented Jan 5, 2023

Can you make a sample repo with instructions/Dockerfile to create a container with the app?

/triage needs-user-input

knative-prow bot added the triage/needs-user-input (waiting on a response from the reporter) label on Jan 5, 2023
@psschwei
Contributor

psschwei commented Jan 5, 2023

Discussion topic duplicating this issue: https://github.com/knative/serving/discussions/13572

@msgurikar
Author

Thank you. I will create the sample repos and share the details.

@msgurikar
Author

Hi,
Here are the sample repos:

  1. gRPC Service => https://github.com/msgurikar/grpc_bidirectional_server
     It contains the C++ gRPC bidirectional server code, packaged as a Knative-enabled service that scales up and down automatically.

  2. gRPC Client => https://github.com/msgurikar/grpc_bidirecional_client/tree/main/SampleGrpcBiDirection
     It contains the C# gRPC client that sends multiple requests to the C++ gRPC bidirectional service above.

  3. Test Client => https://github.com/msgurikar/grpc_bidirecional_client/tree/main/TestGrpcClient
     It contains a C# console app that drives the C# gRPC bidirectional client.

Thank you.

@msgurikar
Author

Hi Dave,
Any suggestions for fixing the above issue?
I see it when we send 20-30 requests at different intervals while some pods are in the Terminating state.

Thank you.

@jsanin-vmw

Hey @msgurikar, thanks for putting together this example. I am going through it. I started with the gRPC Service repo, but I am having problems creating the Docker image.

☁  grpc_bidirectional_server [main] ⚡  docker build .                                     
[+] Building 1.4s (6/6) FINISHED                                                                                                        
 => [internal] load build definition from Dockerfile                                                                               0.0s
 => => transferring dockerfile: 37B                                                                                                0.0s
 => [internal] load .dockerignore                                                                                                  0.0s
 => => transferring context: 2B                                                                                                    0.0s
 => CANCELED [internal] load metadata for docker.io/library/ubuntu:20.04                                                           1.3s
 => ERROR [internal] load metadata for docker.io/library/grpcbase:latest                                                           1.3s
 => [auth] library/grpcbase:pull token for registry-1.docker.io                                                                    0.0s
 => [auth] library/ubuntu:pull token for registry-1.docker.io                                                                      0.0s
------
 > [internal] load metadata for docker.io/library/grpcbase:latest:
------
failed to solve with frontend dockerfile.v0: failed to create LLB definition: pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed

It seems the grpcbase image is not found.
Could you provide additional steps for getting the server running?
Thanks

@msgurikar
Author

Hi,
Thank you for the response.
Yes, I have added Dockerfile_vcpkg to create the grpcbase image.

Once the service image is created, you need to deploy it as a Knative service using knative_service.yaml; it will then run and wait for incoming gRPC requests.
Let me know if you need any other details.

Thank you.

@jsanin-vmw

@msgurikar
Now the error is:

☁  grpc_bidirectional_server [main] ⚡  docker build .                                                 
[+] Building 3.1s (15/17)                                                                                                               
 => [internal] load build definition from Dockerfile                                                                               0.0s
 => => transferring dockerfile: 491B                                                                                               0.0s
 => [internal] load .dockerignore                                                                                                  0.0s
 => => transferring context: 2B                                                                                                    0.0s
 => [internal] load metadata for docker.io/library/ubuntu:20.04                                                                    1.1s
 => [internal] load metadata for docker.io/library/grpcbase:latest                                                                 0.0s
 => [auth] library/ubuntu:pull token for registry-1.docker.io                                                                      0.0s
 => [internal] load build context                                                                                                  0.0s
 => => transferring context: 6.13kB                                                                                                0.0s
 => [runtime 1/4] FROM docker.io/library/ubuntu:20.04@sha256:4a45212e9518f35983a976eead0de5eecc555a2f047134e9dd2cfc589076a00d      0.0s
 => CACHED [builder 1/7] FROM docker.io/library/grpcbase                                                                           0.0s
 => CACHED [runtime 2/4] WORKDIR /usr/local/bin                                                                                    0.0s
 => [builder 2/7] COPY . /usr/src/samplegrpc                                                                                       0.0s
 => [builder 3/7] WORKDIR /usr/src/samplegrpc                                                                                      0.0s
 => [builder 4/7] RUN mkdir build                                                                                                  0.2s
 => [builder 5/7] WORKDIR /usr/src/samplegrpc/build                                                                                0.0s
 => [builder 6/7] RUN cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_TOOLCHAIN_FILE=/usr/src/vcpkg/scripts/buildsystems/vcpkg.cmake ..   1.2s
 => ERROR [builder 7/7] RUN cmake --build .                                                                                        0.4s
------                                                                                                                                  
 > [builder 7/7] RUN cmake --build .:                                                                                                   
#15 0.404 [ 10%] Running grpc protocol buffer compiler on /usr/src/samplegrpc/ProtoApi/SampleService.proto. Custom options:             
#15 0.416 /usr/src/samplegrpc/ProtoApi/../ProtoApi/gen/ProtoApi/: No such file or directory                                             
#15 0.417 make[2]: *** [ProtoApi/CMakeFiles/SampleProtoApi.dir/build.make:79: ../ProtoApi/gen/ProtoApi/SampleService.grpc.pb.h] Error 1 
#15 0.417 make[1]: *** [CMakeFiles/Makefile2:114: ProtoApi/CMakeFiles/SampleProtoApi.dir/all] Error 2
#15 0.418 make: *** [Makefile:84: all] Error 2
------
executor failed running [/bin/sh -c cmake --build .]: exit code: 2
☁  grpc_bidirectional_server [main] ⚡  

@msgurikar
Author

Sorry about that. I have fixed it and pushed the changes into https://github.com/msgurikar/grpc_bidirectional_server
It should build now.

Thank you.

@jsanin-vmw

@msgurikar

So, under grpc_bidirectional_server I created the sample-grpc-service Docker image. This container exposes port 40056.

On the other hand we have grpc_bidirecional_client/SampleGrpcBiDirection. I created the client-samplegrpc-bidirection Docker image based on the Dockerfile in that folder. It exposes port 40081.

Then we have grpc_bidirecional_client/TestGrpcClient (which did not have a Dockerfile), so I went ahead and created one based on the Dockerfile present in SampleGrpcBiDirection. This is the Dockerfile I ended up with:

#See https://aka.ms/containerfastmode to understand how Visual Studio uses this Dockerfile to build your images for faster debugging.

FROM mcr.microsoft.com/dotnet/runtime:6.0 AS base
WORKDIR /app

FROM mcr.microsoft.com/dotnet/sdk:6.0 AS build
WORKDIR /src
COPY ["TestGrpcClient/TestGrpcClient.csproj", "TestGrpcClient/"]
RUN dotnet restore "TestGrpcClient/TestGrpcClient.csproj"
COPY . .
WORKDIR "/src/TestGrpcClient"
RUN dotnet build "TestGrpcClient.csproj" -c Release -o /app/build

FROM build AS publish
RUN dotnet publish "TestGrpcClient.csproj" -c Release -o /app/publish

FROM base AS final
WORKDIR /app
COPY --from=publish /app/publish .
# ENV QI_ALGO_SERVICE_TARGET="127.0.0.1:40056"
# EXPOSE 40081
ENTRYPOINT ["dotnet", "TestGrpcClient.dll"]

With this I created the test-grpc-client Docker image. Judging by the code at https://github.com/msgurikar/grpc_bidirecional_client/blob/main/TestGrpcClient/TestGrpcClient/Program.cs#L50, I concluded it calls the server in SampleGrpcBiDirection.

I started by running them on my local laptop, like this:

⚡  docker run --rm -p 40081:40081 --network host -it client-samplegrpc-bidirection
WARNING: Published ports are discarded when using host network mode
Current director is /app/Logs
server listening on port 40081
info: GrpcHostedService[0]
      server listening on port 40081
info: Microsoft.Hosting.Lifetime[0]
      Application started. Press Ctrl+C to shut down.
info: Microsoft.Hosting.Lifetime[0]
      Hosting environment: Production
info: Microsoft.Hosting.Lifetime[0]
      Content root path: /app/

And then ran the client like so:

☁  TestGrpcClient [main] ⚡  docker run --rm --network host -it test-grpc-client
Hello, World!
Press 9 to close, any other keys to start computation
o
Eneter number of requests to send to KNative grpc service
10
Exception throw Status(StatusCode="Unknown", Detail="Exception was thrown by handler.", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1676323513.178171751","description":"Error received from peer ipv4:127.0.0.1:40081","file":"/var/local/git/grpc/src/core/lib/surface/call.cc","file_line":953,"grpc_message":"Exception was thrown by handler.","grpc_status":2}")
Press 9 to close, any other keys to start computation


The output on the server console is:

Compute request recieved with number of requests is 10
SampleService compute failed Status(StatusCode="Internal", Detail="Failed to create secure client channel", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1676323513.135249079","description":"lame client channel","file":"/var/local/git/grpc/src/core/lib/surface/lame_client.cc","file_line":184,"grpc_message":"Failed to create secure client channel","grpc_status":13}")
SampleService compute failed Status(StatusCode="Internal", Detail="Failed to create secure client channel", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1676323513.135249193","description":"lame client channel","file":"/var/local/git/grpc/src/core/lib/surface/lame_client.cc","file_line":184,"grpc_message":"Failed to create secure client channel","grpc_status":13}")
SampleService compute failed Status(StatusCode="Internal", Detail="Failed to create secure client channel", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1676323513.135253498","description":"lame client channel","file":"/var/local/git/grpc/src/core/lib/surface/lame_client.cc","file_line":184,"grpc_message":"Failed to create secure client channel","grpc_status":13}")
SampleService compute failed Status(StatusCode="Internal", Detail="Failed to create secure client channel", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1676323513.135264331","description":"lame client channel","file":"/var/local/git/grpc/src/core/lib/surface/lame_client.cc","file_line":184,"grpc_message":"Failed to create secure client channel","grpc_status":13}")
Error occured while computing  1
Error occured while computing  2

Error occured while computing  1
Error occured while computing  2
Error occured while computing  0

Error occured while computing  1
Error occured while computing  2
Error occured while computing  0
Error occured while computing  9

Error occured while computing  1
Error occured while computing  2
Error occured while computing  0
Error occured while computing  9

SampleService compute failed Status(StatusCode="Internal", Detail="Failed to create secure client channel", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1676323513.158793765","description":"lame client channel","file":"/var/local/git/grpc/src/core/lib/surface/lame_client.cc","file_line":184,"grpc_message":"Failed to create secure client channel","grpc_status":13}")
Error occured while computing  1
Error occured while computing  2
Error occured while computing  0
Error occured while computing  9
Error occured while computing  7

SampleService compute failed Status(StatusCode="Internal", Detail="Failed to create secure client channel", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1676323513.159091995","description":"lame client channel","file":"/var/local/git/grpc/src/core/lib/surface/lame_client.cc","file_line":184,"grpc_message":"Failed to create secure client channel","grpc_status":13}")
Error occured while computing  1
Error occured while computing  2
Error occured while computing  0
Error occured while computing  9
Error occured while computing  7
Error occured while computing  5

SampleService compute failed Status(StatusCode="Internal", Detail="Failed to create secure client channel", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1676323513.159304945","description":"lame client channel","file":"/var/local/git/grpc/src/core/lib/surface/lame_client.cc","file_line":184,"grpc_message":"Failed to create secure client channel","grpc_status":13}")
Error occured while computing  1
Error occured while computing  2
Error occured while computing  0
Error occured while computing  9
Error occured while computing  7
Error occured while computing  5
Error occured while computing  4

SampleService compute failed Status(StatusCode="Internal", Detail="Failed to create secure client channel", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1676323513.159472847","description":"lame client channel","file":"/var/local/git/grpc/src/core/lib/surface/lame_client.cc","file_line":184,"grpc_message":"Failed to create secure client channel","grpc_status":13}")
Error occured while computing  1
Error occured while computing  2
Error occured while computing  0
Error occured while computing  9
Error occured while computing  7
Error occured while computing  5
Error occured while computing  4
Error occured while computing  3

SampleService compute failed Status(StatusCode="Internal", Detail="Failed to create secure client channel", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1676323513.158438725","description":"lame client channel","file":"/var/local/git/grpc/src/core/lib/surface/lame_client.cc","file_line":184,"grpc_message":"Failed to create secure client channel","grpc_status":13}")
SampleService compute failed Status(StatusCode="Internal", Detail="Failed to create secure client channel", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1676323513.158922477","description":"lame client channel","file":"/var/local/git/grpc/src/core/lib/surface/lame_client.cc","file_line":184,"grpc_message":"Failed to create secure client channel","grpc_status":13}")
Error occured while computing  1
Error occured while computing  2
Error occured while computing  0
Error occured while computing  9
Error occured while computing  7
Error occured while computing  5
Error occured while computing  4
Error occured while computing  3
Error occured while computing  6

Error occured while computing  1
Error occured while computing  2
Error occured while computing  0
Error occured while computing  9
Error occured while computing  7
Error occured while computing  5
Error occured while computing  4
Error occured while computing  3
Error occured while computing  6
Error occured while computing  8

One or more Compute tasks have been failed due to Status(StatusCode="Internal", Detail="Failed to create secure client channel", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1676323513.135264331","description":"lame client channel","file":"/var/local/git/grpc/src/core/lib/surface/lame_client.cc","file_line":184,"grpc_message":"Failed to create secure client channel","grpc_status":13}")

Questions:

  1. Did I miss something? How is all of this tied together?
  2. Is there a client for testing the server running on 40056?

@msgurikar
Author

msgurikar commented Feb 13, 2023 via email

@jsanin-vmw

@msgurikar

I deployed the sample server as a Knative service.

I have been executing the test over and over again and I have not gotten any failures.
However, I have noticed that the pods take a long time to terminate. I noticed the same when I ran it in my local Docker Desktop as a single container.

I am running this in a 3-node GKE cluster with 4 vCPU and 16 GB each.
I am using Contour to access the service remotely.

Questions:

  1. Could you provide additional info about your environment?
  2. Are you using Contour for accessing your kservice?
  3. Why is the server program taking so long to terminate?

@jsanin-vmw

For instance, this is how the list of pods looks after some minutes without sending any requests:

NAME                                                    READY   STATUS        RESTARTS   AGE
sample-grpc-service-00001-deployment-5b6cb86d8b-2kt5r   2/2     Running       0          17m
sample-grpc-service-00001-deployment-5b6cb86d8b-2n8tv   1/2     Terminating   0          13m
sample-grpc-service-00001-deployment-5b6cb86d8b-4hgkt   1/2     Terminating   0          9m12s
sample-grpc-service-00001-deployment-5b6cb86d8b-4hnnm   2/2     Running       0          17m
sample-grpc-service-00001-deployment-5b6cb86d8b-4wdvk   2/2     Running       0          17m
sample-grpc-service-00001-deployment-5b6cb86d8b-4xjdm   2/2     Running       0          13m
sample-grpc-service-00001-deployment-5b6cb86d8b-55ldw   1/2     Terminating   0          9m12s
sample-grpc-service-00001-deployment-5b6cb86d8b-9gmxw   1/2     Terminating   0          9m12s
sample-grpc-service-00001-deployment-5b6cb86d8b-9ls82   2/2     Running       0          17m
sample-grpc-service-00001-deployment-5b6cb86d8b-9mzq4   1/2     Terminating   0          13m
sample-grpc-service-00001-deployment-5b6cb86d8b-9p9v5   1/2     Terminating   0          9m11s
sample-grpc-service-00001-deployment-5b6cb86d8b-c59jb   1/2     Terminating   0          13m
sample-grpc-service-00001-deployment-5b6cb86d8b-mdtjl   1/2     Terminating   0          13m
sample-grpc-service-00001-deployment-5b6cb86d8b-nk46k   1/2     Terminating   0          9m12s
sample-grpc-service-00001-deployment-5b6cb86d8b-q72c6   1/2     Terminating   0          9m12s
sample-grpc-service-00001-deployment-5b6cb86d8b-rvwcp   1/2     Terminating   0          9m12s
sample-grpc-service-00001-deployment-5b6cb86d8b-scvsv   1/2     Terminating   0          9m12s
sample-grpc-service-00001-deployment-5b6cb86d8b-slfs7   1/2     Terminating   0          9m11s
sample-grpc-service-00001-deployment-5b6cb86d8b-w5wj6   1/2     Terminating   0          9m11s
sample-grpc-service-00001-deployment-5b6cb86d8b-x9k6n   1/2     Terminating   0          13m

IMHO, it should not take that long to terminate the process.

@msgurikar
Author

@jsanin-vmw
Thank you for trying it out and sharing the details. Here are my answers.

Questions:

Could you provide additional info about your environment?
=> I am running in an Azure Kubernetes Service (AKS) cluster, and Knative is set up on an 8-CPU node. There are many nodes in the AKS cluster.
Are you using Contour for accessing your kservice?
=> We are using the Kourier ingress.
Why is the server program taking so long to terminate?
=> Not sure. The C++ gRPC service does not exit when the pod goes into the Terminating state; it waits until it receives the KILL signal from Kubernetes, and until then it keeps accepting new incoming requests. I want to understand how I can make the Knative pod stop accepting new requests once it enters the Terminating state. I tried handling SIGTERM and shutting down the gRPC server there, but it didn't help much. Do you have any suggestions on this? (One possible approach is sketched right after this comment.)

Have you tried sending 30 requests, waiting for some pods to start terminating, and then sending another 20 and checking? If I repeat this 4-5 times, I hit the issue.

Thank you.
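For reference, a minimal sketch of one way to do this, building on the default health check service that the code in the original post already enables via grpc::EnableDefaultHealthCheckService(true): on SIGTERM, flip the gRPC health status to NOT_SERVING so anything probing the health service stops seeing the pod as ready, then drain with a deadline. Whether this actually stops Knative/queue-proxy from routing depends on how the readiness probe is wired, so treat it as a starting point rather than a confirmed fix; the function name and the 25-second deadline are illustrative.

#include <chrono>
#include <memory>

#include <grpcpp/grpcpp.h>
#include <grpcpp/health_check_service_interface.h>

extern std::unique_ptr<grpc::Server> server;  // the same global as in the issue

// Call this when SIGTERM arrives (e.g. from a dedicated signal-waiting thread).
void drainOnTerm()
{
    // Mark every registered service (and the overall server) as NOT_SERVING so
    // gRPC health checks start failing before we stop taking traffic.
    if (grpc::HealthCheckServiceInterface* health = server->GetHealthCheckService()) {
        health->SetServingStatus(false);
    }
    // Refuse new RPCs and give in-flight RPCs up to 25s to finish; the deadline
    // must stay below the pod's termination grace period.
    server->Shutdown(std::chrono::system_clock::now() + std::chrono::seconds(25));
}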

@jsanin-vmw

How do you know that pods in the Terminating state still get requests?

I saw that once they go to Terminating, the readiness probe starts receiving 503 responses. Let me get the message again.

Have you tried sending 30 requests, waiting for some pods to start terminating, and then sending another 20 and checking?

It only allows me to send 20 requests max.

@msgurikar
Author

I was checking the logs of the terminating pods; since it takes a while for them to get killed, I see the error on the sample gRPC bidirection client side.
Ok.

Thank you.

@jsanin-vmw

By checking the status of the pod when it goes to Terminating, I see this:

  Normal   Created      4m37s              kubelet            Created container queue-proxy
  Normal   Started      4m37s              kubelet            Started container queue-proxy
  Normal   Killing      39s                kubelet            Stopping container sample-grpc-service
  Normal   Killing      39s                kubelet            Stopping container queue-proxy
  Warning  Unhealthy    19s (x3 over 39s)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503
  Warning  Unhealthy    9s                 kubelet            Readiness probe failed: Get "http://10.224.1.160:8013/": dial tcp 10.224.1.160:8013: connect: connection refused

Another example is this:

  Normal   Created      5m22s              kubelet            Created container queue-proxy
  Normal   Started      5m22s              kubelet            Started container queue-proxy
  Normal   Killing      83s                kubelet            Stopping container sample-grpc-service
  Normal   Killing      83s                kubelet            Stopping container queue-proxy
  Warning  Unhealthy    53s (x3 over 73s)  kubelet            Readiness probe failed: HTTP probe failed with statuscode: 503
  Warning  Unhealthy    3s (x5 over 43s)   kubelet            Readiness probe failed: Get "http://10.224.1.156:8013/": dial tcp 10.224.1.156:8013: connect: connection refused

@jsanin-vmw

I was checking the logs of the terminating pods; since it takes a while for them to get killed, I see the error on the sample gRPC bidirection client side.

I have not received any errors on the sample gRPC bidirection client side. I always get:

WriteRequest completed success.

even though there are a bunch of Terminating pods.

@msgurikar
Author

Ok. So you don't see new incoming requests going to Terminating pods. I can see it on my side, and at times, when a terminating pod serving a request gets killed after the grace period, I get the error.
Thank you for trying it out.

@jsanin-vmw

Would you be able to modify your gRPC server code to include the pod name, as well as its status, in the response message?
The pod name can be read from the SERVING_POD env variable, and its status as described here: https://kubernetes.io/docs/tasks/run-application/access-api-from-pod/ . An example in Go can be found here: https://github.com/kubernetes/client-go/blob/master/examples/in-cluster-client-configuration/main.go#L62
I would guess a similar thing can be done in C++.
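In C++ that could look something like the sketch below. The helper name, the env var names, and the set_message call are illustrative and would need to match whatever the deployment actually injects and whatever the proto defines.

#include <cstdlib>
#include <initializer_list>
#include <string>

// Hypothetical helper: return the pod name from an env var if present
// (e.g. SERVING_POD, or a POD_NAME injected via the downward API).
std::string podName()
{
    for (const char* var : {"SERVING_POD", "POD_NAME"}) {
        if (const char* value = std::getenv(var)) {
            return std::string(value);
        }
    }
    return "unknown-pod";
}

// Inside the RPC handler the reply could then carry it, for example:
//   response.set_message(result + " (served by " + podName() + ")");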

In the meantime, I will give it a try in AKS with Kourier.

@msgurikar
Author

Sure. I will modify the server code and let you know once it is updated in git.
Thank you so much.

@jsanin-vmw

You might want to use this to get the pod name:
https://stackoverflow.com/questions/73318743/kubernetes-how-to-get-pod-name-im-running-in

@jsanin-vmw

@msgurikar

I wanted to report that I tested it on:

  • AKS
  • Kourier
  • Knative 1.9
  • a 3-node cluster, with 4 vCPU and 16 GB each

I observed the same behavior reported earlier.
No errors.

The Knative service is running on the cluster.

I am running the client and the test locally, as Docker containers on my laptop that connect to the remote ksvc.

I went ahead and created another instance of the tests, and that way I was able to run 40 requests simultaneously.
I also waited for the pods to go to the Terminating state with 2/2 containers running. They stayed like that for 5 to 10 seconds and then went to 1/2 containers Terminating.
Then I triggered 40 requests and faced no issues.

I did this several times.

@msgurikar
Author

@jsanin-vmw
Ok. Thank you so much for all the tests. Maybe something is wrong with our setup. I will try again on my side.
I really appreciate it. Thank you.
One thing I want to be clear on:
"When some pods go to the Terminating state (in a case where there are 20 pods), do new incoming requests (around 20 new requests) go to the terminating pods or not?"

@jsanin-vmw

"When some pods goes to terminating state(in a case when there are 20 pods), do new incoming requests(around 20 new requests) going to terminating pods or Not?"

No.

I would like to give more details about how I am testing this.

  1. Create the server image:
    Go to grpc_bidirectional_server
    Build the base image like so:
docker build -t grpcbase -f Dockerfile_vcpkg .

Then build the grpc service image:

docker build -t sample-grpc-service . 

Push this image to a registry.

Create a kn service with this image:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: sample-grpc-service
spec:
  template:
    metadata:
      #      name: sample-grpc-service_1
      annotations:
        # Target 10 in-flight-requests per pod.
        #autoscaling.knative.dev/target: "1"
        # container-concurrency-target-percentage: "80"
        autoscaling.knative.dev/targetUtilizationPercentage: "100"
        #autoscaling.knative.dev/metric: "concurrency"
        # autoscaling.knative.dev/initialScale: "0"
        autoscaling.knative.dev/minScale: "0"
        autoscaling.knative.dev/maxScale: "100"
        autoscaling.knative.dev/scaleDownDelay: "3m"
    spec:
      containerConcurrency: 1
      containers:
        - name: sample-grpc-service
          image: <YOUR-REGISTRY>/sample-grpc-service:latest
          imagePullPolicy: Always
          ports:
            - name: h2c
              containerPort: 40056

Patch the ConfigMap config-domain with the domain of your cluster

kubectl patch configmap/config-domain \                                                  
  --namespace knative-serving \     
  --type merge \      
  --patch '{"data":{"<YOUR-DOMAIN>":""}}'

Setup your DNS or your /etc/hosts to map this domain to your public cluster IP address.

Get your ksvc host name:

k get ksvc
NAME                  URL                                                   LATESTCREATED               LATESTREADY                 READY   REASON
sample-grpc-service   http://sample-grpc-service.default.<YOUR-DOMAIN>      sample-grpc-service-00001   sample-grpc-service-00001   True    

  2. Create the grpc_bidirecional_client and the test grpc client:
    Go to grpc_bidirecional_client/SampleGrpcBiDirection
    Create the image:
docker build -t client-samplegrpc-bidirection . 

Now go to grpc_bidirecional_client/TestGrpcClient.
Use this Dockerfile to create the image:


FROM mcr.microsoft.com/dotnet/runtime:6.0 AS base
WORKDIR /app

FROM mcr.microsoft.com/dotnet/sdk:6.0 AS build
WORKDIR /src
COPY ["TestGrpcClient/TestGrpcClient.csproj", "TestGrpcClient/"]
RUN dotnet restore "TestGrpcClient/TestGrpcClient.csproj"
COPY . .
WORKDIR "/src/TestGrpcClient"
RUN dotnet build "TestGrpcClient.csproj" -c Release -o /app/build

FROM build AS publish
RUN dotnet publish "TestGrpcClient.csproj" -c Release -o /app/publish

FROM base AS final
WORKDIR /app
COPY --from=publish /app/publish .
ENTRYPOINT ["dotnet", "TestGrpcClient.dll"]

Create the image

docker build -t test-grpc-client .
  3. Run the test

Run the grpc_bidirecional_client in one terminal

docker run --rm --network host -e MY_SERVICE_TARGET=sample-grpc-service.default.<YOUR-DOMAIN>:80 -it client-samplegrpc-bidirection

Now run the test-grpc-client in another terminal:

docker run --rm --network host -it test-grpc-client

Run the tests with the number of requests you want to create.

The current test-grpc-client limits the number of requests to a max of 20. If you want to create more requests, open a new terminal, run another test-grpc-client, and trigger more requests.

I will put this down for now. I was not able to reproduce the behavior you reported after testing this in GKE and AKS.

Let me know when you modify the server to send back the POD name and its status in the message.

@msgurikar
Author

@jsanin-vmw
Thank you for explaining your tests in detail.
In my case, I deploy grpc_bidirectional_service and grpc_bidirectional_client as Knative services in AKS and call grpc_bidirectional_client from the test client locally.
These are the same steps you explained; I am just wondering how, in my case, the same code running in AKS ends up sending new incoming requests to terminating pods. I don't see the issue with a small number of requests. E.g. with 5 requests, 5 pods get created and run; after 5 minutes, 2 pods start terminating; if I then send 5 new requests, 5 new pods get created, and there is no issue.
The problem comes when I send 20 requests: 20 pods get created and run, and once the requests are completed, after 5 minutes, a few pods (10) start terminating. Now when I send 20 new requests, some of them go to terminating pods and some create new pods. I am not able to figure out why some requests are going to terminating pods.

I have updated the grpc_bidirectional_service and client repos to include the pod name in the message. I tried to also include the pod status, but I couldn't find a way to get it in C++; I see one for C#.

Thank you so much.

@jsanin-vmw

@msgurikar
Could you share your YAML for creating the grpc_bidirectional_client kn service?

@jsanin-vmw

@msgurikar

I added this label to the sample-grpc-service service, since it is going to be accessed locally only:

  labels:
    networking.knative.dev/visibility: cluster-local

This is the YAML for client-samplegrpc-bidirection:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: client-samplegrpc-bidirection
spec:
  template:
    metadata:
      #      name: client-samplegrpc-bidirection
      annotations:
        # Target 10 in-flight-requests per pod.
        #autoscaling.knative.dev/target: "1"
        # container-concurrency-target-percentage: "80"
        autoscaling.knative.dev/targetUtilizationPercentage: "100"
        #autoscaling.knative.dev/metric: "concurrency"
        # autoscaling.knative.dev/initialScale: "0"
        autoscaling.knative.dev/minScale: "0"
        autoscaling.knative.dev/maxScale: "100"
        autoscaling.knative.dev/scaleDownDelay: "3m"
    spec:
      containerConcurrency: 1
      containers:
        - name: client-samplegrpc-bidirection
          image: <YOUR-REGISTRY>client-samplegrpc-bidirection:latest
          imagePullPolicy: Always
          ports:
            - name: h2c
              containerPort: 40081
          env:
#            - name: POD_NAME
#              valueFrom:
#                fieldRef:
#                  fieldPath: metadata.name
            - name: MY_SERVICE_TARGET
              value: "sample-grpc-service.default.svc.cluster.local:80"

I modified this line https://github.com/msgurikar/grpc_bidirecional_client/blob/main/TestGrpcClient/TestGrpcClient/Program.cs#L50 to be:

var channel = new Channel("client-samplegrpc-bidirection.default.<YOUR-DOMAIN>:80", ChannelCredentials.Insecure);

and created a new image for test-grpc-client. I run this from my laptop, but I am seeing errors even for a single request.

This is what I see.

on the test client:

Press 9 to close, any other keys to start computation
1
Eneter number of requests to send to KNative grpc service
1
Exception throw Status(StatusCode="Unknown", Detail="Exception was thrown by handler.", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1676559847.567200700","description":"Error received from peer ipv4:35.222.83.190:80","file":"/var/local/git/grpc/src/core/lib/surface/call.cc","file_line":953,"grpc_message":"Exception was thrown by handler.","grpc_status":2}")
Press 9 to close, any other keys to start computation

On the client-samplegrpc-bidirection

Compute request recieved with number of requests is 1
WriteRequest completed success.
Error occured while computing  0

One or more Compute tasks have been failed due to Status(StatusCode="Internal", Detail="Received RST_STREAM with error code 0", DebugException="Grpc.Core.Internal.CoreErrorDetailException: {"created":"@1676559847.682505929","description":"Error received from peer ipv4:10.221.46.94:80","file":"/var/local/git/grpc/src/core/lib/surface/call.cc","file_line":953,"grpc_message":"Received RST_STREAM with error code 0","grpc_status":13}")

On the sample-grpc-service

Defaulted container "sample-grpc-service" out of: sample-grpc-service, queue-proxy
Server waiting 
--------------------------- Start of request-----------------
Request recived Message is 0
Input message is Message is 0
ReadCancel-- Process completed.
Request Completed Message is 0
Process completed and took 5.0225seconds
--------------------------- End of request-----------------
free(): double free detected in tcache 2

The sample-grpc-service exited after this, and the pod restarted.

Am I missing something?

@msgurikar
Author

msgurikar commented Feb 16, 2023

@msgurikar Could you share your YAML for creating the grpc_bidirectional_client kn service?

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: sample-bidirection-client
spec:
  template:
    metadata:
      name: sample-bidirection-client-1
      annotations:
        # Target 10 in-flight-requests per pod.
        #autoscaling.knative.dev/target: "1"
        # container-concurrency-target-percentage: "80"
        autoscaling.knative.dev/targetUtilizationPercentage: "100"
        #autoscaling.knative.dev/metric: "concurrency"
        # autoscaling.knative.dev/initialScale: "0"
        autoscaling.knative.dev/minScale: "1"
        autoscaling.knative.dev/maxScale: "20"
        autoscaling.knative.dev/scaleDownDelay: "10m"
    spec:
      containerConcurrency: 1
      containers:
      - name: sample-bidirection-client
        image: samplebidirectionclient:latest
        imagePullPolicy: Always
        env:       
          - name: "GRPC_DNS_RESOLVER"
            value: "native"
          - name: GRPC_TRACE
            value: "all"
          - name: "GRPC_VERBOSITY"
            value: "ERROR"          
          - name: MY_SERVICE_TARGET          
            value:  mygrpcbidirservice.default.svc.cluster.local:80             
          - name: MY_SERVICE_TARGET_DEFAULT_AUTHORITY
            value: mygrpcbidirservice.default.example.com          
        ports:
          - name: h2c
            containerPort: 40081
      

@msgurikar
Author

@jsanin-vmw
Changing this:

  labels:
    networking.knative.dev/visibility: cluster-local

makes the grpc bidirection service restart, and I am not sure why.
The message is processed correctly on the service side, but I am unable to understand why it is restarting.

@jsanin-vmw

Thanks for the ksvc definition @msgurikar

I know why the sample-grpc-service was stopping. It was because I did not set the POD_NAME env var. I set it to some hard coded value like so:

          env:
            - name: POD_NAME
              value: "SOME_POD_NAME"

This is mainly because Knative does not let me use fieldRef on env vars. Knative error:

Error from server (BadRequest): error when creating "sample-grpc-service-ksvc.yaml": admission webhook "validation.webhook.serving.knative.dev" denied the request: validation failed: must not set the field(s): spec.template.spec.containers[0].env[0].valueFrom.fieldRef

With this env var in place, I was able to run the sample-grpc-service and the client-samplegrpc-bidirection on the cluster, while the test-grpc-client ran on my machine.

The behavior was the same as previously reported. All requests were handled correctly and no error messages were seen.

I could not reproduce the issue reported.

@msgurikar
Author

@jsanin-vmw
Ah, ok.
Thank you for the confirmation. I need to check my setup.

@github-actions

This issue is stale because it has been open for 90 days with no
activity. It will automatically close after 30 more days of
inactivity. Reopen the issue with /reopen. Mark the issue as
fresh by adding the comment /remove-lifecycle stale.

github-actions bot added the lifecycle/stale label on May 24, 2023
@dprotaso
Member

Going to close this out - feel free to re-open if you're able to repro.
