
Knative Timeout Issues with Long-Running Requests #15352

Closed
nkreiger opened this issue Jun 26, 2024 · 10 comments
Labels
area/autoscale, kind/bug, lifecycle/stale


@nkreiger

nkreiger commented Jun 26, 2024

In what area(s)?

/area autoscale

What version of Knative?

1.13.1

Expected Behavior

Knative Service Pod and Queue Proxy respect the revision timeout passed.

Actual Behavior

The Knative queue-proxy throws this error:

{"severity":"ERROR","timestamp":"2024-06-25T23:47:00.022775794Z","logger":"queueproxy","caller":"network/error_handler.go:33","message":"error reverse proxying request; sockstat: sockets: used 13\nTCP: inuse 3 orphan 4 tw 14 alloc 2483 mem 309\nUDP: inuse 0 mem 8\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","commit":"41769de","knative.dev/key":"test/plugin-github-00008","knative.dev/pod":"plugin-github-00008-deployment-6749cf875b-v9dhz","error":"context canceled","stacktrace":"knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/[email protected]/network/error_handler.go:33\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:475\nknative.dev/serving/pkg/queue.(*appRequestMetricsHandler).ServeHTTP\n\tknative.dev/serving/pkg/queue/request_metric.go:199\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ProxyHandler.func3\n\tknative.dev/serving/pkg/queue/handler.go:76\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2136\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ForwardedShimHandler.func4\n\tknative.dev/serving/pkg/queue/forwarded_shim.go:54\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2136\nknative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4\n\tknative.dev/serving/pkg/http/handler/timeout.go:118"}

This causes endless retries, resulting in:

plugin-github-00008-deployment-6749cf875b-jm55n                   1/2     Terminating         0               123m
plugin-github-00008-deployment-6749cf875b-lbpsf                   1/2     Terminating         0               5h16m
plugin-github-00008-deployment-6749cf875b-qxnhz                   1/2     Terminating         0               157m
plugin-github-00008-deployment-6749cf875b-rpm5g                   1/2     Terminating         0               3h58m
plugin-github-00008-deployment-6749cf875b-v9dhz                   2/2     Running             0               15m
# curl --location 'http://plugin-github-test.test.svc.cluster.local' \
--header 'Ce-Id: replay-189098f' \
--header 'Ce-Specversion: 1.0' \
--header 'Ce-Type: attestation.test' \
--header 'Ce-Source: ci.commit.codereview.test' \
--header 'Content-Type: application/json' \
--data '{}'> > > > > > 
activator request timeout# 

a bunch of pods that have to be deleted manually, while the client gets an activator request timeout (curl output above).

Steps to Reproduce the Problem

Apply the following configmap:

apiVersion: v1
data:
  enable-service-links: "false"
  max-revision-timeout-seconds: "604800"
  revision-idle-timeout-seconds: "0"
  revision-response-start-timeout-seconds: "600"
  revision-timeout-seconds: "18500"
kind: ConfigMap
metadata:
  annotations:
    knative.dev/example-checksum: e7973912
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"enable-service-links":"false","max-revision-timeout-seconds":"604800","revision-idle-timeout-seconds":"0","revision-response-start-timeout-seconds":"600","revision-timeout-seconds":"18500"},"kind":"ConfigMap","metadata":{"annotations":{"knative.dev/example-checksum":"e7973912"},"creationTimestamp":"2023-11-28T16:59:46Z","labels":{"app.kubernetes.io/component":"controller","app.kubernetes.io/name":"knative-serving","app.kubernetes.io/version":"1.13.1"},"name":"config-defaults","namespace":"knative-serving","resourceVersion":"488776669","uid":"2496e813-166c-4756-aa7f-f1a83f946cdd"}}
  creationTimestamp: "2023-11-28T16:59:46Z"
  labels:
    app.kubernetes.io/component: controller
    app.kubernetes.io/name: knative-serving
    app.kubernetes.io/version: 1.13.1
  name: config-defaults
  namespace: knative-serving
  resourceVersion: "488803732"
  uid: 2496e813-166c-4756-aa7f-f1a83f946cdd

Restart Knative Serving:

kubectl delete pods -n knative-serving --all

Apply a new Knative Service that has a delay in it.

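For example, a minimal sketch of such a delayed service, assuming a plain Go HTTP server (the 15-minute sleep, the handler body, and the port handling are illustrative, not the reporter's actual plugin):

package main

import (
	"log"
	"net/http"
	"os"
	"time"
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		log.Println("beginning to sleep")
		// Sleep longer than the 10-minute response-start window before
		// writing any response, so the timeout described below fires.
		time.Sleep(15 * time.Minute)
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("done"))
	})

	port := os.Getenv("PORT") // Knative injects PORT; fall back to 8080 locally
	if port == "" {
		port = "8080"
	}
	log.Fatal(http.ListenAndServe(":"+port, nil))
}
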
Apply a new Trigger reading from a Broker.

Send a request to that Broker, which will activate the Trigger:

Logs from the service:

Defaulted container "github-plugin" out of: github-plugin, queue-proxy
2024/06/26 00:58:16 /var/run/ko
2024/06/26 00:58:16 /var/run/ko
Using config file: /var/run/ko/.config.yaml
time="2024-06-26T00:58:17Z" level=info msg="beginnig to sleep."
time="2024-06-26T01:08:17Z" level=info msg="beginnig to sleep."

Logs from the queue-proxy:

{"severity":"ERROR","timestamp":"2024-06-26T01:08:15.397562115Z","logger":"queueproxy","caller":"network/error_handler.go:33","message":"error reverse proxying request; sockstat: sockets: used 9\nTCP: inuse 1 orphan 2 tw 12 alloc 207 mem 9\nUDP: inuse 0 mem 0\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","commit":"41769de","knative.dev/key":"test/plugin-github-test-00001","knative.dev/pod":"plugin-github-test-00001-deployment-76cd4695bf-nfzch","error":"context canceled","stacktrace":"knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/[email protected]/network/error_handler.go:33\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:475\nknative.dev/serving/pkg/queue.(*appRequestMetricsHandler).ServeHTTP\n\tknative.dev/serving/pkg/queue/request_metric.go:199\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ProxyHandler.func3\n\tknative.dev/serving/pkg/queue/handler.go:76\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2136\nknative.dev/serving/pkg/queue/sharedmain.mainHandler.ForwardedShimHandler.func4\n\tknative.dev/serving/pkg/queue/forwarded_shim.go:54\nnet/http.HandlerFunc.ServeHTTP\n\tnet/http/server.go:2136\nknative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4\n\tknative.dev/serving/pkg/http/handler/timeout.go:118"}

Timeout difference: 2024-06-26T00:58:17Z vs 2024-06-26T01:08:15.397562115Z, indicating a 10-minute timeout.

Error from the activator:

{"severity":"ERROR","timestamp":"2024-06-26T01:08:15.397202332Z","logger":"activator","caller":"network/error_handler.go:33","message":"error reverse proxying request; sockstat: sockets: used 146\nTCP: inuse 8 orphan 2 tw 99 alloc 2264 mem 285\nUDP: inuse 0 mem 8\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","commit":"41769de","knative.dev/controller":"activator","knative.dev/pod":"activator-b7789d69d-lcwb7","knative.dev/key":"test/plugin-github-test-00001","error":"context canceled","stacktrace":"knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/[email protected]/network/error_handler.go:33\nknative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest.func1\n\tknative.dev/serving/pkg/activator/handler/handler.go:141\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:475\nknative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest\n\tknative.dev/serving/pkg/activator/handler/handler.go:144\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP.func1\n\tknative.dev/serving/pkg/activator/handler/handler.go:97\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try.func1\n\tknative.dev/serving/pkg/activator/net/throttler.go:237\nknative.dev/serving/pkg/activator/net.(*infiniteBreaker).Maybe\n\tknative.dev/serving/pkg/activator/net/throttler.go:767\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try\n\tknative.dev/serving/pkg/activator/net/throttler.go:227\nknative.dev/serving/pkg/activator/net.(*Throttler).Try\n\tknative.dev/serving/pkg/activator/net/throttler.go:522\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP\n\tknative.dev/serving/pkg/activator/handler/handler.go:90\nknative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4\n\tknative.dev/serving/pkg/http/handler/timeout.go:118"}

Activator termination grace period:

terminationGracePeriodSeconds: 604800

The line of code in the activator that starts the timeout ("activator request timeout") seems to reference https://github.com/knative/serving/blob/main/pkg/apis/config/defaults.go#L45 (the max timeout) instead of the max timeout I specified.

More Validation

My Knative Service spec:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  annotations:
  creationTimestamp: "2024-06-25T02:32:13Z"
  generation: 8
  labels:
    app.fianu.io/type: occurrence
    app.kubernetes.io/component: adapter
    app.kubernetes.io/instance: github
    app.kubernetes.io/managed-by: fianu-controller
    app.kubernetes.io/name: plugin
    app.kubernetes.io/part-of: fianu
    argocd.argoproj.io/compare-options: IgnoreExtraneous
    destinations.fianu.io/broker: fianu-default
    networking.knative.dev/visibility: cluster-local
  name: plugin-github-test
  namespace: test
spec:
  template:
    metadata:
      annotations:
        argocd.argoproj.io/compare-options: IgnoreExtraneous
        sidecar.istio.io/proxyCPU: 50m
        sidecar.istio.io/proxyCPULimit: 150m
        sidecar.istio.io/proxyMemory: 128Mi
        sidecar.istio.io/proxyMemoryLimit: 256Mi
      creationTimestamp: null
    spec:
      containerConcurrency: 0
      containers:
      - env:
        - name: FIANU_PLATFORM_SEMANTIC_VERSION
          valueFrom:
            configMapKeyRef:
              key: fianu_semantic_version
              name: fianu-hosts
        - configMapRef:
            name: fianu-hosts-env
        image: docker.io
        imagePullPolicy: Always
        name: github-plugin
        readinessProbe:
          successThreshold: 1
          tcpSocket:
            port: 0
        resources: {}
        securityContext: {}
      responseStartTimeoutSeconds: 600
      serviceAccountName: github
      timeoutSeconds: 18500

timeoutSeconds doesn't seem to be respected.


I also built the activator from source and added logging. This is what I added:

log.Println("making request with values ", time.Duration(*rev.Spec.TimeoutSeconds)*time.Second, responseStartTimeout, idleTimeout)

From the Broker:

2024/06/26 01:46:14 makign request with values  168h0m0s 10m0s 0s

With curl:

2024/06/26 01:47:53 makign request with values  168h0m0s 10m0s 0s

So it looks like the timeout is being set properly from spec.TimeoutSeconds.


My only remaining thought is that my delivery timeout is not being respected:

spec:
  broker: default
  delivery:
    timeout: PT2H
    retry: 1
  filter:
    attributes:
      source: ci.commit.codereview.test
      type: attestation.test
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: plugin-github-test
      namespace: test

But this does not make sense, because I can reproduce with curl:

# curl --location 'http://plugin-github-test.test.svc.cluster.local' \
--header 'Ce-Id: replay-189098f' \
--header 'Ce-Specversion: 1.0' \
--header 'Ce-Type: attestation.test' \
--header 'Ce-Source: ci.commit.codereview.test' \
--header 'Content-Type: application/json' \
--data '{}'> > > > > > 
activator request timeout# 

which wouldn't even go through Kafka, and I get the same timeout.

Is there something I need to do on the Kafka side?

@nkreiger added the kind/bug label on Jun 26, 2024
@nkreiger
Author

**UPDATE:** I added extra logging to the activator deployment and was able to log the cancellation.

2024/06/26 02:20:10 timeoutHandler: request timed out with <-responseStartTimeoutCh

This indicates that the channel being hit is:

https://github.com/knative/serving/blob/main/pkg/http/handler/timeout.go#L140

which is causing the cancellation. Running validation after extending this.
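
For context, a simplified sketch of the pattern involved (an illustration only, not the actual Knative implementation in pkg/http/handler/timeout.go; the names withResponseStartTimeout and startNotifyingWriter are invented for this sketch): if the wrapped handler has not started writing its response before the response-start timeout fires, the request is failed and its context is cancelled, which is what surfaces as "context canceled" in the queue-proxy and activator logs above.

package timeoutsketch

import (
	"context"
	"net/http"
	"sync"
	"time"
)

// startNotifyingWriter signals when the wrapped handler writes its first
// header or byte, and swallows writes that arrive after a timeout.
type startNotifyingWriter struct {
	http.ResponseWriter
	mu       sync.Mutex
	started  bool
	timedOut bool
	startCh  chan struct{} // closed on the first WriteHeader/Write
}

func (w *startNotifyingWriter) markStarted() {
	if !w.started {
		w.started = true
		close(w.startCh)
	}
}

func (w *startNotifyingWriter) WriteHeader(code int) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.timedOut {
		return
	}
	w.markStarted()
	w.ResponseWriter.WriteHeader(code)
}

func (w *startNotifyingWriter) Write(p []byte) (int, error) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if w.timedOut {
		return 0, http.ErrHandlerTimeout
	}
	w.markStarted()
	return w.ResponseWriter.Write(p)
}

// withResponseStartTimeout fails the request if h has not begun responding
// within d, independently of the overall request timeout.
func withResponseStartTimeout(h http.Handler, d time.Duration) http.Handler {
	return http.HandlerFunc(func(rw http.ResponseWriter, r *http.Request) {
		ctx, cancel := context.WithCancel(r.Context())
		defer cancel()

		w := &startNotifyingWriter{ResponseWriter: rw, startCh: make(chan struct{})}
		done := make(chan struct{})
		go func() {
			h.ServeHTTP(w, r.WithContext(ctx))
			close(done)
		}()

		select {
		case <-w.startCh: // response started in time; let the handler finish
			<-done
		case <-done: // handler returned
		case <-time.After(d): // no first byte before the deadline
			w.mu.Lock()
			w.timedOut = true
			w.mu.Unlock()
			cancel() // tears down the proxied request: "context canceled"
			http.Error(rw, "activator request timeout", http.StatusGatewayTimeout)
		}
	})
}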

@nkreiger
Author

After extending:

https://knative.dev/docs/serving/configuration/config-defaults/#revision-response-start-timeout-seconds

it no longer timed out.

I've now recorded 4 possible places that can cause a timeout:

  1. Client side: the actual serverless function server (the go-sdk had a default timeout of 10 minutes, now configurable in the latest version)

  2. Trigger side (delivery.timeout)

  3. revision-timeout-seconds and max-revision-timeout-seconds -- these will cause a timeout in the queue-proxy

  4. revision-response-start-timeout-seconds -- this will cause a timeout in the activator

This can be closed if this is expected behavior. Maybe a documentation enhancement is needed, because it doesn't matter how high you set revision-timeout-seconds and max-revision-timeout-seconds if you haven't also increased revision-response-start-timeout-seconds.

@CarlosDomingues

I'm experiencing this very issue when upgrading to Knative 1.14.0 from an older version (1.3.2).

@nkreiger I've followed your suggestion to set revision-response-start-timeout-seconds to a higher value, but unfortunately that did not work for me. I'm still getting 5 min timeouts along with the activator request timeout error on the client side.

My configs, for testing (I'm using the Knative Operator):

apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  config:
    defaults:
      enable-service-links: 'false'
      max-revision-timeout-seconds: '3600'
      revision-response-start-timeout-seconds: '3600'
      revision-timeout-seconds: '3600'
    features:
      kubernetes.podspec-init-containers: enabled
      kubernetes.podspec-persistent-volume-claim: enabled
      kubernetes.podspec-persistent-volume-write: enabled

@nkreiger
Author

@CarlosDomingues did you confirm that the activator restarted?

I usually check that, and do a kubectl describe... on the deployment or Knative Service pod after applying it, to make sure it is now mapping in the appropriate timeout values.

You can also check the Knative Services / set the value directly on the Service to confirm that it has been mapped appropriately. My guess is it didn't kick in yet. For example:

responseStartTimeoutSeconds: 18500
serviceAccountName: github
timeoutSeconds: 18500

@CarlosDomingues

CarlosDomingues commented Jun 26, 2024

@nkreiger Thanks for the help!

After seeing your comment on Slack I recreated the whole cluster / reinstalled Knative just in case, but unfortunately that did not work.

I think the next step is trying to check the service itself.

@nkreiger
Author

> @nkreiger Thanks for the help!
>
> After seeing your comment on Slack I recreated the whole cluster / reinstalled Knative just in case, but unfortunately that did not work.
>
> I think the next step is trying to check the service itself.

Yes, I would check that. For me, the defaults didn't kick in -- I had to explicitly set it on the service in some namespaces. (still investigating why it worked for others...)

@CarlosDomingues

> I had to explicitly set it on the service in some namespaces

That worked for me. Thank you so much for sharing your solution!

For reference, my services now look like this:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: my-project
  namespace: my-namespace
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/scaleDownDelay: "5m"
    spec:
      <...>
      timeoutSeconds: 3600
      responseStartTimeoutSeconds: 3600 # < --- Relevant config


This issue is stale because it has been open for 90 days with no activity. It will automatically close after 30 more days of inactivity. Reopen the issue with /reopen. Mark the issue as fresh by adding the comment /remove-lifecycle stale.

@github-actions bot added the lifecycle/stale label on Sep 26, 2024
@github-actions bot closed this as not planned on Oct 27, 2024
@nkreiger
Author

nkreiger commented Dec 11, 2024

Re-opening this issue. I'm having issues with context canceled. Same as before, but now the timeout is, I believe, 30 minutes.

~ » k get ksvc plugin-jira -n test -o yaml | grep responseStart
      responseStartTimeoutSeconds: 18500
~ » k get ksvc plugin-jira -n test -o yaml | grep timeout      
      timeoutSeconds: 18500
~ » k get cm -n knative-serving config-defaults -o yaml
apiVersion: v1
data:
  enable-service-links: "false"
  max-revision-timeout-seconds: "604800"
  revision-idle-timeout-seconds: "0"
  revision-response-start-timeout-seconds: "18500"
  revision-timeout-seconds: "18500"

There is no go-sdk timeout -- I'm the one who made that PR into that project :) -- and I tested locally to verify the context canceled is not coming from there.

Activator Error

{"severity":"ERROR","timestamp":"2024-12-11T02:00:21.56191309Z","logger":"activator","caller":"network/error_handler.go:33","message":"error reverse proxying request; sockstat: sockets: used 555\nTCP: inuse 17 orphan 4 tw 52 alloc 1209 mem 137\nUDP: inuse 0 mem 0\nUDPLITE: inuse 0\nRAW: inuse 0\nFRAG: inuse 0 memory 0\n","commit":"038a181","knative.dev/controller":"activator","knative.dev/pod":"activator-7fd4c45679-9mrhf","knative.dev/key":"test/plugin-github-actions-00032","error":"context canceled","stacktrace":"knative.dev/pkg/network.ErrorHandler.func1\n\tknative.dev/[email protected]/network/error_handler.go:33\nknative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest.func1\n\tknative.dev/serving/pkg/activator/handler/handler.go:141\nnet/http/httputil.(*ReverseProxy).ServeHTTP\n\tnet/http/httputil/reverseproxy.go:472\nknative.dev/serving/pkg/activator/handler.(*activationHandler).proxyRequest\n\tknative.dev/serving/pkg/activator/handler/handler.go:144\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP.func1\n\tknative.dev/serving/pkg/activator/handler/handler.go:97\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try.func1\n\tknative.dev/serving/pkg/activator/net/throttler.go:237\nknative.dev/serving/pkg/activator/net.(*infiniteBreaker).Maybe\n\tknative.dev/serving/pkg/activator/net/throttler.go:754\nknative.dev/serving/pkg/activator/net.(*revisionThrottler).try\n\tknative.dev/serving/pkg/activator/net/throttler.go:227\nknative.dev/serving/pkg/activator/net.(*Throttler).Try\n\tknative.dev/serving/pkg/activator/net/throttler.go:522\nknative.dev/serving/pkg/activator/handler.(*activationHandler).ServeHTTP\n\tknative.dev/serving/pkg/activator/handler/handler.go:90\nknative.dev/serving/pkg/http/handler.(*timeoutHandler).ServeHTTP.func4\n\tknative.dev/serving/pkg/http/handler/timeout.go:118"}

I didn't recently upgrade, but this is the first time we've had 30+ minute processes.

Trigger Timeout:

spec:
  broker: default
  delivery:
    timeout: PT1H
  filter:
    attributes:
      source: ci.commithistory
      type: occurrence
  subscriber:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: plugin-jira
      namespace: test

Activator deployment:

      serviceAccount: activator
      serviceAccountName: activator
      terminationGracePeriodSeconds: 604800

@nkreiger
Author

nkreiger commented Dec 11, 2024

Sorry to bother you @skonto, could this issue be re-opened? I believe I've covered every area that could justify a timeout. Looking to see if anything is missing here.
