
Feature Request: Consider timeoutSeconds for queue-proxy graceful shutdown #13953

Closed
kahirokunn opened this issue May 6, 2023 · 5 comments
Labels
kind/feature Well-understood/specified features, ready for coding.

Comments

@kahirokunn
Member

Describe the feature

In Knative Service, the timeoutSeconds can be set to, for example, 10 minutes.

TimeoutSeconds *int64 `json:"timeoutSeconds,omitempty"`
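
For example, a Service could ask for a 10-minute timeout like this (an illustrative sketch; the name and image below are placeholders):

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: long-requests                          # placeholder name
spec:
  template:
    spec:
      timeoutSeconds: 600                      # 10 minutes
      containers:
        - image: example.com/long-running-app  # placeholder image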

However, when the queue-proxy receives a SIGTERM, it terminates after 30 seconds regardless of the timeoutSeconds setting.

drainSleepDuration = 30 * time.Second

Additionally, the held TCP connections are forcibly closed at this time.

With this behavior, Knative Serving effectively requires every connection to complete within 30 seconds for the shutdown to be graceful.
I'd like to enable more applications to run on Knative Serving by addressing this issue.

@kahirokunn kahirokunn added the kind/feature Well-understood/specified features, ready for coding. label May 6, 2023
@dprotaso
Member

dprotaso commented May 6, 2023

Additionally, the held TCP connections are forcibly closed at this time.

This should be fixed in v1.10.0 and v1.9.3 (#13815) - I realize it's missing in the release notes which I've just updated

However, when the queue-proxy receives a SIGTERM, it terminates after 30 seconds regardless of the timeoutSeconds setting.

Technically we drain for 30 seconds, but we reset that timer whenever a new request arrives. The maximum we will drain for is the revision's timeout - we do this by setting the Pod's termination grace period like so:

pod.TerminationGracePeriodSeconds = rev.Spec.TimeoutSeconds

So if there isn't a new request for 30 seconds, we'll shut down gracefully.
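
Roughly, the drain behavior follows the pattern below. This is an illustrative sketch only, not the actual queue-proxy source; it assumes a plain net/http server, and the port and variable names are made up:

package main

import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

// quietPeriod mirrors the 30s drainSleepDuration mentioned above.
const quietPeriod = 30 * time.Second

func main() {
	var (
		mu       sync.Mutex
		lastSeen = time.Now()
	)

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Every incoming request pushes the drain deadline out again.
		mu.Lock()
		lastSeen = time.Now()
		mu.Unlock()
		w.Write([]byte("ok"))
	})

	srv := &http.Server{Addr: ":8080", Handler: handler}
	go srv.ListenAndServe()

	// On SIGTERM, keep serving until no request has been seen for quietPeriod.
	// Kubernetes bounds the whole drain with the pod's
	// terminationGracePeriodSeconds, which Knative sets from the revision's
	// timeoutSeconds.
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGTERM)
	<-sig

	for {
		mu.Lock()
		idle := time.Since(lastSeen)
		mu.Unlock()
		if idle >= quietPeriod {
			break
		}
		time.Sleep(quietPeriod - idle)
	}
	_ = srv.Shutdown(context.Background())
}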

Are you seeing something different?

@kahirokunn
Member Author

kahirokunn commented May 8, 2023

@dprotaso I am currently using v1.9.3 because Knative v1.10.0 has TLS problems.

This should be fixed in v1.10.0 and v1.9.3 (#13815) - I realize it's missing in the release notes which I've just updated

I rechecked with v1.9.3 and the problem has not been resolved.

Reproduction procedure

  1. curl https://hello.xxxx
  2. kubectl delete po hello-xxxx
  3. The queue-proxy will exit in 30 seconds.

Complete source code used for reproduction:
http-hello-hard-limit.zip

const http = require("http");
const { setTimeout } = require("timers/promises");

const t = process.env.SLEEP && Number(process.env.SLEEP);
const port = process.env.PORT || 8080;
const msg = `Hello World ${process.env.SERVING_POD}`;

console.log(`
SLEEP: ${process.env.SLEEP}
SERVING_POD: ${process.env.SERVING_POD}
PORT: ${port}
msg: ${msg}
`);

let count = 0;
const server = http
  .createServer(async (req, res) => {
    console.log("received request");
    count++;
    if (t) {
      await setTimeout(t);
    }

    res.write(`${msg}, concurrency: ${count}`);
    res.end();
    count--;
  })
  .listen(port);

The Knative Service manifest used:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: hello
  namespace: default
spec:
  template:
    metadata:
    spec:
      timeoutSeconds: 65
      containers:
      - env:
        - name: SLEEP
          value: "60000"
        - name: SERVING_POD
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        image: kahiro/hello-world-node:latest
        name: hello-container
        ports:
        - containerPort: 8080
          protocol: TCP
  traffic:
  - latestRevision: true
    percent: 100

The content is simple: each HTTP request sleeps for 60 seconds and then returns.
timeoutSeconds is 65 seconds.
However, the queue-proxy still terminates after 30 seconds.

@dprotaso
Member

dprotaso commented May 8, 2023

I'm having trouble reproducing what you're describing with 1.9.3 and 1.10.

  1. I deploy your server and wait for it to scale down.
  2. I make a request (curl) and wait to see the pod running.
  3. While the request is waiting, I delete the pod: kubectl delete pod [podname]
  4. Wait.
  5. I see a new pod come up.
  6. I see the old pod's queue-proxy readiness probe fail (which is expected).
  7. I see the request continue to hang (as expected).
  8. I see the response from the server.
  9. I see the queue-proxy exit.
  10. I see the user-container hang (there's no graceful shutdown).
  11. I see Kubernetes kill the user-container.

Note: the Node code doesn't shut down gracefully - you can accomplish that by adding the code below, which will shut down the server on a TERM signal:

process.on('SIGTERM', () => {
  console.info('SIGTERM signal received.');
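  // close() stops accepting new connections but lets in-flight requests
  // finish, so a long response that is already underway can still complete.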
  server.close(() => {
    console.log('Http server closed.');
  });
});

@dprotaso
Member

dprotaso commented May 8, 2023

I tried bumping the timeout to 2 minutes and I see the queue-proxy remain alive for that long:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld
  namespace: default
spec:
  template:
    spec:
      timeoutSeconds: 130
      containers:
        - image: dprotaso/hello-world-node:latest # handles SIGTERM
          env:
            - name: TARGET
              value: "Go Sample v1"
            - name: SLEEP
              value: "120000"

@kahirokunn
Member Author

@dprotaso
You were right.
My environment was 1.9.2, not 1.9.3.
I updated it and, sure enough, it works as you described.
Sorry. Thx. 🙏
