
try to start shutdown without always waiting for a specific time #2365

Closed
wants to merge 5 commits into from

Conversation

lvjing2
Contributor

@lvjing2 lvjing2 commented Oct 31, 2018

Fixes #
Currently, the queue-proxy and user-container have to wait for a fixed 20s timeout before starting to terminate.

Proposed Changes

Check the concurrency of the user-container; if there are no requests, start terminating, so it needn't wait for a fixed 20 seconds.
This can help get rid of the quitSleepSecs setting in the future: we can just check the concurrency number to decide when to start terminating, because each request has its own timeout setting.

@knative-prow-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lvjing2
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: mdemirhan

If they are not already assigned, you can assign the PR to them by writing /assign @mdemirhan in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 31, 2018
Contributor

@knative-prow-robot knative-prow-robot left a comment


@lvjing2: 1 warning.

In response to this:

Fixes #
Currently, the queue-proxy and user-container have to wait until a timeout before terminating.

Proposed Changes

Check the concurrency of the user-container; if there are no requests, start terminating, so it needn't wait for a fixed 20 seconds.
This can help get rid of the quitSleepSecs setting in the future: we can just check the concurrency number to decide when to start terminating, because each request has its own timeout setting.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@@ -129,3 +129,7 @@ func NewStats(podName string, channels Channels, startedAt time.Time) *Stats {

return s
}

func (s *Stats) GetConcurrency() int32 {
Contributor


Golint comments: exported method Stats.GetConcurrency should have comment or be unexported. More info.

@knative-prow-robot knative-prow-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 31, 2018
@tcnghia
Contributor

tcnghia commented Oct 31, 2018

/ok-to-test

@knative-prow-robot knative-prow-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 31, 2018
@markusthoemmes
Contributor

How does this relate to the graceful shutdown of the HTTP server itself?

https://github.com/knative/serving/blob/master/cmd/queue/main.go#L281-L284 these lines read like exactly what we need here.

CheckAllDone:
	for {
		select {
		case <-timer:
Member


I believe you can just use <-time.After(quitSleepSecs * time.Second)

Contributor Author


This is nice code, thanks.

@lvjing2
Contributor Author

lvjing2 commented Nov 2, 2018

How does this relate to the graceful shutdown of the HTTP server itself?

https://github.com/knative/serving/blob/master/cmd/queue/main.go#L281-L284 these lines read like exactly what we need here.

Hi @markusthoemmes, from my test the queue-proxy still terminates in a short time once it receives SIGTERM, even when there are active connections.
The test steps:

  1. Modify the terminating goroutine:
go func() {
		<-sigTermChan
		// Calling server.Shutdown() allows pending requests to
		// complete, while no new work is accepted.

		logger.Info("start to terminating the queue-proxy")
		server.Shutdown(context.Background())
		adminServer.Shutdown(context.Background())

		if statSink != nil {
			statSink.Close()
		}

		logger.Info("end to terminating the queue-proxy")
		os.Exit(0)
	}()
  2. Send a request to the pod; the request then sleeps for a long time.
  3. Check the log of the queue-proxy several times and find the last log before it terminated:
{"level":"info","ts":"2018-11-02T03:10:04.591Z","logger":"queueproxy","caller":"queue/main.go:301","msg":"start to terminating the queue-proxy","knative.dev/key":"knative-serving/helloworld-go-00001","knative.dev/pod":"helloworld-go-00001-deployment-586997866c-wqt4d"}
2018/11/02 03:10:04 http: proxy error: context canceled
{"level":"info","ts":"2018-11-02T03:10:05.091Z","logger":"queueproxy","caller":"queue/main.go:309","msg":"end to terminating the queue-proxy","knative.dev/key":"knative-serving/helloworld-go-00001","knative.dev/pod":"helloworld-go-00001-deployment-586997866c-wqt4d"}

From the log in step 3, the queue-proxy terminates immediately and the connection is canceled, so we can't rely on the graceful shutdown of the Go HTTP server.

After #2357, the istio-proxy, queue-proxy, and user-container all wait for 20s. But once the 20s pass, the termination order of the three containers is still istio-proxy -> queue-proxy -> user-container in most cases, so a request remaining in the user-container will never return to the user, and Istio will then try to reroute that request to another alive pod.

This PR may not resolve all of the graceful-shutdown problems described above; it only tries to terminate the pod without always waiting 20s. I will change the title of this PR, thanks for pointing this out.

@lvjing2 lvjing2 changed the title try gracefully shutdown without waiting for a specific time try to start shutdown without always waiting for a specific time Nov 2, 2018
@lvjing2
Contributor Author

lvjing2 commented Nov 2, 2018

/hold

@knative-prow-robot knative-prow-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 2, 2018
@lvjing2
Contributor Author

lvjing2 commented Nov 2, 2018

/hold cancel

@knative-metrics-robot

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/queue/stats.go 100.0% 96.9% -3.1

@lvjing2
Contributor Author

lvjing2 commented Nov 2, 2018

/hold cancel

@knative-prow-robot knative-prow-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 2, 2018
@markusthoemmes
Contributor

Still, I'd like to further clarify why the golang http server does not behave as advertised! We might have another issue here. I'm digging into this right now.

@lvjing2
Contributor Author

lvjing2 commented Nov 2, 2018

@markusthoemmes Yes, I am also very curious about why it doesn't behave well. I'd be glad to help with anything I can.

@markusthoemmes
Contributor

So I've assembled a small example that works just as I'd expect locally:

package main

import (
	"context"
	"fmt"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"

	"github.com/knative/serving/pkg/http/h2c"
)

func handler(w http.ResponseWriter, r *http.Request) {
	time.Sleep(30 * time.Second)
	fmt.Fprint(w, "Hi there")
}

func main() {
	server := h2c.NewServer(
		fmt.Sprintf(":%d", 8080),
		http.HandlerFunc(handler),
	)

	go server.ListenAndServe()

	sigTermChan := make(chan os.Signal, 1) // buffered, so signal.Notify cannot drop the signal
	signal.Notify(sigTermChan, syscall.SIGTERM)
	<-sigTermChan

	fmt.Println("received SIGTERM signal, shutting down connections")
	// Calling server.Shutdown() allows pending requests to
	// complete, while no new work is accepted.
	server.Shutdown(context.Background())
	fmt.Println("connections have been shut down")
}

If you run this and kill -TERM the process, the connection is left alone until it actually shuts down. I'm wondering why that isn't working in this case; maybe the istio proxy is shutting down prematurely, causing the connection to drop immediately?

@lvjing2
Contributor Author

lvjing2 commented Nov 4, 2018

@markusthoemmes Hi, I changed quitSleepSecs to 0 together with #2357; what's more, I also got rid of the pid=1 problem described as problem 1 in #2388. In this case, the queue-proxy and user-container receive the SIGTERM at almost the same time, before the istio-proxy. From my test, the queue-proxy still terminated while there were connections to the user-container.

From your test, the Go HTTP server behaves well because it only handles its own requests. But the queue-proxy has to handle the proxied connection to the user-container, which is why this problem happens. It can also be tested locally.

@lvjing2
Contributor Author

lvjing2 commented Nov 5, 2018

@markusthoemmes Hi, I have split the problems we described into #2388 (I named it Make sure one request should only be executed once because the problems we discussed here would cause one request to be routed to more than one pod and executed more than once). I suggest we continue the discussion in the new issue, and for this PR come back to its original topic, start shutdown without always waiting for a specific time. What do you think?

@markusthoemmes
Contributor

@lvjing2 fair point, I'll give a proxying server a shot as well.

@markusthoemmes
Contributor

Using this minimal proxy in conjunction with the code above works just as expected as well:

package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httputil"
	"net/url"
	"os"
	"os/signal"
	"syscall"

	"github.com/knative/serving/pkg/http/h2c"
)

var target *url.URL
var proxy *httputil.ReverseProxy

func handler(w http.ResponseWriter, r *http.Request) {
	fmt.Println("Proxying request")
	proxy.ServeHTTP(w, r)
}

func main() {
	target, _ = url.Parse("http://localhost:8080")
	proxy = httputil.NewSingleHostReverseProxy(target)

	server := h2c.NewServer(
		fmt.Sprintf(":%d", 9090),
		http.HandlerFunc(handler),
	)

	go server.ListenAndServe()

	sigTermChan := make(chan os.Signal, 1) // buffered, so signal.Notify cannot drop the signal
	signal.Notify(sigTermChan, syscall.SIGTERM)
	<-sigTermChan

	fmt.Println("received SIGTERM signal, shutting down connections")
	// Calling server.Shutdown() allows pending requests to
	// complete, while no new work is accepted.
	server.Shutdown(context.Background())
	fmt.Println("connections have been shut down")
}

Contributor

@markusthoemmes markusthoemmes left a comment


I think this is a red herring we're following here. The golang servers are handling graceful shutdown already and the timeout you're eliminating needs to stay in place (for now) I think to prevent intermittent failures.

			break CheckAllDone
		}
	}
}
Contributor


We shouldn't do this here I think. The reason for the "arbitrary" sleep is removal of the pod from the service, not graceful shutdown of the server. In theory, we keep accepting requests for a short amount of time to prevent dropped requests in the time-period of shutting down and actually being removed from the (Virtual)-Service.

@lvjing2
Contributor Author

lvjing2 commented Nov 5, 2018

Thanks for the help from @markusthoemmes. It is necessary to sleep for some time, so this PR isn't needed and I will close it.
