
try to start shutdown without always waiting for a specific time #2365

Closed
wants to merge 5 commits into from

Conversation

lvjing2
Contributor

@lvjing2 lvjing2 commented Oct 31, 2018

Fixes #
Currently, the queue-proxy and user-container have to wait for a fixed 20s timeout before starting to terminate.

Proposed Changes

Check the concurrency of the user-container; if there are no requests, start terminating, so it needn't wait for a fixed 20 seconds.
This can help get rid of the quitSleepSecs setting in the future: we can just check the concurrency number to decide when to start terminating, because each request has its own timeout setting.

@knative-prow-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lvjing2
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: mdemirhan

If they are not already assigned, you can assign the PR to them by writing /assign @mdemirhan in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Oct 31, 2018
Contributor

@knative-prow-robot knative-prow-robot left a comment


@lvjing2: 1 warning.

In response to this:

Fixes #
Currently, the queue-proxy and user-container have to wait until a timeout before terminating.

Proposed Changes

Check the concurrency of the user-container; if there are no requests, start terminating, so it needn't wait for a fixed 20 seconds.
This can help get rid of the quitSleepSecs setting in the future: we can just check the concurrency number to decide when to start terminating, because each request has its own timeout setting.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@@ -129,3 +129,7 @@ func NewStats(podName string, channels Channels, startedAt time.Time) *Stats {

return s
}

func (s *Stats) GetConcurrency() int32 {
Contributor


Golint comments: exported method Stats.GetConcurrency should have comment or be unexported. More info.

@knative-prow-robot knative-prow-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 31, 2018
@tcnghia
Contributor

tcnghia commented Oct 31, 2018

/ok-to-test

@knative-prow-robot knative-prow-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 31, 2018
@markusthoemmes
Contributor

How does this relate to the graceful shutdown of the HTTP server itself?

https://github.com/knative/serving/blob/master/cmd/queue/main.go#L281-L284 these lines read like exactly what we need here.

CheckAllDone:
	for {
		select {
		case <-timer:
Member


I believe you can just use <-time.After(quitSleepSecs * time.Second)

Contributor Author


This is nice code, thanks.

@lvjing2
Contributor Author

lvjing2 commented Nov 2, 2018

How does this relate to the graceful shutdown of the HTTP server itself?

https://github.com/knative/serving/blob/master/cmd/queue/main.go#L281-L284 these lines read like exactly what we need here.

Hi @markusthoemmes, from my test the queue-proxy still terminates in a short time once it receives SIGTERM, even when there are active connections.
The test steps:

  1. Modify the terminating goroutine:
go func() {
		<-sigTermChan
		// Calling server.Shutdown() allows pending requests to
		// complete, while no new work is accepted.

		logger.Info("start to terminating the queue-proxy")
		server.Shutdown(context.Background())
		adminServer.Shutdown(context.Background())

		if statSink != nil {
			statSink.Close()
		}

		logger.Info("end to terminating the queue-proxy")
		os.Exit(0)
	}()
  2. Send a request to the pod; the request then sleeps for a long time.
  3. Check the log of the queue-proxy several times and find the last log before it terminated:
{"level":"info","ts":"2018-11-02T03:10:04.591Z","logger":"queueproxy","caller":"queue/main.go:301","msg":"start to terminating the queue-proxy","knative.dev/key":"knative-serving/helloworld-go-00001","knative.dev/pod":"helloworld-go-00001-deployment-586997866c-wqt4d"}
2018/11/02 03:10:04 http: proxy error: context canceled
{"level":"info","ts":"2018-11-02T03:10:05.091Z","logger":"queueproxy","caller":"queue/main.go:309","msg":"end to terminating the queue-proxy","knative.dev/key":"knative-serving/helloworld-go-00001","knative.dev/pod":"helloworld-go-00001-deployment-586997866c-wqt4d"}

From the log in step 3, the queue-proxy terminates immediately and the connection is canceled, so we can't rely on the graceful shutdown of the Go HTTP server.

After #2357, the istio-proxy, queue-proxy, and user-container all wait for 20s. But once the 20s pass, the termination order of the three containers is still istio-proxy -> queue-proxy -> user-container in most cases, so a request remaining in the user-container will never return to the user, and Istio will then try to reroute that request to another alive pod.

This PR may not resolve all of the graceful-shutdown problems described above; it only tries to terminate the pod without always waiting 20s. I will change the title of this PR, thanks for pointing this out.

@lvjing2 lvjing2 changed the title try gracefully shutdown without waiting for a specific time try to start shutdown without always waiting for a specific time Nov 2, 2018
@lvjing2
Contributor Author

lvjing2 commented Nov 2, 2018

/hold

@knative-prow-robot knative-prow-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 2, 2018
@lvjing2
Contributor Author

lvjing2 commented Nov 2, 2018

/hold cancel

@knative-metrics-robot

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/queue/stats.go 100.0% 96.9% -3.1

@lvjing2
Contributor Author

lvjing2 commented Nov 2, 2018

/hold cancel

@knative-prow-robot knative-prow-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 2, 2018
@markusthoemmes
Contributor

Still, I'd like to further clarify why the golang http server does not behave as advertised! We might have another issue here. I'm digging into this right now.

@lvjing2
Contributor Author

lvjing2 commented Nov 2, 2018

@markusthoemmes Yes, I am also very curious about why it doesn't behave well. I'd be glad to help with anything I can.

@markusthoemmes
Contributor

So I've assembled a small example that works just as I'd expect locally:

package main

import (
	"context"
	"fmt"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"

	"github.com/knative/serving/pkg/http/h2c"
)

func handler(w http.ResponseWriter, r *http.Request) {
	time.Sleep(30 * time.Second)
	fmt.Fprint(w, "Hi there")
}

func main() {
	server := h2c.NewServer(
		fmt.Sprintf(":%d", 8080),
		http.HandlerFunc(handler),
	)

	go server.ListenAndServe()

	sigTermChan := make(chan os.Signal, 1) // buffered, so signal.Notify cannot drop the signal
	signal.Notify(sigTermChan, syscall.SIGTERM)
	<-sigTermChan

	fmt.Println("received SIGTERM signal, shutting down connections")
	// Calling server.Shutdown() allows pending requests to
	// complete, while no new work is accepted.
	server.Shutdown(context.Background())
	fmt.Println("connections have been shut down")
}

If you run this and kill -TERM the process, the connection is left alone until it actually shuts down. I'm wondering why that isn't working in this case; maybe the istio proxy is shutting down prematurely, causing the connection to drop immediately?

@lvjing2
Contributor Author

lvjing2 commented Nov 4, 2018

@markusthoemmes Hi, I changed quitSleepSecs to 0 together with #2357; what's more, I also got rid of the pid=1 problem described as problem 1 in #2388. In this case, the queue-proxy and user-container receive the SIGTERM at almost the same time, before the istio-proxy. From my test, the queue-proxy still terminated while there were connections to the user-container.

From your test, the Go HTTP server behaves well because it only handles its own requests. But the queue-proxy has to handle the proxied connection to the user-container, which is why this problem happens. It can also be tested locally.

@lvjing2
Contributor Author

lvjing2 commented Nov 5, 2018

@markusthoemmes Hi, I have split the problems we described into #2388 (I named it Make sure one request should only be executed once because the problems we discussed here would cause one request to be routed to more than one pod and executed more than once). I suggest we continue the discussion in the new issue, and for this PR come back to its original topic, start shutdown without always waiting for a specific time. What do you think?

@markusthoemmes
Contributor

@lvjing2 fair point, I'll give a proxying server a shot as well.

@markusthoemmes
Contributor

Using this minimal proxy in conjunction with the code above works just as expected as well:

package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httputil"
	"net/url"
	"os"
	"os/signal"
	"syscall"

	"github.com/knative/serving/pkg/http/h2c"
)

var target *url.URL
var proxy *httputil.ReverseProxy

func handler(w http.ResponseWriter, r *http.Request) {
	fmt.Println("Proxying request")
	proxy.ServeHTTP(w, r)
}

func main() {
	target, _ = url.Parse("http://localhost:8080")
	proxy = httputil.NewSingleHostReverseProxy(target)

	server := h2c.NewServer(
		fmt.Sprintf(":%d", 9090),
		http.HandlerFunc(handler),
	)

	go server.ListenAndServe()

	sigTermChan := make(chan os.Signal, 1) // buffered, so signal.Notify cannot drop the signal
	signal.Notify(sigTermChan, syscall.SIGTERM)
	<-sigTermChan

	fmt.Println("received SIGTERM signal, shutting down connections")
	// Calling server.Shutdown() allows pending requests to
	// complete, while no new work is accepted.
	server.Shutdown(context.Background())
	fmt.Println("connections have been shut down")
}

Contributor

@markusthoemmes markusthoemmes left a comment


I think this is a red herring we're following here. The golang servers are handling graceful shutdown already and the timeout you're eliminating needs to stay in place (for now) I think to prevent intermittent failures.

			break CheckAllDone
		}
	}
}
Contributor


We shouldn't do this here I think. The reason for the "arbitrary" sleep is removal of the pod from the service, not graceful shutdown of the server. In theory, we keep accepting requests for a short amount of time to prevent dropped requests in the time-period of shutting down and actually being removed from the (Virtual)-Service.

@lvjing2
Contributor Author

lvjing2 commented Nov 5, 2018

Thanks for the help from @markusthoemmes. It is necessary to sleep for some time, so this PR isn't needed and I will close it.
