try to start shutdown without always waiting for a specific time #2365
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: lvjing2. If they are not already assigned, you can assign the PR to them by writing the assign command in a comment. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing the approve command in a comment.
@lvjing2: 1 warning.
In response to this:
Fixes #
Currently, the queue-proxy and user-container have to wait for the full timeout before terminating.
Proposed Changes
Check the concurrency of the user-container; if there are no requests in flight, start terminating immediately, so it doesn't have to wait for a fixed 20 seconds.
This can help to get rid of setting quitSleepSecs.
In the future, we can just check the concurrency number to decide when to start terminating, because each request has its own timeout setting.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
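For illustration only, here is a minimal sketch of the proposal quoted above, assuming a stand-in stats type with an atomic counter (the real Stats type in pkg/queue is different): poll the in-flight request count and start terminating as soon as it reaches zero, keeping the fixed wait only as an upper bound.

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// stats is a stand-in for the queue-proxy's Stats type; the real type tracks
// requests in flight per pod.
type stats struct {
	concurrency int32
}

// GetConcurrency returns the number of requests currently in flight.
func (s *stats) GetConcurrency() int32 {
	return atomic.LoadInt32(&s.concurrency)
}

func main() {
	s := &stats{}
	deadline := time.After(20 * time.Second) // fall back to the old fixed wait

	// Poll the in-flight request count and start terminating as soon as it
	// drops to zero, instead of always sleeping for the full grace period.
	for {
		select {
		case <-deadline:
			fmt.Println("grace period elapsed, terminating anyway")
			return
		default:
			if s.GetConcurrency() == 0 {
				fmt.Println("no requests in flight, terminating now")
				return
			}
			time.Sleep(100 * time.Millisecond)
		}
	}
}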
@@ -129,3 +129,7 @@ func NewStats(podName string, channels Channels, startedAt time.Time) *Stats {
	return s
}

func (s *Stats) GetConcurrency() int32 {
Golint comments: exported method Stats.GetConcurrency should have comment or be unexported. More info.
/ok-to-test
How does this relate to the graceful shutdown of the HTTP server itself? https://github.com/knative/serving/blob/master/cmd/queue/main.go#L281-L284 These lines read like exactly what we need here.
cmd/queue/main.go
Outdated
CheckAllDone:
	for {
		select {
		case <-timer:
I believe you can just use <-time.After(quitSleepSecs * time.Second)
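For illustration, the suggestion boils down to replacing the labeled for/select loop with a single blocking receive; a minimal runnable sketch, assuming quitSleepSecs is an untyped integer constant:

package main

import (
	"fmt"
	"time"
)

const quitSleepSecs = 20

func main() {
	// Block until the grace period has elapsed, then continue with shutdown.
	<-time.After(quitSleepSecs * time.Second)
	fmt.Println("grace period elapsed, continuing with shutdown")
}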
This is nice code, thanks.
Hi @markusthoemmes From my test, the queue-proxy will still terminate in a short time once it receives the SIGTERM, even when there are active connections.
go func() {
	<-sigTermChan
	// Calling server.Shutdown() allows pending requests to
	// complete, while no new work is accepted.
	logger.Info("start to terminating the queue-proxy")
	server.Shutdown(context.Background())
	adminServer.Shutdown(context.Background())
	if statSink != nil {
		statSink.Close()
	}
	logger.Info("end to terminating the queue-proxy")
	os.Exit(0)
}()
From the log of step 3, the queue-proxy terminates immediately and the connection is canceled, so we can't rely on the graceful shutdown of the Go HTTP server. After #2357, the istio-proxy, queue-proxy, and user-container will all wait for 20s, but once the 20s have passed, the terminating order of these three containers is still istio-proxy -> queue-proxy -> user-container in most cases, so any request still in the user-container will never return to the user, and Istio will then try to reroute that request to another alive pod. This PR may not resolve all of the graceful-shutdown problems described above; it just tries to terminate the pod without waiting for 20s every time. I will change the title of this PR, thanks for pointing this out.
/hold
/hold cancel
The following is the coverage report on pkg/.
/hold cancel
Still I'd like to further clarify why the golang http server does not behave as advertised! We might have another issue here. I'm digging into this right now.
@markusthoemmes Yes, I am also very curious about why it doesn't behave well. I'd be glad to help with anything I can.
So I've assembled a small example that works just as I'd expect locally:

package main

import (
	"context"
	"fmt"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"

	"github.com/knative/serving/pkg/http/h2c"
)

func handler(w http.ResponseWriter, r *http.Request) {
	time.Sleep(30 * time.Second)
	fmt.Fprint(w, "Hi there")
}

func main() {
	server := h2c.NewServer(
		fmt.Sprintf(":%d", 8080),
		http.HandlerFunc(handler),
	)
	go server.ListenAndServe()

	sigTermChan := make(chan os.Signal)
	signal.Notify(sigTermChan, syscall.SIGTERM)
	<-sigTermChan

	fmt.Println("received SIGTERM signal, shutting down connections")
	// Calling server.Shutdown() allows pending requests to
	// complete, while no new work is accepted.
	server.Shutdown(context.Background())
	fmt.Println("connections have been shut down")
}

If you run this and
@markusthoemmes Hi, I changed quitSleepSecs to 0, and together with #2357 I also got rid of the problem. From your test, the Go HTTP server behaves well when it only handles its own requests, but the queue-proxy also needs to handle the proxied connection to the user-container, which is where this problem happens. It can also be tested locally.
@markusthoemmes Hi, I have split the problems we described into #2388 (the reason to name it as
@lvjing2 fair point, I'll give a proxying server a shot as well.
Using this minimal proxy in conjunction with the code above works just as expected as well:

package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httputil"
	"net/url"
	"os"
	"os/signal"
	"syscall"

	"github.com/knative/serving/pkg/http/h2c"
)

var target *url.URL
var proxy *httputil.ReverseProxy

func handler(w http.ResponseWriter, r *http.Request) {
	fmt.Println("Proxying request")
	proxy.ServeHTTP(w, r)
}

func main() {
	target, _ = url.Parse("http://localhost:8080")
	proxy = httputil.NewSingleHostReverseProxy(target)

	server := h2c.NewServer(
		fmt.Sprintf(":%d", 9090),
		http.HandlerFunc(handler),
	)
	go server.ListenAndServe()

	sigTermChan := make(chan os.Signal)
	signal.Notify(sigTermChan, syscall.SIGTERM)
	<-sigTermChan

	fmt.Println("received SIGTERM signal, shutting down connections")
	// Calling server.Shutdown() allows pending requests to
	// complete, while no new work is accepted.
	server.Shutdown(context.Background())
	fmt.Println("connections have been shut down")
}
I think this is a red herring we're following here. The golang servers are handling graceful shutdown already, and the timeout you're eliminating needs to stay in place (for now), I think, to prevent intermittent failures.
			break CheckAllDone
		}
	}
}
We shouldn't do this here I think. The reason for the "arbitrary" sleep is removal of the pod from the service, not graceful shutdown of the server. In theory, we keep accepting requests for a short amount of time to prevent dropped requests in the time-period of shutting down and actually being removed from the (Virtual)-Service.
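To make that concrete, here is a minimal sketch of the pattern being defended, assuming a plain net/http server and a hard-coded 20-second drain window (both assumptions, not the actual queue-proxy code): on SIGTERM, keep serving while the pod is being removed from the (Virtual)-Service, and only then shut down gracefully.

package main

import (
	"context"
	"fmt"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

const drainPeriod = 20 * time.Second // time for the endpoint removal to propagate

func main() {
	server := &http.Server{Addr: ":8080", Handler: http.DefaultServeMux}
	go server.ListenAndServe()

	sigTermChan := make(chan os.Signal, 1)
	signal.Notify(sigTermChan, syscall.SIGTERM)
	<-sigTermChan

	// Keep accepting requests while the pod is being removed from the
	// (Virtual)-Service; shutting down immediately would drop requests that
	// are still being routed here.
	fmt.Println("received SIGTERM, draining before shutdown")
	time.Sleep(drainPeriod)

	// Now let in-flight requests finish while refusing new ones.
	server.Shutdown(context.Background())
	fmt.Println("shutdown complete")
}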
Thanks for the help from @markusthoemmes. It is necessary to sleep for some time, so this PR isn't needed and I will close it.
Fixes #
Currently, the queue-proxy and user-container have to wait for a fixed 20s timeout before they start terminating.
Proposed Changes
Check the concurrency of the user-container; if there are no requests in flight, start terminating immediately, so it doesn't have to wait for a fixed 20 seconds.
This can help to get rid of setting quitSleepSecs.
In the future, we can just check the concurrency number to decide when to start terminating, because each request has its own timeout setting.