//knative/serving/test/e2e:TestAutoscaleUpDownUp is super flaky #2351

Closed

adrcunha opened this issue Oct 30, 2018 · 8 comments
@adrcunha

Expected Behavior

//knative/serving/test/e2e:TestAutoscaleUpDownUp flakiness is close to 0.

Actual Behavior

Over the last 16 CI runs, TestAutoscaleUpDownUp failed 10 times, or ~60% of the time. Of these 10 failures, 9 were caused by the following error (the "got/wanted" numbers change with each failure):

autoscale_test.go:212: Error during initial scale up: Error making requests for scale up. Got 160 successful requests. Wanted 162.

https://gubernator.knative.dev/build/knative-prow/logs/ci-knative-serving-continuous/1057301584175697921
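
For context, here is a minimal sketch (not the actual knative test code; the helper name and target URL are illustrative) of the kind of check that produces this error: fire a fixed number of concurrent requests at the revision's route and require that every one of them succeeds, so a single 503 from a draining pod is enough to fail the run.

```go
package e2e

import (
	"fmt"
	"net/http"
	"sync"
	"sync/atomic"
)

// assertAllRequestsSucceed is a hypothetical helper mirroring the test's
// assertion: send `want` concurrent requests and count how many return 200.
func assertAllRequestsSucceed(url string, want int) error {
	var got int64
	var wg sync.WaitGroup
	for i := 0; i < want; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			resp, err := http.Get(url)
			if err != nil {
				return // a transport error counts as a failed request
			}
			defer resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				atomic.AddInt64(&got, 1)
			}
		}()
	}
	wg.Wait()
	if int(got) != want {
		// Matches the shape of the reported failure message.
		return fmt.Errorf("Error making requests for scale up. Got %d successful requests. Wanted %d.", got, want)
	}
	return nil
}
```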

Steps to Reproduce the Problem

  1. Run the test a few times.

Additional Info

(attached screenshot not included)

@knative-prow-robot added the area/autoscale, area/test-and-release, and kind/bug labels on Oct 30, 2018
@mattmoor

/assign @dgerd

@josephburnett Dan volunteered to help a bit here.

@tcnghia commented Oct 31, 2018

The autoscaling issue seems to be caused by requests hitting Terminating pods. In those pods, istio-proxy usually dies first, causing the Terminating pod to reject all messages, which results in 503 upstream reset/disconnect-before-headers errors. We used to have a preStop sleep for this (tcnghia@20ca8e5#diff-d4961cee95b4d08627915faf1c62c4d3L1114), but later removed it because adding retries let us avoid it. However, we are seeing the same issue again now.

I think we should add back the preStop sleep to unblock PRs while working on a better fix.
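
For reference, a hedged sketch of what re-adding the preStop sleep could look like on the container spec (types from k8s.io/api/core/v1; the exact place the knative deployment builder sets this may differ, and the 20s value is illustrative, not the value from the original diff):

```go
package resources

import corev1 "k8s.io/api/core/v1"

// addPreStopSleep gives the container a grace window after SIGTERM so that
// in-flight requests can drain before istio-proxy and the container exit.
func addPreStopSleep(c *corev1.Container) {
	c.Lifecycle = &corev1.Lifecycle{
		PreStop: &corev1.LifecycleHandler{ // named corev1.Handler in older API versions
			Exec: &corev1.ExecAction{
				Command: []string{"/bin/sleep", "20"},
			},
		},
	}
}
```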

@tcnghia commented Oct 31, 2018

/assign @tcnghia

@josephburnett

One other action item might be to create another E2E test that just serves variable traffic over time. I think that would help narrow down the cause more quickly; up, down, and up is a lot to debug all at once. A rough sketch of what I mean is below.
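
A hypothetical outline of such a test (names and load levels are made up, not from the knative test suite): ramp concurrency up and back down, and record the error count at each level so a burst of 503s can be tied to a specific scaling transition.

```go
package e2e

import (
	"net/http"
	"sync"
	"time"
)

// runVariableTraffic drives the route at each concurrency level for `hold`
// and returns the number of failed (non-200 or errored) requests per level.
func runVariableTraffic(url string, levels []int, hold time.Duration) []int {
	failures := make([]int, len(levels))
	for i, concurrency := range levels {
		var mu sync.Mutex
		var wg sync.WaitGroup
		deadline := time.Now().Add(hold)
		for c := 0; c < concurrency; c++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				for time.Now().Before(deadline) {
					resp, err := http.Get(url)
					if err != nil || resp.StatusCode != http.StatusOK {
						mu.Lock()
						failures[i]++
						mu.Unlock()
					}
					if resp != nil {
						resp.Body.Close()
					}
				}
			}()
		}
		wg.Wait()
	}
	return failures
}

// Example usage: runVariableTraffic(routeURL, []int{1, 10, 50, 10, 1}, 30*time.Second)
```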

@tcnghia commented Oct 31, 2018

@dgerd has an excellent way to repro this consistently by killing the revision's Pod. I think we should add that test.
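
A hedged sketch of that repro as a test helper, assuming the revision's pods carry the serving.knative.dev/revision label and using a plain client-go clientset (illustrative only, not the exact steps @dgerd used): deleting the pods while traffic is flowing forces requests onto Terminating pods and reproduces the 503s described above.

```go
package e2e

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// killRevisionPods deletes the revision's pods so in-flight traffic briefly
// hits Terminating pods.
func killRevisionPods(ctx context.Context, kube kubernetes.Interface, namespace, revision string) error {
	pods, err := kube.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: fmt.Sprintf("serving.knative.dev/revision=%s", revision),
	})
	if err != nil {
		return err
	}
	for _, p := range pods.Items {
		if err := kube.CoreV1().Pods(namespace).Delete(ctx, p.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```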

@lvjing2 commented Oct 31, 2018

Thanks for your help; this also fixed part of the problems in #2311 and #2344.

@lvjing2 commented Nov 6, 2018

Hi, is this problem still there? If so, I'd like to try digging into it as well.

@adrcunha (author) commented Nov 6, 2018

Actually it's way better now: in the last 24 runs, it only failed once. I'm closing this issue, thanks everyone.

@adrcunha adrcunha closed this as completed Nov 6, 2018