//knative/serving/test/e2e:TestAutoscaleUpDownUp is super flaky #2351

Closed

adrcunha opened this issue Oct 30, 2018 · 8 comments
@adrcunha

Expected Behavior

//knative/serving/test/e2e:TestAutoscaleUpDownUp flakiness is close to 0.

Actual Behavior

Over the last 16 CI runs, TestAutoscaleUpDownUp failed 10 times, or ~60% of the time. Of these 10 failures, 9 were caused by the following error (the "got/wanted" numbers change with each failure):

autoscale_test.go:212: Error during initial scale up: Error making requests for scale up. Got 160 successful requests. Wanted 162.

https://gubernator.knative.dev/build/knative-prow/logs/ci-knative-serving-continuous/1057301584175697921
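
For context, here is a minimal sketch (not the actual knative test code; the helper name and target URL are illustrative) of the kind of check that produces this error: fire a fixed number of concurrent requests at the revision's route and require that every one of them succeeds, so a single 503 from a draining pod is enough to fail the run.

```go
package e2e

import (
	"fmt"
	"net/http"
	"sync"
	"sync/atomic"
)

// assertAllRequestsSucceed is a hypothetical helper mirroring the test's
// assertion: send `want` concurrent requests and count how many return 200.
func assertAllRequestsSucceed(url string, want int) error {
	var got int64
	var wg sync.WaitGroup
	for i := 0; i < want; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			resp, err := http.Get(url)
			if err != nil {
				return // a transport error counts as a failed request
			}
			defer resp.Body.Close()
			if resp.StatusCode == http.StatusOK {
				atomic.AddInt64(&got, 1)
			}
		}()
	}
	wg.Wait()
	if int(got) != want {
		// Matches the shape of the reported failure message.
		return fmt.Errorf("Error making requests for scale up. Got %d successful requests. Wanted %d.", got, want)
	}
	return nil
}
```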

Steps to Reproduce the Problem

  1. Run the test a few times.

Additional Info

(attached screenshot not included)

@knative-prow-robot added the area/autoscale, area/test-and-release, and kind/bug labels on Oct 30, 2018
@mattmoor

/assign @dgerd

@josephburnett Dan volunteered to help a bit here.

@tcnghia commented Oct 31, 2018

The autoscaling issue seems to be caused by requests hitting Terminating pods. In those pods, istio-proxy usually dies first, causing the Terminating pod to reject all messages, which results in 503 upstream reset/disconnect-before-headers errors. We used to have a preStop sleep for this (tcnghia@20ca8e5#diff-d4961cee95b4d08627915faf1c62c4d3L1114), but later removed it because adding retries let us avoid it. However, we are seeing the same issue again now.

I think we should add back the preStop sleep to unblock PRs while working on a better fix.
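
For reference, a hedged sketch of what re-adding the preStop sleep could look like on the container spec (types from k8s.io/api/core/v1; the exact place the knative deployment builder sets this may differ, and the 20s value is illustrative, not the value from the original diff):

```go
package resources

import corev1 "k8s.io/api/core/v1"

// addPreStopSleep gives the container a grace window after SIGTERM so that
// in-flight requests can drain before istio-proxy and the container exit.
func addPreStopSleep(c *corev1.Container) {
	c.Lifecycle = &corev1.Lifecycle{
		PreStop: &corev1.LifecycleHandler{ // named corev1.Handler in older API versions
			Exec: &corev1.ExecAction{
				Command: []string{"/bin/sleep", "20"},
			},
		},
	}
}
```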

@tcnghia commented Oct 31, 2018

/assign @tcnghia

@josephburnett

One other action item might be to create another E2E test that just serves variable traffic over time. I think that would help narrow down the cause more quickly; up, down, and up is a lot to debug all at once. A rough sketch of what I mean is below.
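
A hypothetical outline of such a test (names and load levels are made up, not from the knative test suite): ramp concurrency up and back down, and record the error count at each level so a burst of 503s can be tied to a specific scaling transition.

```go
package e2e

import (
	"net/http"
	"sync"
	"time"
)

// runVariableTraffic drives the route at each concurrency level for `hold`
// and returns the number of failed (non-200 or errored) requests per level.
func runVariableTraffic(url string, levels []int, hold time.Duration) []int {
	failures := make([]int, len(levels))
	for i, concurrency := range levels {
		var mu sync.Mutex
		var wg sync.WaitGroup
		deadline := time.Now().Add(hold)
		for c := 0; c < concurrency; c++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				for time.Now().Before(deadline) {
					resp, err := http.Get(url)
					if err != nil || resp.StatusCode != http.StatusOK {
						mu.Lock()
						failures[i]++
						mu.Unlock()
					}
					if resp != nil {
						resp.Body.Close()
					}
				}
			}()
		}
		wg.Wait()
	}
	return failures
}

// Example usage: runVariableTraffic(routeURL, []int{1, 10, 50, 10, 1}, 30*time.Second)
```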

@tcnghia commented Oct 31, 2018

@dgerd has an excellent way to repro this consistently by killing the revision's Pod. I think we should add that test.
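
A hedged sketch of that repro as a test helper, assuming the revision's pods carry the serving.knative.dev/revision label and using a plain client-go clientset (illustrative only, not the exact steps @dgerd used): deleting the pods while traffic is flowing forces requests onto Terminating pods and reproduces the 503s described above.

```go
package e2e

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// killRevisionPods deletes the revision's pods so in-flight traffic briefly
// hits Terminating pods.
func killRevisionPods(ctx context.Context, kube kubernetes.Interface, namespace, revision string) error {
	pods, err := kube.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		LabelSelector: fmt.Sprintf("serving.knative.dev/revision=%s", revision),
	})
	if err != nil {
		return err
	}
	for _, p := range pods.Items {
		if err := kube.CoreV1().Pods(namespace).Delete(ctx, p.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	return nil
}
```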

@lvjing2 commented Oct 31, 2018

Thanks for your help; this also fixed part of the problems in #2311 and #2344.

@lvjing2 commented Nov 6, 2018

Hi, is this problem still there? If so, I'd like to try digging into it as well.

@adrcunha (author) commented Nov 6, 2018

Actually it's way better now: in the last 24 runs, it only failed once. I'm closing this issue, thanks everyone.

@adrcunha adrcunha closed this as completed Nov 6, 2018