Fix unstable TestAgent_ReloadConfigAndKeepChecksStatus #7426

pierresouchay · 2020-03-10T16:36:54Z

On my machine, with some load, TestAgent_ReloadConfigAndKeepChecksStatus is failing 1/3 of time
because interval is low (1s). Remove the first test for critical as it is tested elsewhere.

Setting the interval to 3s of interval also fixes the issue but slow down all tests

This will avoid test to be unstable and fix #7425.

test protocal on my machine to ensure it is not a bigger problem:

while true
do
  go clean -testcache
  go test -timeout 30s github.com/hashicorp/consul/agent -run '^TestAgent_ReloadConfigAndKeepChecksStatus$'
done

=> with 1s interval => fails 1/3 of time
=> with 3s interval => never fails

On my machine, with some load, TestAgent_ReloadConfigAndKeepChecksStatus is failing 1/3 of time because interval is low (1s). Remove the first test for critical as it is tested elsewhere. Setting the interval to 3s of interval also fixes the issue but slow down all tests This will avoid test to be unstable and fix hashicorp#7425. test protocal on my machine to ensure it is not a bigger problem: ```shell while true do go clean -testcache go test -timeout 30s github.com/hashicorp/consul/agent -run '^TestAgent_ReloadConfigAndKeepChecksStatus$' done ``` => with 1s interval => fails 1/3 of time => with 3s interval => never fails go test -timeout 30s github.com/hashicorp/consul/agent -run ^(TestAgent_ReloadConfigAndKeepChecksStatus)$

rboyer · 2020-03-10T16:59:27Z

I really suspect that something fishier is going on (though I'm happy to be proved wrong). I remember originally fixing the check state snapshot/restore code to not be racy before (and I guess unfortunately introducing a different bug). This bug manifestation seems suspiciously similar to the original pre-fixed version of the code that was racy.

pierresouchay · 2020-03-10T17:25:32Z

@rboyer I think in this case, this is just my test being racy (I mean 1s interval reached before testing, too agressive when machine was loaded). Anyway, I could reproduce it clearly on my machine.

Maybe one of our changes in state (we had lots of racy conditions in our clusters previously) did fix your old issue:

crhino · 2020-03-24T16:42:10Z

Hey @pierresouchay, I ran into this today and also took a crack at fixing it before someone pointed me to your PR!

In talking about it with other team members, they mentioned using a TTL check would alleviate all timing problems and make it deterministic. #7490 changes the test to do that. Would love your input on that approach!

pierresouchay · 2020-03-25T08:14:18Z

@crhino Yes, #7490 sounds like a better fix, closing this PR

pierresouchay mentioned this pull request Mar 10, 2020

flaky test: TestAgent_ReloadConfigAndKeepChecksStatus #7425

Closed

pierresouchay closed this Mar 25, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix unstable TestAgent_ReloadConfigAndKeepChecksStatus #7426

Fix unstable TestAgent_ReloadConfigAndKeepChecksStatus #7426

pierresouchay commented Mar 10, 2020 •

edited

Loading

rboyer commented Mar 10, 2020

pierresouchay commented Mar 10, 2020

crhino commented Mar 24, 2020

pierresouchay commented Mar 25, 2020

Fix unstable TestAgent_ReloadConfigAndKeepChecksStatus #7426

Fix unstable TestAgent_ReloadConfigAndKeepChecksStatus #7426

Conversation

pierresouchay commented Mar 10, 2020 • edited Loading

rboyer commented Mar 10, 2020

pierresouchay commented Mar 10, 2020

crhino commented Mar 24, 2020

pierresouchay commented Mar 25, 2020

pierresouchay commented Mar 10, 2020 •

edited

Loading