Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix unstable TestAgent_ReloadConfigAndKeepChecksStatus #7426

Conversation

pierresouchay
Copy link
Contributor

@pierresouchay pierresouchay commented Mar 10, 2020

On my machine, with some load, TestAgent_ReloadConfigAndKeepChecksStatus is failing 1/3 of time
because interval is low (1s). Remove the first test for critical as it is tested elsewhere.

Setting the interval to 3s of interval also fixes the issue but slow down all tests

This will avoid test to be unstable and fix #7425.

test protocal on my machine to ensure it is not a bigger problem:

while true
do
  go clean -testcache
  go test -timeout 30s github.com/hashicorp/consul/agent -run '^TestAgent_ReloadConfigAndKeepChecksStatus$'
done

=> with 1s interval => fails 1/3 of time
=> with 3s interval => never fails

On my machine, with some load, TestAgent_ReloadConfigAndKeepChecksStatus is failing 1/3 of time
because interval is low (1s). Remove the first test for critical as it is tested elsewhere.

Setting the interval to 3s of interval also fixes the issue but slow down all tests

This will avoid test to be unstable and fix hashicorp#7425.

test protocal on my machine to ensure it is not a bigger problem:

```shell
while true
do
  go clean -testcache
  go test -timeout 30s github.com/hashicorp/consul/agent -run '^TestAgent_ReloadConfigAndKeepChecksStatus$'
done
```

=> with 1s interval => fails 1/3 of time
=> with 3s interval => never fails

go test -timeout 30s github.com/hashicorp/consul/agent -run ^(TestAgent_ReloadConfigAndKeepChecksStatus)$
@rboyer
Copy link
Member

rboyer commented Mar 10, 2020

I really suspect that something fishier is going on (though I'm happy to be proved wrong). I remember originally fixing the check state snapshot/restore code to not be racy before (and I guess unfortunately introducing a different bug). This bug manifestation seems suspiciously similar to the original pre-fixed version of the code that was racy.

@pierresouchay
Copy link
Contributor Author

@rboyer I think in this case, this is just my test being racy (I mean 1s interval reached before testing, too agressive when machine was loaded). Anyway, I could reproduce it clearly on my machine.

Maybe one of our changes in state (we had lots of racy conditions in our clusters previously) did fix your old issue:

@crhino
Copy link
Contributor

crhino commented Mar 24, 2020

Hey @pierresouchay, I ran into this today and also took a crack at fixing it before someone pointed me to your PR!

In talking about it with other team members, they mentioned using a TTL check would alleviate all timing problems and make it deterministic. #7490 changes the test to do that. Would love your input on that approach!

@pierresouchay
Copy link
Contributor Author

@crhino Yes, #7490 sounds like a better fix, closing this PR

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

flaky test: TestAgent_ReloadConfigAndKeepChecksStatus
3 participants