
FlowAggregator e2e test IntraNodeFlow is flaky #2369

Closed
zyiou opened this issue Jul 9, 2021 · 4 comments
Labels
- area/flow-visibility: Issues or PRs related to flow visibility support in Antrea
- area/test/e2e: Issues or PRs related to Antrea specific end-to-end testing.
- kind/bug: Categorizes issue or PR as related to a bug.
- lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@zyiou (Contributor) commented on Jul 9, 2021:

Describe the bug
In Kind e2e test runs, the IntraNodeFlow test fails with the following log:

```
assertion_compare.go:221:
        	Error Trace:	flowaggregator_test.go:603
        	            	flowaggregator_test.go:213
        	Error:      	"1" is not greater than or equal to "2"
        	Test:       	TestFlowAggregator/IPv4/IntraNodeFlows
        	Messages:   	[IPFIX collector should receive expected number of flow records. Considered records: %s
```
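
For context, the failing check is the record-count assertion against whatever the IPFIX collector captured. Below is a minimal sketch of that kind of check, assuming testify's `assert` package; `getCollectorRecords` is a hypothetical helper and the actual code in flowaggregator_test.go may differ:

```go
// Hypothetical sketch of the failing check, not the actual test code.
// getCollectorRecords is a made-up helper standing in for however the e2e
// test scrapes the IPFIX collector output and splits it into records.
package e2e

import (
	"testing"

	"github.com/stretchr/testify/assert"
)

// getCollectorRecords would return the flow records currently captured by
// the IPFIX collector Pod (hypothetical).
func getCollectorRecords(t *testing.T) []string {
	// ... scrape the collector Pod logs and split them into records ...
	return nil
}

func checkRecordCount(t *testing.T, expected int) {
	records := getCollectorRecords(t)
	// The flake manifests here: only 1 record is seen where at least 2 are expected.
	assert.GreaterOrEqualf(t, len(records), expected,
		"IPFIX collector should receive expected number of flow records. Considered records: %v", records)
}
```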

It happens only in the "E2e tests on a Kind cluster on Linux with Antrea-native policies disabled" job; not sure whether that is related.
Failure runs:
https://github.com/antrea-io/antrea/runs/3021700709?check_suite_focus=true
https://github.com/antrea-io/antrea/runs/3024202914?check_suite_focus=true

To Reproduce
Run the e2e tests multiple times on a Kind cluster at ToT (tip of tree).

Expected
The test should pass.

Actual behavior
See description

@zyiou zyiou added area/test/e2e Issues or PRs related to Antrea specific end-to-end testing. kind/bug Categorizes issue or PR as related to a bug. area/flow-visibility Issues or PRs related to flow visibility support in Antrea labels Jul 9, 2021
@antoninbas (Contributor) commented:

I think that's related to a recent change by @srikartati?

@srikartati (Member) commented on Jul 9, 2021:

I see the same error here after patch #2308 was merged: https://github.com/antrea-io/antrea/runs/3013964846
Unfortunately, the flakiness still seems to be there even after fixing an issue where we were not considering the complete logs, which was causing the same intermittent failure.
I have seen this locally, but did not have sufficient logs to debug further and did not hit the error again. I also did not hit it after running the Kind tests multiple times in CI.

I added some more instrumentation to the test to debug this. In all the failed cases, the aggregated flow record sent because of the idle flow timeout, immediately after the start of the iperf traffic, is missing.
Looking at the collector logs, I see that the templates were sent twice by the Flow Aggregator, with a considerable time gap of 4 to 9 seconds. The second set of templates arrives right around the time the iperf traffic starts. The exporting process in the Flow Aggregator may have restarted, which would explain the two sets of templates. I am presuming that the collecting process in the Flow Aggregator also restarted, so the first aggregated flow record was never received, which matches the failure.

One option is to wait for the Flow Aggregator to be completely ready before starting the iperf traffic. I am not sure yet how to check for that, but it is one possibility; a rough sketch is below.
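
A rough sketch of that idea, assuming client-go and a Deployment-based readiness check; `waitForFlowAggregatorReady` and the use of Deployment availability as the readiness signal are assumptions, not the actual test helpers:

```go
// Hypothetical sketch: wait for the Flow Aggregator to be ready before
// starting iperf traffic. Using Deployment availability as the readiness
// signal is an assumption; the real test might need a more specific signal
// (e.g. the aggregator accepting exporter connections).
package e2e

import (
	"context"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

func waitForFlowAggregatorReady(client kubernetes.Interface, namespace, name string) error {
	return wait.PollImmediate(2*time.Second, 2*time.Minute, func() (bool, error) {
		deploy, err := client.AppsV1().Deployments(namespace).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			// Keep polling on transient errors instead of failing immediately.
			return false, nil
		}
		return deploymentAvailable(deploy), nil
	})
}

// deploymentAvailable reports whether the Deployment has the Available condition set to True.
func deploymentAvailable(d *appsv1.Deployment) bool {
	for _, cond := range d.Status.Conditions {
		if cond.Type == appsv1.DeploymentAvailable && cond.Status == "True" {
			return true
		}
	}
	return false
}
```

Note that Deployment availability alone may not guarantee that the exporting and collecting processes inside the Flow Aggregator are ready to accept connections, so the real fix may need a different readiness signal.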

@srikartati (Member) commented:

> Looking at those collector logs, I see that the templates were sent two times by the flow aggregator with a considerable time gap of 4 to 9s. The second set of templates is right around the time when the iperf traffic starts. The exporting process in the Flow Aggregator may have restarted and that is why we see two sets of templates.

The exporting process reset can definitely be explained by the bug fixed in PR #2546.
We should keep track of the flakiness after the PR is merged.

@github-actions (bot) commented:

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment, or this will be closed in 90 days

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 17, 2021