FlowAggregator bandwidth tests are too flaky #2283
Comments
(Moved comment to #2282 as it is more related)
@zyiou This is a good find. It makes sense to cover all the cases through different templates. This issue is pertinent to the dual-stack case and doesn't explain the flakiness in the other test setups raised by this PR, so it is better to move it to issue #2282. I think we have to check the combination before sending each aggregated flow record. If that is the case, it's better to have this info as metadata and pick the correct template ID when adding the data record.
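(For illustration, a minimal sketch of that idea with hypothetical names, not Antrea's actual API: keep the source/destination address families as per-record metadata and look up the matching template ID before adding each data record.)

```go
// Hypothetical sketch only: check the address-family combination carried as
// record metadata and pick the matching template ID before sending the record.
package main

import "fmt"

type ipFamily int

const (
	ipv4 ipFamily = iota
	ipv6
)

type recordMetadata struct {
	sourceFamily      ipFamily
	destinationFamily ipFamily
}

// templateIDFor looks up the template registered for the given
// source/destination address-family combination.
func templateIDFor(templates map[[2]ipFamily]uint16, m recordMetadata) (uint16, bool) {
	id, ok := templates[[2]ipFamily{m.sourceFamily, m.destinationFamily}]
	return id, ok
}

func main() {
	// Illustrative template IDs; dual-stack combinations get their own templates.
	templates := map[[2]ipFamily]uint16{
		{ipv4, ipv4}: 256,
		{ipv6, ipv6}: 257,
		{ipv4, ipv6}: 258,
		{ipv6, ipv4}: 259,
	}
	id, ok := templateIDFor(templates, recordMetadata{ipv4, ipv6})
	fmt.Println(id, ok) // 258 true
}
```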
@dreamtalen Could you comment on this bandwidth issue based on your experience working on a similar bug earlier (#2211)? Are these related?
Sure, the error in this screenshot shows that the test failed at the average bandwidth check using …
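(For context, a minimal sketch of the kind of calculation involved, with assumed names rather than the actual flowaggregator_test.go helpers: if the collector misses a data record, the summed octet count shrinks and the computed average falls below the expected bandwidth.)

```go
package e2e

// Illustrative only (assumed names, not the actual test helpers): average
// bandwidth in Mbps derived from an IPFIX octet count and the known duration
// of the iperf session.
func averageBandwidthMbps(octetTotalCount uint64, durationSeconds float64) float64 {
	return float64(octetTotalCount) * 8 / durationSeconds / 1e6
}
```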
I thought the intention of the workaround placed in the e2e flow aggregator test was to make the test stable. Do you know why it became flaky?
I don't think the current workaround makes the test flaky; I just triggered the e2e test with a local IPv6 Vagrant cluster multiple times and no failure happened. Maybe we should look through the records received by the collector in this failed case to find the root cause.
@antoninbas Hi Antonin, is the failure shown in the screenshot from an IPv6-only Jenkins test job? Could you share how frequently it happens and how to reproduce it if possible? Thanks!
@dreamtalen all the information I have is in the issue. I don't recall which testbed it was, but I do know I observed the same issue on IPv4 testbeds as well. Until recently there was a memory leak in the test binary causing a lot of test failures, as one of the Nodes (the control-plane Node) would run out of memory. It's possible that it could explain such failures, although I'm not sure how. You can keep monitoring the jobs to see if it happens again.
@dreamtalen I just saw this failure on an IPv4 testbed: https://jenkins.antrea-ci.rocks/job/antrea-e2e-for-pull-request/2745/console
This PR doesn't have the memory leak patch, but I am not sure the memory leak affected IPv4 testbeds. |
Thanks Antonin, I'm looking at it.
Update: I reproduced this bandwidth check failure successfully. The root cause is that the ipfix-collector only received two data records and missed the last one, so the average bandwidth calculated by the check at antrea/test/e2e/flowaggregator_test.go line 742 (commit 09847f7) came out lower than expected.
To solve this issue, we need to improve the check condition in wait.PollImmediate() to only consider the data flow records of the iperf traffic. The --cport n option, which specifies the client-side port of the iperf command, may help distinguish the control flow from the data flow because they have different client-side ports.
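A rough sketch of what that improved condition could look like; the record type and helpers below are assumptions for illustration, not the actual test code (`wait.PollImmediate` is the poll helper from `k8s.io/apimachinery/pkg/util/wait`):

```go
package e2e

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// Pinned client-side port, e.g. run the client as:
//   iperf3 -c <server> --cport 44444
// The port value and everything below are illustrative assumptions.
const iperfClientPort = 44444

type flowRecord struct {
	sourceTransportPort uint16
	octetTotalCount     uint64
}

// collectorRecords is a hypothetical stand-in for retrieving the records
// received by the ipfix-collector.
func collectorRecords() []flowRecord { return nil }

// isIperfDataRecord keeps only the iperf data flow: the control connection
// uses a different (ephemeral) client-side port, so it no longer matches.
func isIperfDataRecord(r flowRecord) bool {
	return r.sourceTransportPort == iperfClientPort
}

// waitForIperfDataRecord polls until the collector has reported a record
// for the pinned iperf data flow.
func waitForIperfDataRecord() error {
	return wait.PollImmediate(500*time.Millisecond, 30*time.Second, func() (bool, error) {
		for _, r := range collectorRecords() {
			if isIperfDataRecord(r) {
				return true, nil
			}
		}
		return false, nil
	})
}
```

Filtering on the pinned client port means the short-lived iperf control connection can no longer be mistaken for the data flow when averaging bandwidth.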
Great. The source port could definitely help in resolving the issue. Earlier we talked about using …
Describe the bug
I have been seeing test failures like this one frequently:
It seems to happen more frequently for IPv6 e2e test jobs than for IPv4 e2e test jobs, but I don't know if there is a real correlation there.