FlowAggregator bandwidth tests are too flaky #2283
Comments
(Moved comment to #2282 as it is more related)
@zyiou This is a good find. It makes sense to cover all the cases through different templates. This issue is pertinent to the dual-stack case and doesn't explain the flakiness in the other test setups raised by this PR, so it is better to move it to issue #2282. I think we have to check the combination before sending each aggregated flow record. If that is the case, it's better to have this info as metadata and pick the correct template ID when adding the data record.
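(For illustration, a minimal sketch of that idea with hypothetical names, not Antrea's actual API: keep the source/destination address families as per-record metadata and look up the matching template ID before adding each data record.)

```go
// Hypothetical sketch only: check the address-family combination carried as
// record metadata and pick the matching template ID before sending the record.
package main

import "fmt"

type ipFamily int

const (
	ipv4 ipFamily = iota
	ipv6
)

type recordMetadata struct {
	sourceFamily      ipFamily
	destinationFamily ipFamily
}

// templateIDFor looks up the template registered for the given
// source/destination address-family combination.
func templateIDFor(templates map[[2]ipFamily]uint16, m recordMetadata) (uint16, bool) {
	id, ok := templates[[2]ipFamily{m.sourceFamily, m.destinationFamily}]
	return id, ok
}

func main() {
	// Illustrative template IDs; dual-stack combinations get their own templates.
	templates := map[[2]ipFamily]uint16{
		{ipv4, ipv4}: 256,
		{ipv6, ipv6}: 257,
		{ipv4, ipv6}: 258,
		{ipv6, ipv4}: 259,
	}
	id, ok := templateIDFor(templates, recordMetadata{ipv4, ipv6})
	fmt.Println(id, ok) // 258 true
}
```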
@dreamtalen Could you comment on this bandwidth issue based on your experience working on a similar bug earlier (#2211)? Are these related?
Sure, the error in this screenshot shows that the test failed at the average bandwidth check using …
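(For context, a minimal sketch of the kind of calculation involved, with assumed names rather than the actual flowaggregator_test.go helpers: if the collector misses a data record, the summed octet count shrinks and the computed average falls below the expected bandwidth.)

```go
package e2e

// Illustrative only (assumed names, not the actual test helpers): average
// bandwidth in Mbps derived from an IPFIX octet count and the known duration
// of the iperf session.
func averageBandwidthMbps(octetTotalCount uint64, durationSeconds float64) float64 {
	return float64(octetTotalCount) * 8 / durationSeconds / 1e6
}
```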
I thought the intention of the workaround placed in the e2e flow aggregator test was to make the test stable. Do you know why it became flaky?
I don't think the current workaround makes the test flaky; I just triggered the e2e test with a local IPv6 Vagrant cluster multiple times and no failure happened. Maybe we should look through the records received by the collector in this failed case to find the root cause.
@antoninbas Hi Antonin, is the failure shown in the screenshot from an IPv6-only Jenkins test job? Could you share how frequently it happens and how to reproduce it if possible? Thanks!
@dreamtalen all the information I have is in the issue. I don't recall which testbed it was, but I do know I observed the same issue on IPv4 testbeds as well. Until recently there was a memory leak in the test binary causing a lot of test failures, as one of the Nodes (the control-plane Node) would run out of memory. It's possible that it could explain such failures, although I'm not sure how. You can keep monitoring the jobs to see if it happens again.
@dreamtalen I just saw this failure on an IPv4 testbed: https://jenkins.antrea-ci.rocks/job/antrea-e2e-for-pull-request/2745/console
This PR doesn't have the memory leak patch, but I am not sure the memory leak affected IPv4 testbeds. |
Thanks Antonin, I'm looking at it.
Update: I reproduced this bandwidth check failure successfully. The root cause is that the ipfix-collector only received two data records and missed the last one, so the average bandwidth calculated by the check at antrea/test/e2e/flowaggregator_test.go line 742 (commit 09847f7) came out lower than expected.
To solve this issue, we need to improve the check condition in wait.PollImmediate() to only consider the data flow records of the iperf traffic. The --cport n option, which specifies the client-side port of the iperf command, may help distinguish the control flow from the data flow because they have different client-side ports.
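A rough sketch of what that improved condition could look like; the record type and helpers below are assumptions for illustration, not the actual test code (`wait.PollImmediate` is the poll helper from `k8s.io/apimachinery/pkg/util/wait`):

```go
package e2e

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// Pinned client-side port, e.g. run the client as:
//   iperf3 -c <server> --cport 44444
// The port value and everything below are illustrative assumptions.
const iperfClientPort = 44444

type flowRecord struct {
	sourceTransportPort uint16
	octetTotalCount     uint64
}

// collectorRecords is a hypothetical stand-in for retrieving the records
// received by the ipfix-collector.
func collectorRecords() []flowRecord { return nil }

// isIperfDataRecord keeps only the iperf data flow: the control connection
// uses a different (ephemeral) client-side port, so it no longer matches.
func isIperfDataRecord(r flowRecord) bool {
	return r.sourceTransportPort == iperfClientPort
}

// waitForIperfDataRecord polls until the collector has reported a record
// for the pinned iperf data flow.
func waitForIperfDataRecord() error {
	return wait.PollImmediate(500*time.Millisecond, 30*time.Second, func() (bool, error) {
		for _, r := range collectorRecords() {
			if isIperfDataRecord(r) {
				return true, nil
			}
		}
		return false, nil
	})
}
```

Filtering on the pinned client port means the short-lived iperf control connection can no longer be mistaken for the data flow when averaging bandwidth.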
Great. The source port could definitely help in resolving the issue. Earlier we talked about using …
Describe the bug
I have been seeing test failures like this one frequently:
It seems to happen more frequently for IPv6 e2e test jobs than for IPv4 e2e test jobs, but I don't know if there is a real correlation there.