TestFlowAggregator failed when AntreaIPAM is used #2980

Closed
tnqn opened this issue Nov 4, 2021 · 7 comments · Fixed by #2983
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@tnqn
Member

tnqn commented Nov 4, 2021

Describe the bug

The test fails very frequently on the "jenkins-flexible-ipam-e2e" testbed, a cluster on which AntreaIPAM is enabled:

=== RUN   TestFlowAggregator/IPv4/InterNodeFlows
I1104 11:31:48.223994   27153 k8s_util.go:655] Creating/updating Antrea NetworkPolicy antrea-test/test-flow-aggregator-antrea-networkpolicy-ingress
I1104 11:31:48.237402   27153 k8s_util.go:655] Creating/updating Antrea NetworkPolicy antrea-test/test-flow-aggregator-antrea-networkpolicy-egress
    flowaggregator_test.go:874: Antrea Network Policies are realized.
    flowaggregator_test.go:601: Check the average bandwidth using octetTotalCountFromSourceNode 7731439957 in data record.
    flowaggregator_test.go:714: Iperf throughput: 6993.92 Mbits/s, IPFIX record throughput calculated through octetTotalCountFromSourceNode: 5154.29 Mbits/s
    flowaggregator_test.go:715: 
        	Error Trace:	flowaggregator_test.go:715
        	            				flowaggregator_test.go:602
        	            				flowaggregator_test.go:340
        	Error:      	Max difference between 5154.293304666667 and 6993.92 allowed is 1049.088, but difference was -1839.6266953333334
        	Test:       	TestFlowAggregator/IPv4/InterNodeFlows
        	Messages:   	Difference between Iperf bandwidth and IPFIX record bandwidth calculated through octetTotalCountFromSourceNode should be lower than 15%
    assertion_compare.go:323: 
        	Error Trace:	flowaggregator_test.go:606
        	            				flowaggregator_test.go:340
        	Error:      	"2" is not greater than or equal to "3"
        	Test:       	TestFlowAggregator/IPv4/InterNodeFlows
        	Messages:   	[IPFIX collector should receive expected number of flow records. Considered records: %s 
        	            	 Collector output: %s [I1104 11:30:30.044932       1 collector.go:149] Starting IPFIX collector
        	            	I1104 11:30:30.046424       1 tcp.go:38] Start TCP collecting process on [::]:4739
        	            	I1104 11:30:53.374638       1 collector.go:172] Processing IPFIX message
        	            	I1104 11:30:53.374655       1 collector.go:172] Processing IPFIX message
...

Versions:

  • Antrea version (Docker image tag): main-a48f4db0608b30cb420032f5e8e42eb79fa499cc, v1.4.0
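
For reference, the numbers in the first assertion failure can be reproduced with a short sketch. It assumes the IPFIX throughput is the logged octetTotalCountFromSourceNode converted to bits and divided by a 12-second iperf run (the 12 is inferred from the logged values, not stated in this issue), and that the 15% tolerance is applied as an absolute delta of 0.15 times the iperf figure. The function names are hypothetical and are not taken from flowaggregator_test.go.

```go
package main

import (
	"fmt"
	"math"
)

// throughputMbps converts a byte count observed over durationSec seconds
// into Mbits/s.
func throughputMbps(octets uint64, durationSec float64) float64 {
	return float64(octets) * 8 / durationSec / 1e6
}

// withinTolerance reports whether got differs from want by at most
// ratio*want (an absolute-delta check).
func withinTolerance(got, want, ratio float64) bool {
	return math.Abs(got-want) <= want*ratio
}

func main() {
	const iperfTimeSec = 12.0 // assumption: inferred from the logged numbers
	const iperfMbps = 6993.92 // from the log line at flowaggregator_test.go:714

	ipfixMbps := throughputMbps(7731439957, iperfTimeSec)
	fmt.Printf("IPFIX throughput: %.2f Mbits/s\n", ipfixMbps)
	fmt.Printf("allowed delta:    %.3f Mbits/s\n", iperfMbps*0.15)
	fmt.Printf("actual delta:     %.2f Mbits/s\n", ipfixMbps-iperfMbps)
	fmt.Printf("within 15%%:       %v\n", withinTolerance(ipfixMbps, iperfMbps, 0.15))
}
```

Under these assumptions the octet count in the considered record is too low for a full run, which is consistent with the second failure (only 2 of the expected 3 records were considered).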
@tnqn tnqn added the kind/bug Categorizes issue or PR as related to a bug. label Nov 4, 2021
@tnqn
Member Author

tnqn commented Nov 4, 2021

@srikartati could you please take a look or assign it to the proper developer?

@tnqn tnqn mentioned this issue Nov 4, 2021
@srikartati
Member

Hi @dreamtalen, this issue is very similar to the one you root-caused before in #2283 (comment).
Could you take a look at what is different in the new testbed that the flexible IPAM feature is using? The test passes on other CI testbeds but fails frequently on the "jenkins-flexible-ipam-e2e" testbed.

@dreamtalen
Contributor

Hi @dreamtalen, this issue is very similar to the one you root-caused before in #2283 (comment). Could you take a look at what is different in the new testbed that the flexible IPAM feature is using? The test passes on other CI testbeds but fails frequently on the "jenkins-flexible-ipam-e2e" testbed.

Sure, I will take a look.

@dreamtalen
Contributor

Update: it looks like this failure may be caused by a time difference between the K8s Pod and the testbed host.
An example of logs from a failed test running on ipam testbed: http://10.176.27.169:8080/job/antrea-flexible-ipam-e2e-for-pull-request/139/consoleFull

flowaggregator_test.go:788: Flow passed condition with flowStartSeconds 1636070251, flowEndSeconds 1636070259, timeStart 1636070245, srcPort 52804
flowaggregator_test.go:591: 1st Data record with flowStartSeconds 1636070251, flowEndSeconds 1636070255, timeStart 1636070245, totalCount 11998432821
flowaggregator_test.go:591: 2nd Data record with flowStartSeconds 1636070251, flowEndSeconds 1636070259, timeStart 1636070245, totalCount 26157627473
flowaggregator_test.go:591: 3rd Data record with flowStartSeconds 1636070251, flowEndSeconds 1636070263, timeStart 1636070245, totalCount 38674431709

We can see a 6-second difference between flowStartSeconds and timeStart. On other testbeds, this difference is less than 1 second, for example https://jenkins.antrea-ci.rocks/job/antrea-e2e-for-pull-request/3506/consoleFull.
Our current condition is:

if exportTime >= timeStart.Unix()+iperfTimeSec {

In this case, the 2nd data record can pass the condition, so the e2e test does not wait for the 3rd data record.
To solve this issue, we could check the time sync between the host and the K8s Pod on this testbed, or we could change our condition to use flowStartSeconds instead of timeStart.
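
The effect of the clock offset can be illustrated with a minimal sketch that replays the three logged records against the host-time condition. It is not the test's code. It assumes iperfTimeSec is 12 and uses flowEndSeconds as a stand-in for exportTime, since only flowEndSeconds appears in the logs. The type and function names are hypothetical, and the second call uses a made-up timeStart equal to flowStartSeconds to model a well-synced testbed.

```go
package main

import "fmt"

// record holds the two timestamps (Unix seconds) logged for each data record.
type record struct {
	flowStartSeconds int64
	flowEndSeconds   int64
}

// loggedRecords are the three data records quoted from the failing run.
var loggedRecords = []record{
	{1636070251, 1636070255},
	{1636070251, 1636070259},
	{1636070251, 1636070263},
}

// firstPassingHostCheck returns the 1-based position of the first record
// whose end time satisfies end >= timeStart+iperfTimeSec, or 0 if none does.
// flowEndSeconds stands in for exportTime here (an assumption).
func firstPassingHostCheck(records []record, timeStart, iperfTimeSec int64) int {
	for i, r := range records {
		if r.flowEndSeconds >= timeStart+iperfTimeSec {
			return i + 1
		}
	}
	return 0
}

func main() {
	const iperfTimeSec = 12 // assumption: not stated in this issue

	// Host clock 6 seconds behind the flow start, as in the failing run.
	fmt.Println(firstPassingHostCheck(loggedRecords, 1636070245, iperfTimeSec)) // prints 2

	// Hypothetical well-synced host: timeStart equals flowStartSeconds.
	fmt.Println(firstPassingHostCheck(loggedRecords, 1636070251, iperfTimeSec)) // prints 3
}
```

With the 6-second offset the threshold drops to 1636070257, so the record ending at 1636070259 is accepted before the flow has covered the full iperf run.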

@antoninbas
Contributor

@dreamtalen it's probably better to write the test in such a way that it doesn't assume time sync between 1) the host running the Go tests and 2) K8s Nodes / Pods.

@srikartati
Member

Thanks @dreamtalen

we could change our condition to use flowStartSeconds instead of timeStart

I think comparing flowStartSeconds and flowEndSeconds is a decent solution for this issue.
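
A check that depends only on timestamps carried in the record could look like the sketch below. It is an illustration of the idea under the assumption that iperfTimeSec is 12, not the code merged for this issue. The names flowRecord and coversFullRun are hypothetical.

```go
package main

import "fmt"

// flowRecord holds timestamps (Unix seconds) that both come from the same
// clock, the one that stamped the flow record.
type flowRecord struct {
	flowStartSeconds int64
	flowEndSeconds   int64
}

// coversFullRun reports whether the record spans at least iperfTimeSec
// seconds. The clock of the host running the Go tests is never consulted.
func coversFullRun(r flowRecord, iperfTimeSec int64) bool {
	return r.flowEndSeconds-r.flowStartSeconds >= iperfTimeSec
}

func main() {
	const iperfTimeSec = 12 // assumption: not stated in this issue

	// The three data records logged in the failing run.
	records := []flowRecord{
		{1636070251, 1636070255},
		{1636070251, 1636070259},
		{1636070251, 1636070263},
	}
	for i, r := range records {
		fmt.Printf("record %d covers full run: %v\n", i+1, coversFullRun(r, iperfTimeSec))
	}
}
```

Because both operands come from the record, any offset between the test host and the K8s Nodes or Pods cancels out, and only the 3rd logged record (spanning 12 seconds) is accepted.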

dreamtalen pushed a commit that referenced this issue Nov 5, 2021
In this PR, we change to comparing flow export time with flow start time
in flow record instead of the test start time of the host running tests.
Fix issue #2980

Signed-off-by: Yongming Ding <[email protected]>
dreamtalen pushed a commit that referenced this issue Nov 8, 2021
In this PR, we change to comparing flow export time with flow start time
in flow record instead of the test start time of the host running tests.
Fix issue #2980

Signed-off-by: Yongming Ding <[email protected]>
dreamtalen pushed a commit that referenced this issue Nov 9, 2021
In this PR, we change to comparing flow export time with flow start time
in flow record instead of the test start time of the host running tests.
Fix issue #2980

Signed-off-by: Yongming Ding <[email protected]>
@dreamtalen
Contributor

I have a fix in #2983 that removes the comparison with host time. I triggered test-flexible-ipam-e2e multiple times and no FlowAggregator test failure occurred.

tnqn pushed a commit that referenced this issue Nov 11, 2021
In this PR, we change to comparing flow export time with flow start time
in flow record instead of the test start time of the host running tests.
Fix issue #2980

Signed-off-by: Yongming Ding <[email protected]>