Fix flakiness in Kind e2e flow aggregator tests #2308
Conversation
Codecov Report
@@            Coverage Diff             @@
##             main    #2308       +/-  ##
===========================================
+ Coverage   41.66%   65.47%    +23.81%
===========================================
  Files         144      281       +137
  Lines       17810    21771      +3961
===========================================
+ Hits         7420    14254      +6834
+ Misses       9732     6071      -3661
- Partials      658     1446       +788

Flags with carried forward coverage won't be shown.
Force-pushed from 0f20c7b to 7bcb10f
Thanks for looking into the test failures
test/e2e/flowaggregator_test.go
Outdated
// Check the source port along with source and destination IPs as there
// are flow records for control flows during iperf traffics with same IPs
// and destination port.
if strings.Contains(record, srcIP) && strings.Contains(record, dstIP) && strings.Contains(record, srcPort) {
that's the other bug right? please make sure it's mentioned in the commit message / PR description
This is needed here as well because, prior to this PR, we were accounting for incorrect records (corresponding to the control flow) and still passing the test. Therefore, we use the source port to account for only the correct records. Added this point in the PR description. Will also add a note about the bandwidth test issue in the PR description.
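For context, a minimal sketch of the source-port gating discussed here (the helper name is hypothetical; the test inlines this condition). The point is that the iperf control connection shares source/destination IPs and destination port with the data stream, so only the parsed source port disambiguates the records:

```go
import "strings"

// isDataFlowRecord reports whether a collector record belongs to the iperf
// data stream rather than the control connection. Both flows share IPs and
// the destination port, so the ephemeral source port parsed from the iperf
// output is the discriminator. (Hypothetical helper for illustration.)
func isDataFlowRecord(record, srcIP, dstIP, srcPort string) bool {
	return strings.Contains(record, srcIP) &&
		strings.Contains(record, dstIP) &&
		strings.Contains(record, srcPort)
}
```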
test/e2e/flowaggregator_test.go
Outdated
@@ -994,3 +1014,22 @@ func deletePerftestServices(t *testing.T, data *TestData) {
		}
	}
}

// getBandwidthAndSourcePort parses iperf commands output and returns bandwidth
// and source port.
maybe add that the bandwidth is returned as a slice containing 2 strings (bandwidth value and bandwidth unit)
Force-pushed from 9494ac2 to 93c588a
/test-ipv6-e2e
Kind e2e tests are often failing because of the following Antrea network policy tests:
if conn.StartTime.IsZero() {
	conn.StartTime = time.Now()
}
I feel that we can include the fix for #1417 in this PR too. It is also about start time, but the default value is different (flowStartSeconds: 2288912640, i.e. StartTime: Jul 14, 2042 03:04:00.000000000 CEST).
Done
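For context, a minimal sketch of how the guard relates to flow expiry, per the commit message (the `LastExportTime` field name is an assumption for illustration, not copied from the diff):

```go
// If conntrack reports a zero start time, fall back to the time the
// connection is added. Otherwise the last export time, which is seeded
// from the start time, makes the flow look permanently fresh and it
// never hits the idle export timeout.
if conn.StartTime.IsZero() {
	conn.StartTime = time.Now()
}
// Assumed field name: the commit message notes the last export time is
// initialized as the start time.
conn.LastExportTime = conn.StartTime
```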
test/e2e/flowaggregator_test.go
Outdated
cmd := fmt.Sprintf("kubectl -n %s logs -c antrea-agent %s", nsName, podName)
_, antreaLogs, _, err := provider.RunCommandOnNode(controlPlaneNodeName(), cmd)
t.Logf("Logs for %s: %v err: %v", podName, antreaLogs, err)
return err
})
When the test fails, we can view the antrea-agent logs from the GitHub workflow. Is this log for debugging the Jenkins issue?
This is a temporary addition for this PR to see if there are any intermittent kind CI failures and debug them. Will remove this before merging.
// getBandwidthAndSourcePort parses iperf commands output and returns bandwidth
// and source port. Bandwidth is returned as a slice containing two strings (bandwidth
// value and bandwidth unit).
func getBandwidthAndSourcePort(iperfStdout string) ([]string, string) {
maybe we can add a sample output here for elaboration. Also, if iperf doesn't send the traffic correctly, will the result be the same (srcPort returns an empty string, etc.)?
We check for the error after executing the iperf command, before parsing the stdout output. That takes care of the mentioned scenario.
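For elaboration, here is a sketch of how such a parser could work against iperf3-style client output. This is illustrative only; the PR's actual parsing may differ, and the sample lines below are assumptions about the output format:

```go
import "strings"

// getBandwidthAndSourcePort (sketch) parses iperf3-style client stdout and
// returns the reported bandwidth as a two-element slice (value, unit) plus
// the client's ephemeral source port. Lines it assumes:
//   [  5] local 10.10.0.5 port 46872 connected to 10.10.1.2 port 5201
//   [  5]   0.00-10.00  sec  1.10 GBytes   948 Mbits/sec    sender
func getBandwidthAndSourcePort(iperfStdout string) ([]string, string) {
	var bandwidth []string
	var srcPort string
	for _, line := range strings.Split(iperfStdout, "\n") {
		fields := strings.Fields(line)
		for i, f := range fields {
			// The field just before "connected" is the client source port.
			if f == "connected" && i > 0 {
				srcPort = fields[i-1]
			}
			// The summary ("sender") line carries "<value> <unit>", e.g.
			// "948 Mbits/sec"; keep value and unit as separate strings.
			if strings.HasSuffix(f, "bits/sec") && i > 0 && strings.Contains(line, "sender") {
				bandwidth = []string{fields[i-1], f}
			}
		}
	}
	return bandwidth, srcPort
}
```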
Force-pushed from 93c588a to b250918
Force-pushed from b250918 to 22a3832
Ran the kind tests on CI (more than 10 times) on this PR over the last 2-3 days. I did not see any Flow Aggregator test failures. However, there were some repeated failures as mentioned here: #2308 (comment)
test/e2e/flowaggregator_test.go
Outdated
dataRecordsCount := 0
var octetTotalCount uint64
for _, record := range recordSlices {
	if strings.Contains(record, srcIP) && strings.Contains(record, dstIP) {
		// Check the source port along with source and destination IPs as there
		// are flow records for control flows during iperf traffics with same IPs
s/iperf traffics/iperf connections ?
traffic can never be plural
It was an unintentional typo :)
Force-pushed from 22a3832 to 5cc3c5b
LGTM
@srikartati maybe you just got very unlucky but all the FlowAggregator tests failed in Jenkins:
https://jenkins.antrea-ci.rocks/job/antrea-e2e-for-pull-request/2815/consoleFull
Yeah.. I looked through the collected logs. However, this is a very different error compared to the issues that this PR is solving.
I ran the Jenkins tests, including v6 tests, multiple times over the last few days and did not hit this. I think the proper way to solve this is to add retry logic instead of failing immediately, i.e., assuming there is no issue with the Kubernetes cluster itself, which seems unlikely here as all the other tests passed. (EDIT)
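A rough sketch of that retry idea, using `wait.PollImmediate` from apimachinery (the `getCollectorOutput` helper and the record-count threshold are assumptions for illustration, not code from this PR):

```go
import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForRecords polls the collector instead of failing on the first read:
// transient errors and late-arriving records are retried until the timeout
// elapses. (Sketch only; helper name and thresholds are assumed.)
func waitForRecords(expected int) error {
	return wait.PollImmediate(500*time.Millisecond, 30*time.Second, func() (bool, error) {
		recordSlices, err := getCollectorOutput() // assumed helper
		if err != nil {
			return false, nil // treat as transient and retry
		}
		return len(recordSlices) >= expected, nil
	})
}
```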
@@ -340,6 +342,7 @@ func (exp *flowExporter) sendFlowRecords() error {
		return err
	}
	exp.numDataSetsSent = exp.numDataSetsSent + 1
	klog.V(4).InfoS("Deny connection sent successfully", "flowKey", connKey, "record", conn)
s/record/connection/?
Yes, that's correct. We need to keep it consistent with L334 but missed it here. Rephrased the log for correctness.
/test-e2e
- Flows are not getting expired as expected because of a zero start time, and the last export time is initialized as the start time. This commit fixes that bug by using the current time as the start time if it is zero when a new connection is added.
- The idle flow export timeout is not configured properly in the e2e test. There is a bug in the test and this commit fixes it.
- The expected number of records is not correct in Kind clusters, as packet counters are not supported in conntrack entries for the OVS userspace datapath. The flow exporter cannot send records after the active timeout expiry because we cannot consider flows to be active without the statistics. Therefore, we expect only 2 records for any given flow in Kind clusters and 3 flow records in other clusters. We expect this limitation to be resolved soon when the OVS userspace datapath supports statistics in conntrack entries.
- Additionally, we consider the source port from iperf flows to ignore other similar flow records from the iperf control flows. Because of this, we were passing the tests in Kind clusters even though there are only two flow records.
- The kubectl logs command with the since option does not output the logs of the ipfix collector as expected, which causes intermittent failures. Removing the since option, as we have the source port as the gating parameter when processing flow records.

Signed-off-by: Srikar Tati <[email protected]>
Force-pushed from 5cc3c5b to 03d1031
/test-all
LGTM
/test-ipv6-e2e
- Flows are not getting expired as expected because of a zero start time, and the last export time is initialized as the start time. This commit fixes that bug by using the current time as the start time if it is zero when a new connection is added. Fixes: IPFIX start and stop timestamps are wrong/too large #1417
- The idle flow export timeout is not configured properly in the e2e test. There is a bug in the test and this commit fixes it. This was introduced in PR: Fortify the bandwidth e2e test of flow aggregator #1802. Ref: https://github.com/antrea-io/antrea/blame/main/test/e2e/framework.go#L572
- The expected number of records is not correct in Kind clusters, as packet counters are not supported in conntrack entries for the OVS userspace datapath. The flow exporter cannot send records after the active timeout expiry because we cannot consider flows to be active without the statistics. Therefore, we expect only 2 records for any given flow in Kind clusters and 3 flow records in other clusters (see the sketch after this list). We expect this limitation to be resolved soon when the OVS userspace datapath supports statistics in conntrack entries. This was missed when the bandwidth tests were modified for Kind clusters in PR Fortify the bandwidth e2e test of flow aggregator #1802.
- Additionally, we consider the source port from the iperf command output to ignore other similar flow records from the iperf control flows. Because of this, we were passing the tests in Kind clusters even though there are only two flow records while the expected number is 3. In addition, the source port gating fixes flakiness in the bandwidth tests. Fixes: FlowAggregator bandwidth tests are too flaky #2283
- The kubectl logs command with the since option does not output the logs of the ipfix collector as expected, which causes intermittent failures. Removing the since option, as we have the source port as the gating parameter when processing flow records.
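A sketch of the record-count expectation from the third bullet (the function and provider-check names are assumptions for illustration, not code from this PR):

```go
// expectedRecordCount returns the number of flow records expected per flow.
// (Sketch; the provider check is an assumption.)
func expectedRecordCount(providerName string) int {
	if providerName == "kind" {
		// The OVS userspace datapath lacks conntrack packet counters, so no
		// active-timeout record is exported: only the start and end records.
		return 2
	}
	return 3
}
```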