-
Notifications
You must be signed in to change notification settings - Fork 388
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add throughput calculation into flow-aggregator #2692
Conversation
8c01906
to
568198e
Compare
Codecov Report
@@ Coverage Diff @@
## main #2692 +/- ##
==========================================
- Coverage 60.85% 51.39% -9.46%
==========================================
Files 292 462 +170
Lines 24766 53579 +28813
==========================================
+ Hits 15072 27539 +12467
- Misses 8050 23706 +15656
- Partials 1644 2334 +690
Flags with carried forward coverage won't be shown. Click here to find out more.
|
568198e
to
da48bcd
Compare
da48bcd
to
c1eaec6
Compare
431d44e
to
17eace0
Compare
09b4e17
to
86c414e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you clarify why we get a more accurate throughput calculation by performing it at the aggregator? It seems to me that the formula you give in vmware/go-ipfix#270 (diff(octetTotalCount) / diff(flowEndSeconds)
) could also be computed at the collector.
145: | ||
- :uint64 | ||
- :throughput | ||
146: | ||
- :uint64 | ||
- :reverseThroughput | ||
147: | ||
- :uint64 | ||
- :throughputFromSourceNode | ||
148: | ||
- :uint64 | ||
- :throughputFromDestinationNode | ||
149: | ||
- :uint64 | ||
- :reverseThroughputFromSourceNode | ||
150: | ||
- :uint64 | ||
- :reverseThroughputFromDestinationNode | ||
151: | ||
- :uint32 | ||
- :flowEndSecondsFromSourceNode | ||
152: | ||
- :uint32 | ||
- :flowEndSecondsFromDestinationNode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't remember us having all these *SourceNode
and *DestinationNodes
IE variations for octet and packet counts. I wonder if this is really necessary. For octant and packet counts, do we actually use all these field variations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we have all these IE variations. Currently the only use case is we are using octetDeltaCountFromSourceNode/DestinationNode
to compute node throughput in logstash. I agree these IE variations can be removed if we decide to add throughput information into the exported records.
Hi @antoninbas , about why we want to do the throughput calculation in aggregator, I re-summarized the intuitions by a little bit, hope it will make a little bit more sense: *Node specific fields: *SourceNode and *DestinationNodes IE variations for octet and packet counts
|
@heanlan My concern is the growing size of the flow records and potentially the increasing CPU & memory consumption in the Flow Aggregator that goes with it. If you think this isn't an issue, we can add these fields. I originally thought throughput computation at the collector or collectors wouldn't be too much of a problem. Or that we would use a Kafka transformer to add throughput to the data once. |
With IPFIX encoding we convert data into bytes, and Kai and Srikar suggested we would use protobuf later for better efficiency. So this would not be a big issue? |
@antoninbas Do you think we can include this PR into v1.5 release? |
@heanlan Antrea v1.5 is scheduled for December 22nd, so I don't see why not. This is not a very large PR. |
E2E test will fail as the ipfix-collector image is not available yet in registry. Will re-run when it's available. |
docs/network-flow-visibility.md
Outdated
|-------------------------------------------|----------|-------------|-------------| | ||
| packetTotalCountFromSourceNode | 120 | unsigned64 | The cumulative number of packets from source to destination for this flow since the flow starts, based on the records sent from the source node. | | ||
| octetTotalCountFromSourceNode | 121 | unsigned64 | The cumulative number of octets from source to destination for this flow since the flow starts, based on the records sent from the source node. | | ||
| packetDeltaCountFromSourceNode | 122 | unsigned64 | The number of packets from source to destination since the previous report for this flow at the observation point, based on the records sent from the source node. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would simplify a bit the phrasing: "The number of packets for this flow as reported by the source Node, since the previous report for this flow at the observation point"
let me know if that works
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, Antonin. I simplified it accordingly.
897c729
to
14ce2ad
Compare
14ce2ad
to
1ed4c16
Compare
Image ipfix-collector:v0.5.11 still not available in the registry. Will ask help from the team tomorrow to fix it up. |
/test-all |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, a typo
docs/network-flow-visibility.md
Outdated
| egressNetworkPolicyRuleName | 142 | string | Name of the egress network policy rule applied to the source Pod for this flow. | | ||
| ingressNetworkPolicyRuleAction | 139 | unsigned8 | 1 stands for Allow. 2 stands for Drop. 3 stands for Reject. | | ||
| egressNetworkPolicyRuleAction | 140 | unsigned8 | | | ||
| tcpState | 136 | string | The state of the TCP connection. The states are: LISTEN, SYN-SENT, SYNRECEIVED, ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT, and CLOSED. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| tcpState | 136 | string | The state of the TCP connection. The states are: LISTEN, SYN-SENT, SYNRECEIVED, ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT, and CLOSED. | | |
| tcpState | 136 | string | The state of the TCP connection. The states are: LISTEN, SYN-SENT, SYN-RECEIVED, ESTABLISHED, FIN-WAIT-1, FIN-WAIT-2, CLOSE-WAIT, CLOSING, LAST-ACK, TIME-WAIT, and CLOSED. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks Quan, updated
Signed-off-by: heanlan <[email protected]>
1ed4c16
to
8089b7e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@heanlan Some aggregator tests always failed in my local testbed because the following errors:
I ran same tests for other PRs, they can pass most of times. So I'm not sure if it's related to this PR. |
Thanks Quan. I haven't seen this error before in my local vagrant env. And I triggered another run last night, it also passed:
It seems the e2e test was not running with the CI fix-patch. Does the error only happen in LocalServiceAccess? We might need to dump more log in order to triage the fail run in your local env. e.g. Add a line in L578 in flowaggregator_test: t.Log(record) to see the received IPFIX records. The throughput field can be 0 when all the received records having the same timestamp. |
/test-e2e |
@heanlan This is the previous e2e log I collected: e2e.log I reran the test with the line added: e2e-2.log, and the antrea component logs: antrea-test-506706938.tar.gz |
/test-e2e |
Thanks Quan.
hanlan/flow-aggregator:latest
|
/test-e2e |
1 similar comment
/test-e2e |
@heanlan thanks for pointing out the issue in my local testbed and sorry for forgetting to rebuild flow-aggregator image. I have reran the test and it passed. |
…ntrea-io#2692) Signed-off-by: heanlan <[email protected]>
Flow-visibility solution is moving away from elk flow collector, we are moving throughput calculation from logstash into flow-aggregator: vmware/go-ipfix#270
This PR makes the necessary changes on Antrea side corresponding to the changes in the above PR. More description can be found on the PR description of the above Go-ipfix PR.
fixes: #2211