-
Notifications
You must be signed in to change notification settings - Fork 35
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NETOBSERV-1996: in-kernel de-duplication #470
Conversation
Skipping CI for Draft Pull Request. |
@jotak: This pull request references NETOBSERV-1996 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
bpf/flows.c
Outdated
return TC_ACT_OK; | ||
record->id = id; | ||
record->metrics = new_flow; | ||
bpf_ringbuf_submit(record, 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the deduplication happening in the kernel, I would actually suggest dropping the ringbuf fallback entirely. Getting rid of it can simplify the code, especially on the userspace side. You'll have to have a well-defined way of dealing with the case where there are more flows than you can process, but sending events through the ringbuffer in this way uses more resources than the regular map-based path, so that's really not what you want to do when you are under pressure.
Now that you are deduplicating in the kernel, the map size is the maximum number of actual, logical flows that you can handle in a single update interval. And with a map size of 16M, if you overshoot that, I'm not sure there's a realistic way to recover. At some point you'll have to just throw up your hands and say, "sorry, I can't keep up".
If you do want to be a bit adaptive to load here, I would suggest having userspace vary its update interval. So if you observe that the map is close to being full (for whatever definition of "close" you like), you can lower the poll interval so it is cleared out more often. Assuming your backing data store can deal with variable-interval items, of course :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point, this ringbuf is clearly double-edged. But before doing so I'd like to re-investigate actual reasons for switching to ringbuf, ie. if it's purely a matter of map capacity then I agree users can simply adapt the map config or accept the drops; but if there are other reasons, without map being at full capacity, perhaps it still makes sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @msherif1234 wdyt about that?
In my stress tests, the rb:hm ratio is typically at most 0.0001 ie. one flow emitted by RB for 10,000 flows emitted by hashmap
Looking at the perf-scale tests, this is quite similar, ingress-perf shows 0.0005 and cluster-density 0.00002
So, dropping the RB would likely be unnoticed anyway....
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, if there's another reason why you may fall back to the ringbuffer, it could make sense to keep it. But with the dedup happening in the kernel, there really shouldn't be? E2BIG means the map is full, and ENOMEM means the system is under memory pressure. The one other error your comment mentions (EBUSY) can only happen when the map is concurrently accessed multiple times on the same CPU; which shouldn't happen from the TC hook.
So if you do see errors that trigger the fallback path, I would be very interested in understanding exactly why that happens :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, okay, that's really odd. Just to double check, which kernel version is this running on?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't check but it was an ocp 4.17 so rhel-9.4 if I'm correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could it be because of not using BPF_F_LOCK
while doing the lookup&delete on user space side? I'm just reading about it, I haven't updated the user-space lookup
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I don't think it has anything to do with the use of BPF_F_LOCK (that's only for ensuring consistency when doing wholesale replacement of the map value, which I don't think you're doing at all?).
The issue is that this is basically the only place the hashtab code returns EBUSY: https://elixir.bootlin.com/linux/v6.12.1/source/kernel/bpf/hashtab.c#L164
...which only happens when two simultaneous accesses happen on the same CPU, i.e., where the BPF program is preempted by a non-maskable interrupt, and then a different BPF program accesses the same map from interrupt context. And I just don't see how this can happen when doing updates from TC this way.
It would be good to get to the bottom of this; to start with, can you try to record on which interfaces (and which traffic direction) this happens?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jotak: This pull request references NETOBSERV-1996 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #470 +/- ##
==========================================
- Coverage 30.01% 28.55% -1.46%
==========================================
Files 48 46 -2
Lines 4828 4703 -125
==========================================
- Hits 1449 1343 -106
+ Misses 3269 3254 -15
+ Partials 110 106 -4
Flags with carried forward coverage won't be shown. Click here to find out more.
|
@jotak: This pull request references NETOBSERV-1996 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Perf results ~ -60% CPU and -10% memory But not as good with all feats enabled |
@jotak: This pull request references NETOBSERV-1996 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
New image: It will expire after two weeks. To deploy this build, run from the operator repo, assuming the operator is running: USER=netobserv VERSION=efa8ae0 make set-agent-image |
Perf-scale test results: comparison, full data sheet => -33% agent CPU, -14% agent memory (user-space); and also +10% workload throughput. Compared to main, agent shows -62% CPU and -12% memory, +13% workload throughput. Altogether, overall netobserv user-space memory is slightly improved, -5% |
New image: It will expire after two weeks. To deploy this build, run from the operator repo, assuming the operator is running: USER=netobserv VERSION=afcc868 make set-agent-image |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/hold fixed: that was related to a model change using pointers... |
/unhold |
Thanks @msherif1234 @Amoghrd |
New image: It will expire after two weeks. To deploy this build, run from the operator repo, assuming the operator is running: USER=netobserv VERSION=1a3c38e make set-agent-image |
New image: It will expire after two weeks. To deploy this build, run from the operator repo, assuming the operator is running: USER=netobserv VERSION=30aaebb make set-agent-image |
…features Signed-off-by: Mohamed Mahmoud <[email protected]>
/ok-to-test |
New image: It will expire after two weeks. To deploy this build, run from the operator repo, assuming the operator is running: USER=netobserv VERSION=30aaebb make set-agent-image |
- Remove interface from the flow key; instead, use it as flow value. The first interface+dir seen for a given flow is the one taken into account for counters. Other interfaces+dirs are stored in a separate map for this flow. This algorithm is more or less the deduper algo that we had in userspace. - Remove user-space deduper - Adapt user-space model for the new interfaces+directions array provided directly from ebpf structs - Remove "decorator" (which was doing the interface naming). This is just for simplification. This enrichment is now done in a more straightforward way, when creating the Record objects try optimizing alignment
New image: It will expire after two weeks. To deploy this build, run from the operator repo, assuming the operator is running: USER=netobserv VERSION=25c15d2 make set-agent-image |
@jotak: The following test failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
/label qe-approved |
Description
first interface+dir seen for a given flow is the one taken into account
for counters. Other interfaces+dirs are stored in a separate map for this
flow. This algorithm is more or less the deduper algo that we had in
userspace.
provided directly from ebpf structs
just for simplification. This enrichment is now done in a more
straightforward way, when creating the Record objects
(this PR is based on #466 and #469)
TODO:
Dependencies
n/a
Checklist
If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.