Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NETOBSERV-1996: in-kernel de-duplication #470

Merged
merged 12 commits into from
Jan 8, 2025
Merged

Conversation

jotak
Copy link
Member

@jotak jotak commented Nov 29, 2024

Description

  • Remove interface from the flow key; instead, use it as flow value. The
    first interface+dir seen for a given flow is the one taken into account
    for counters. Other interfaces+dirs are stored in a separate map for this
    flow. This algorithm is more or less the deduper algo that we had in
    userspace.
  • Remove user-space deduper
  • Adapt user-space model for the new interfaces+directions array
    provided directly from ebpf structs
  • Remove "decorator" (which was doing the interface naming). This is
    just for simplification. This enrichment is now done in a more
    straightforward way, when creating the Record objects

(this PR is based on #466 and #469)

TODO:

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Copy link

openshift-ci bot commented Nov 29, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot
Copy link
Collaborator

openshift-ci-robot commented Nov 29, 2024

@jotak: This pull request references NETOBSERV-1996 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

Description

  • Remove interface from the flow key; instead, use it as flow value. The
    first interface+dir seen for a given flow is the one taken into account
    for counters. Other interfaces+dirs are stored in a separate map for this
    flow. This algorithm is more or less the deduper algo that we had in
    userspace.
  • Remove user-space deduper
  • Adapt user-space model for the new interfaces+directions array
    provided directly from ebpf structs
  • Remove "decorator" (which was doing the interface naming). This is
    just for simplification. This enrichment is now done in a more
    straightforward way, when creating the Record objects

(this PR is based on #466 and #469)

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
  • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
  • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
  • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
  • Standard QE validation, with pre-merge tests unless stated otherwise.
  • Regression tests only (e.g. refactoring with no user-facing change).
  • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

bpf/flows.c Show resolved Hide resolved
bpf/flows.c Outdated
return TC_ACT_OK;
record->id = id;
record->metrics = new_flow;
bpf_ringbuf_submit(record, 0);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the deduplication happening in the kernel, I would actually suggest dropping the ringbuf fallback entirely. Getting rid of it can simplify the code, especially on the userspace side. You'll have to have a well-defined way of dealing with the case where there are more flows than you can process, but sending events through the ringbuffer in this way uses more resources than the regular map-based path, so that's really not what you want to do when you are under pressure.

Now that you are deduplicating in the kernel, the map size is the maximum number of actual, logical flows that you can handle in a single update interval. And with a map size of 16M, if you overshoot that, I'm not sure there's a realistic way to recover. At some point you'll have to just throw up your hands and say, "sorry, I can't keep up".

If you do want to be a bit adaptive to load here, I would suggest having userspace vary its update interval. So if you observe that the map is close to being full (for whatever definition of "close" you like), you can lower the poll interval so it is cleared out more often. Assuming your backing data store can deal with variable-interval items, of course :)

Copy link
Member Author

@jotak jotak Nov 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, this ringbuf is clearly double-edged. But before doing so I'd like to re-investigate actual reasons for switching to ringbuf, ie. if it's purely a matter of map capacity then I agree users can simply adapt the map config or accept the drops; but if there are other reasons, without map being at full capacity, perhaps it still makes sense?

Copy link
Member Author

@jotak jotak Nov 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @msherif1234 wdyt about that?
In my stress tests, the rb:hm ratio is typically at most 0.0001 ie. one flow emitted by RB for 10,000 flows emitted by hashmap
Looking at the perf-scale tests, this is quite similar, ingress-perf shows 0.0005 and cluster-density 0.00002
So, dropping the RB would likely be unnoticed anyway....

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, if there's another reason why you may fall back to the ringbuffer, it could make sense to keep it. But with the dedup happening in the kernel, there really shouldn't be? E2BIG means the map is full, and ENOMEM means the system is under memory pressure. The one other error your comment mentions (EBUSY) can only happen when the map is concurrently accessed multiple times on the same CPU; which shouldn't happen from the TC hook.

So if you do see errors that trigger the fallback path, I would be very interested in understanding exactly why that happens :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I have that metric:
image
So, this is still "device or resource busy"

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, okay, that's really odd. Just to double check, which kernel version is this running on?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't check but it was an ocp 4.17 so rhel-9.4 if I'm correct

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could it be because of not using BPF_F_LOCK while doing the lookup&delete on user space side? I'm just reading about it, I haven't updated the user-space lookup

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I don't think it has anything to do with the use of BPF_F_LOCK (that's only for ensuring consistency when doing wholesale replacement of the map value, which I don't think you're doing at all?).

The issue is that this is basically the only place the hashtab code returns EBUSY: https://elixir.bootlin.com/linux/v6.12.1/source/kernel/bpf/hashtab.c#L164

...which only happens when two simultaneous accesses happen on the same CPU, i.e., where the BPF program is preempted by a non-maskable interrupt, and then a different BPF program accesses the same map from interrupt context. And I just don't see how this can happen when doing updates from TC this way.

It would be good to get to the bottom of this; to start with, can you try to record on which interfaces (and which traffic direction) this happens?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds like it can be any interface and direction:
image
So I have some ens5, br-ex, genev_sys_6081, 654cc34c0d2792b@if2 (ie. a pod interface)

@openshift-ci-robot
Copy link
Collaborator

openshift-ci-robot commented Nov 29, 2024

@jotak: This pull request references NETOBSERV-1996 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

Description

  • Remove interface from the flow key; instead, use it as flow value. The
    first interface+dir seen for a given flow is the one taken into account
    for counters. Other interfaces+dirs are stored in a separate map for this
    flow. This algorithm is more or less the deduper algo that we had in
    userspace.
  • Remove user-space deduper
  • Adapt user-space model for the new interfaces+directions array
    provided directly from ebpf structs
  • Remove "decorator" (which was doing the interface naming). This is
    just for simplification. This enrichment is now done in a more
    straightforward way, when creating the Record objects

(this PR is based on #466 and #469)

TODO:

  • Move DNS info out of the main map. New map of its own? Maybe split the maps even more (1 per feature)
  • Think more about map sizing (should secondary maps have same max size as main map?)

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
  • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
  • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
  • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
  • Standard QE validation, with pre-merge tests unless stated otherwise.
  • Regression tests only (e.g. refactoring with no user-facing change).
  • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Copy link

codecov bot commented Nov 29, 2024

Codecov Report

Attention: Patch coverage is 83.07692% with 11 lines in your changes missing coverage. Please review.

Project coverage is 28.55%. Comparing base (4d8f9d9) to head (9852c9b).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
pkg/tracer/tracer.go 0.00% 6 Missing ⚠️
pkg/tracer/tracer_legacy.go 0.00% 3 Missing ⚠️
pkg/exporter/ipfix.go 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #470      +/-   ##
==========================================
- Coverage   30.01%   28.55%   -1.46%     
==========================================
  Files          48       46       -2     
  Lines        4828     4703     -125     
==========================================
- Hits         1449     1343     -106     
+ Misses       3269     3254      -15     
+ Partials      110      106       -4     
Flag Coverage Δ
unittests 28.55% <83.07%> (-1.46%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
pkg/agent/agent.go 31.09% <100.00%> (-1.76%) ⬇️
pkg/agent/config.go 8.33% <ø> (ø)
pkg/agent/packets_agent.go 0.00% <ø> (ø)
pkg/decode/decode_protobuf.go 30.11% <100.00%> (-1.80%) ⬇️
pkg/ebpf/bpf_x86_bpfel.go 0.00% <ø> (ø)
pkg/flow/account.go 85.00% <100.00%> (+0.25%) ⬆️
pkg/model/flow_content.go 68.46% <100.00%> (+3.76%) ⬆️
pkg/model/record.go 85.07% <100.00%> (+5.07%) ⬆️
pkg/exporter/ipfix.go 0.00% <0.00%> (ø)
pkg/tracer/tracer_legacy.go 0.00% <0.00%> (ø)
... and 1 more

... and 2 files with indirect coverage changes

@openshift-ci-robot
Copy link
Collaborator

openshift-ci-robot commented Nov 29, 2024

@jotak: This pull request references NETOBSERV-1996 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

Description

  • Remove interface from the flow key; instead, use it as flow value. The
    first interface+dir seen for a given flow is the one taken into account
    for counters. Other interfaces+dirs are stored in a separate map for this
    flow. This algorithm is more or less the deduper algo that we had in
    userspace.
  • Remove user-space deduper
  • Adapt user-space model for the new interfaces+directions array
    provided directly from ebpf structs
  • Remove "decorator" (which was doing the interface naming). This is
    just for simplification. This enrichment is now done in a more
    straightforward way, when creating the Record objects

(this PR is based on #466 and #469)

TODO:

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
  • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
  • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
  • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
  • Standard QE validation, with pre-merge tests unless stated otherwise.
  • Regression tests only (e.g. refactoring with no user-facing change).
  • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jotak
Copy link
Member Author

jotak commented Nov 29, 2024

Perf results ~ -60% CPU and -10% memory

image

But not as good with all feats enabled

@openshift-ci-robot
Copy link
Collaborator

openshift-ci-robot commented Dec 3, 2024

@jotak: This pull request references NETOBSERV-1996 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

Description

  • Remove interface from the flow key; instead, use it as flow value. The
    first interface+dir seen for a given flow is the one taken into account
    for counters. Other interfaces+dirs are stored in a separate map for this
    flow. This algorithm is more or less the deduper algo that we had in
    userspace.
  • Remove user-space deduper
  • Adapt user-space model for the new interfaces+directions array
    provided directly from ebpf structs
  • Remove "decorator" (which was doing the interface naming). This is
    just for simplification. This enrichment is now done in a more
    straightforward way, when creating the Record objects

(this PR is based on #466 and #469)

TODO:

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
  • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
  • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
  • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
  • Standard QE validation, with pre-merge tests unless stated otherwise.
  • Regression tests only (e.g. refactoring with no user-facing change).
  • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jotak jotak added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Dec 4, 2024
Copy link

github-actions bot commented Dec 4, 2024

New image:
quay.io/netobserv/netobserv-ebpf-agent:efa8ae0

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=efa8ae0 make set-agent-image

@jotak
Copy link
Member Author

jotak commented Dec 4, 2024

Perf-scale test results: comparison, full data sheet
Comparison is done with #469
And I added a comparison with main / #466

=> -33% agent CPU, -14% agent memory (user-space); and also +10% workload throughput. Compared to main, agent shows -62% CPU and -12% memory, +13% workload throughput. Altogether, overall netobserv user-space memory is slightly improved, -5%

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Dec 9, 2024
@jotak jotak added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Dec 9, 2024
@jotak jotak requested a review from msherif1234 December 9, 2024 10:34
Copy link

github-actions bot commented Dec 9, 2024

New image:
quay.io/netobserv/netobserv-ebpf-agent:afcc868

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=afcc868 make set-agent-image

Copy link

openshift-ci bot commented Dec 12, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jotak. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Dec 12, 2024
@jotak
Copy link
Member Author

jotak commented Dec 12, 2024

/hold
something wrong happened during a rebase; it produces much higher output than it should. That's what the e2e tests are saying and it's also something I saw manually.
Earlier version (before rebases) was working fine

fixed: that was related to a model change using pointers...

@jotak
Copy link
Member Author

jotak commented Dec 12, 2024

/unhold

@jotak
Copy link
Member Author

jotak commented Dec 20, 2024

Thanks @msherif1234 @Amoghrd
I also updated the doc diagram

@jotak jotak added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Dec 20, 2024
Copy link

New image:
quay.io/netobserv/netobserv-ebpf-agent:1a3c38e

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=1a3c38e make set-agent-image

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Dec 20, 2024
@jotak jotak added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Dec 20, 2024
Copy link

New image:
quay.io/netobserv/netobserv-ebpf-agent:30aaebb

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=30aaebb make set-agent-image

@Amoghrd Amoghrd removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jan 7, 2025
@Amoghrd
Copy link

Amoghrd commented Jan 7, 2025

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jan 7, 2025
Copy link

github-actions bot commented Jan 7, 2025

New image:
quay.io/netobserv/netobserv-ebpf-agent:30aaebb

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=30aaebb make set-agent-image

jotak added 11 commits January 7, 2025 16:57
- Remove interface from the flow key; instead, use it as flow value. The
  first interface+dir seen for a given flow is the one taken into account
for counters. Other interfaces+dirs are stored in a separate map for this
flow. This algorithm is more or less the deduper algo that we had in
userspace.
- Remove user-space deduper
- Adapt user-space model for the new interfaces+directions array
  provided directly from ebpf structs
- Remove "decorator" (which was doing the interface naming). This is
  just for simplification. This enrichment is now done in a more
straightforward way, when creating the Record objects

try optimizing alignment
@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jan 7, 2025
@jotak jotak added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jan 7, 2025
Copy link

github-actions bot commented Jan 7, 2025

New image:
quay.io/netobserv/netobserv-ebpf-agent:25c15d2

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=25c15d2 make set-agent-image

Copy link

openshift-ci bot commented Jan 7, 2025

@jotak: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/qe-e2e-tests 6b89456 link false /test qe-e2e-tests

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@Amoghrd
Copy link

Amoghrd commented Jan 7, 2025

/label qe-approved

@jotak jotak merged commit 7c29f2a into netobserv:main Jan 8, 2025
7 of 9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
jira/valid-reference ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. qe-approved QE has approved this pull request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants