NETOBSERV-1697: Add retry around netlinkSubscribeAt #358

Merged: 1 commit, Jun 28, 2024

msherif1234
Contributor

@msherif1234 msherif1234 commented Jun 25, 2024

Description

While a netns is being created, the associated netnsHandle was observed to keep changing for some time before becoming stable; only at that point does netlinkSubscribeAt succeed.

This PR adds a retry loop and avoids creating the netnsHandle early, deferring it until things are more stable.

Dependencies

n/a

Checklist

If you are not familiar with our processes or don't know what to answer in the list below, let us know in a comment: the maintainers will take care of that.

  • Will this change affect NetObserv / Network Observability operator? If not, you can ignore the rest of this checklist.
  • Is this PR backed with a JIRA ticket? If so, make sure it is written as a title prefix (in general, PRs affecting the NetObserv/Network Observability product should be backed with a JIRA ticket - especially if they bring user facing changes).
  • Does this PR require product documentation?
    • If so, make sure the JIRA epic is labelled with "documentation" and provides a description relevant for doc writers, such as use cases or scenarios. Any required step to activate or configure the feature should be documented there, such as new CRD knobs.
  • Does this PR require a product release notes entry?
    • If so, fill in "Release Note Text" in the JIRA.
  • Is there anything else the QE team should know before testing? E.g: configuration changes, environment setup, etc.
    • If so, make sure it is described in the JIRA ticket.
  • QE requirements (check 1 from the list):
    • Standard QE validation, with pre-merge tests unless stated otherwise.
    • Regression tests only (e.g. refactoring with no user-facing change).
    • No QE (e.g. trivial change with high reviewer's confidence, or per agreement with the QE team).

@openshift-ci-robot
Collaborator

openshift-ci-robot commented Jun 25, 2024

@msherif1234: This pull request references NETOBSERV-1697 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.17.0" version, but no target version was set.


codecov bot commented Jun 25, 2024

Codecov Report

Attention: Patch coverage is 43.47826% with 52 lines in your changes missing coverage. Please review.

Please upload report for BASE (main@fdebe3f). Learn more about missing BASE report.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #358   +/-   ##
=======================================
  Coverage        ?   33.38%           
=======================================
  Files           ?       48           
  Lines           ?     3531           
  Branches        ?        0           
=======================================
  Hits            ?     1179           
  Misses          ?     2251           
  Partials        ?      101           
Flag        Coverage Δ
unittests   33.38% <43.47%> (?)

Flags with carried forward coverage won't be shown.

Files                     Coverage Δ
pkg/ifaces/informer.go    0.00% <0.00%> (ø)
pkg/ifaces/poller.go      82.45% <67.74%> (ø)
pkg/ebpf/tracer.go        0.00% <0.00%> (ø)
pkg/ifaces/watcher.go     61.29% <51.35%> (ø)

@msherif1234 msherif1234 force-pushed the dbg_sriov branch 2 times, most recently from 45ca122 to 339b375 Compare June 25, 2024 19:35
@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 25, 2024

New image:
quay.io/netobserv/netobserv-ebpf-agent:f437640

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=f437640 make set-agent-image

Comment on lines +54 to +55
for _, n := range netns {
go w.sendUpdates(ctx, n, out)
Member

You're removing the capturing of n, which was done on purpose. Is it intentional? (I'm trying to understand whether this issue is fixed in go1.22, cf. https://go.dev/blog/loopvar-preview , but the blog pre-dates the 1.22 release so I'm not sure.)

Contributor Author

Yes, I dropped it after moving to go1.22, as I believe this behavior is fixed now.

Member

@jotak jotak left a comment

code lgtm
I believe it's now ok to stop capturing loop variables for goroutines

@openshift-ci openshift-ci bot added the lgtm label Jun 26, 2024
@openshift-ci openshift-ci bot removed the lgtm label Jun 26, 2024
@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 26, 2024
@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 26, 2024

New image:
quay.io/netobserv/netobserv-ebpf-agent:86853c3

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=86853c3 make set-agent-image

@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 26, 2024
@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 26, 2024

New image:
quay.io/netobserv/netobserv-ebpf-agent:66f70cd

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=66f70cd make set-agent-image

@msherif1234 msherif1234 changed the title from "NETOBSERV-1697: Add retry around netlinkSubscribeAt" to "WIP: NETOBSERV-1697: Add retry around netlinkSubscribeAt" on Jun 26, 2024
- func (w *Watcher) sendUpdates(ctx context.Context, netnsHandle netns.NsHandle, out chan Event) {
+ func (w *Watcher) sendUpdates(ctx context.Context, ns string, out chan Event) {
var netnsHandle netns.NsHandle
var err error
log := logrus.WithField("component", "ifaces.Watcher")
// subscribe for interface events
links := make(chan netlink.LinkUpdate)

Suggested change
- links := make(chan netlink.LinkUpdate)
+ links := make(chan netlink.LinkUpdate)
+ doneChan := make(chan struct{})

"netns": ns,
"netnsHandle": netnsHandle.String(),
}).Debug("linkSubscribe to receive links update")
if err = w.linkSubscriberAt(netnsHandle, links, ctx.Done()); err != nil {
@anfredette anfredette Jun 26, 2024

Using a new locally defined Done channel seems to fix this. Perhaps ctx.Done() is receiving a signal that terminates the subscription.

Suggested change
- if err = w.linkSubscriberAt(netnsHandle, links, ctx.Done()); err != nil {
+ if err = w.linkSubscriberAt(netnsHandle, links, doneChan); err != nil {

Contributor Author

Yes, this indeed helped with the local repro steps. Thanks for catching this.

log.WithField("netns", netnsHandle.String()).Debug("linkSubscribe to receive links update")
if err := w.linkSubscriberAt(netnsHandle, links, ctx.Done()); err != nil {
log.WithError(err).Errorf("can't subscribe to links netns %s", netnsHandle.String())
if err = wait.PollUntilContextTimeout(ctx, 10*time.Microsecond, time.Second, true, func(ctx context.Context) (done bool, err error) {

Suggested change
- if err = wait.PollUntilContextTimeout(ctx, 10*time.Microsecond, time.Second, true, func(ctx context.Context) (done bool, err error) {
+ if err = wait.PollUntilContextTimeout(ctx, 50*time.Microsecond, time.Second, true, func(ctx context.Context) (done bool, err error) {

This is taking quite a few tries on my system at 10 usec. This is your choice, but you might want to try bumping this up to avoid all the retries.

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 26, 2024

New image:
quay.io/netobserv/netobserv-ebpf-agent:ae956e1

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=ae956e1 make set-agent-image

@jotak
Member

jotak commented Jun 27, 2024

Thanks!
/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jun 27, 2024
@openshift-ci openshift-ci bot removed the lgtm label Jun 27, 2024
@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 27, 2024
@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 27, 2024

New image:
quay.io/netobserv/netobserv-ebpf-agent:d8a6aed

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=d8a6aed make set-agent-image

@memodi
Contributor

memodi commented Jun 27, 2024

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved QE has approved this pull request label Jun 27, 2024
@github-actions github-actions bot removed the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 28, 2024
@msherif1234
Contributor Author

/ok-to-test

@openshift-ci openshift-ci bot added the ok-to-test To set manually when a PR is safe to test. Triggers image build on PR. label Jun 28, 2024

New image:
quay.io/netobserv/netobserv-ebpf-agent:53fa344

It will expire after two weeks.

To deploy this build, run from the operator repo, assuming the operator is running:

USER=netobserv VERSION=53fa344 make set-agent-image

@jotak
Member

jotak commented Jun 28, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm label Jun 28, 2024
@msherif1234
Contributor Author

/approve


openshift-ci bot commented Jun 28, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msherif1234

@openshift-merge-bot openshift-merge-bot bot merged commit 58e5d37 into netobserv:main Jun 28, 2024
12 checks passed
@memodi
Contributor

memodi commented Jun 28, 2024

@msherif1234 - new commits were added after testing was completed. Are those additional fixes?

jotak pushed a commit to jotak/netobserv-agent that referenced this pull request Jul 15, 2024
jotak pushed a commit that referenced this pull request Jul 16, 2024
Labels
approved, jira/valid-reference, lgtm, ok-to-test, qe-approved
5 participants