NETOBSERV-1697: Add retry around netlinkSubscribeAt #358
@msherif1234: This pull request references NETOBSERV-1697 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target the "4.17.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #358 +/- ##
=======================================
Coverage ? 33.38%
=======================================
Files ? 48
Lines ? 3531
Branches ? 0
=======================================
Hits ? 1179
Misses ? 2251
Partials ? 101
45ca122 to 339b375
/ok-to-test
New image: It will expire after two weeks. To deploy this build, run from the operator repo, assuming the operator is running: USER=netobserv VERSION=f437640 make set-agent-image
for _, n := range netns {
go w.sendUpdates(ctx, n, out)
You're removing the capturing of n, which was done on purpose. Is that intentional? (I'm trying to understand whether this issue is fixed in go1.22, cf. https://go.dev/blog/loopvar-preview, but the blog predates the 1.22 release, so I'm not sure.)
Yes, I dropped it after moving to go1.22, as I believe this behavior is fixed now.
code lgtm
I believe it's now ok to stop capturing loop variables for goroutines
/ok-to-test
New image: It will expire after two weeks. To deploy this build, run from the operator repo, assuming the operator is running: USER=netobserv VERSION=86853c3 make set-agent-image
/ok-to-test
New image: It will expire after two weeks. To deploy this build, run from the operator repo, assuming the operator is running: USER=netobserv VERSION=66f70cd make set-agent-image
func (w *Watcher) sendUpdates(ctx context.Context, netnsHandle netns.NsHandle, out chan Event) {
func (w *Watcher) sendUpdates(ctx context.Context, ns string, out chan Event) {
var netnsHandle netns.NsHandle
var err error
log := logrus.WithField("component", "ifaces.Watcher")
// subscribe for interface events
links := make(chan netlink.LinkUpdate)
links := make(chan netlink.LinkUpdate)
links := make(chan netlink.LinkUpdate)
doneChan := make(chan struct{})
pkg/ifaces/watcher.go
Outdated
"netns": ns,
"netnsHandle": netnsHandle.String(),
}).Debug("linkSubscribe to receive links update")
if err = w.linkSubscriberAt(netnsHandle, links, ctx.Done()); err != nil {
Using a new locally defined Done channel seems to fix this. Perhaps ctx.Done() is receiving a signal that terminates the subscription.
if err = w.linkSubscriberAt(netnsHandle, links, ctx.Done()); err != nil {
if err = w.linkSubscriberAt(netnsHandle, links, doneChan); err != nil {
Yes, this indeed helped with the local repro steps. Thanks for catching this.
pkg/ifaces/watcher.go
Outdated
log.WithField("netns", netnsHandle.String()).Debug("linkSubscribe to receive links update")
if err := w.linkSubscriberAt(netnsHandle, links, ctx.Done()); err != nil {
log.WithError(err).Errorf("can't subscribe to links netns %s", netnsHandle.String())
if err = wait.PollUntilContextTimeout(ctx, 10*time.Microsecond, time.Second, true, func(ctx context.Context) (done bool, err error) {
if err = wait.PollUntilContextTimeout(ctx, 10*time.Microsecond, time.Second, true, func(ctx context.Context) (done bool, err error) {
if err = wait.PollUntilContextTimeout(ctx, 50*time.Microsecond, time.Second, true, func(ctx context.Context) (done bool, err error) {
This is taking quite a few tries on my system at 10 usec. This is your choice, but you might want to try bumping this up to avoid all the retries.
New image: It will expire after two weeks. To deploy this build, run from the operator repo, assuming the operator is running: USER=netobserv VERSION=ae956e1 make set-agent-image
Thanks!
/ok-to-test
New image: It will expire after two weeks. To deploy this build, run from the operator repo, assuming the operator is running: USER=netobserv VERSION=d8a6aed make set-agent-image
/label qe-approved
Signed-off-by: Mohamed Mahmoud <[email protected]>
/ok-to-test
New image: It will expire after two weeks. To deploy this build, run from the operator repo, assuming the operator is running: USER=netobserv VERSION=53fa344 make set-agent-image
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: msherif1234 The full list of commands accepted by this bot can be found here. The pull request process is described here
@msherif1234 - new commits were added after the testing was completed. Are those additional fixes?
Description
While a netns is getting created, it was noticed that the associated netnsHandle keeps changing for some time and then becomes stable; only at that point will netlinkSubscribeAt succeed. This PR therefore adds a retry loop and avoids creating the netnsHandle early, deferring it until things are more stable.
Dependencies
n/a