Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash in (S,G) state when neighbor is lost #22

Closed
YileKu opened this issue Oct 2, 2013 · 25 comments
Closed

Crash in (S,G) state when neighbor is lost #22

YileKu opened this issue Oct 2, 2013 · 25 comments
Milestone

Comments

@YileKu
Copy link

YileKu commented Oct 2, 2013

I had to add the following code to the bottom of delete_pim_nbr():

 for (cand_rp_ptr = cand_rp_list; cand_rp_ptr != (cand_rp_t *)NULL;
     cand_rp_ptr = cand_rp_ptr->next) {
    for (rp_grp_entry_ptr = cand_rp_ptr->rp_grp_next;
        rp_grp_entry_ptr != (rp_grp_entry_t *)NULL;
        rp_grp_entry_ptr = rp_grp_entry_ptr->rp_grp_next)
        {
       for (grpentry_ptr = rp_grp_entry_ptr->grplink;
           grpentry_ptr != (grpentry_t *)NULL;
           grpentry_ptr = grpentry_ptr->next) {
            mrtentry_srcs = grpentry_ptr->grp_route;
            if (mrtentry_srcs != NULL ) {
              if(mrtentry_srcs->upstream != NULL) {
                  if (mrtentry_srcs->upstream == nbr_delete) {
                     mrtentry_srcs->upstream = NULL;
                  }
              }
         }
    }
}
}

age_routes was crashing because upstream had been freed, then someone else (timer_setTimer) was allocating the freed memory and clobbering the bjpm pointer.
Y-

@troglobit
Copy link
Owner

Thank you for taking the time to report this issue!

However, I cannot read the above code and would rather appreciate
a pull request, using the GitHub facilities, since that:

  1. Gives me an indication that you've actually tested the patch you propose
  2. Eases my work load with respect to:
    • Auditing the change
    • Lowers the effort I have to make to merge it to master

I look forward to your pull request! :)

@YileKu
Copy link
Author

YileKu commented Oct 2, 2013

I am not setup with git. Our development environment has lots of restrictions. I can discuss the bug with you.

Y

Sent from my iPhone

On Oct 2, 2013, at 12:38 PM, Joachim Nilsson [email protected] wrote:

Thank you for taking the time to report this issue!

However, I cannot read the above code and would rather appreciate
a pull request, using the GitHub facilities, since that:

Gives me an indication that you've actually tested the patch you propose
Eases my work load with respect to:
Auditing the change
Lowers the effort I have to make to merge it to master
I look forward to your pull request! :)


Reply to this email directly or view it on GitHub.

@troglobit
Copy link
Owner

I can fully understand that, I too have restrictions from my employer,
this is very common in our line of work.

Nevertheless, it is common practice in the Open Source world to at
least provide a proper patch, often in unidiff format, against the last
release of a project. Something like this:

user@host:~/Projects $ diff -u pimd-2.1.8/ pimd-2.1.8-fixes > pimd-lost-nbr.patch

This eases the work load of the maintainer (me) and provides a very
clear image of what the change does and in what context.

It also speeds up the process of getting changes into the mainline.

Thanks
/Joachim

@troglobit
Copy link
Owner

Looking at the delete_pim_nbr() function it seems that there
is code in there for what you want to achieve already.

Have you made any analysis of why that code does not run?

(Also, that function looks terrible and is in dire need of a bit
of a refactor and cleanup/simplification.)

@YileKu
Copy link
Author

YileKu commented Oct 3, 2013

It doesn't check for upstream pointers in the mcast grp entries so later age routes crashes.

Sent from my iPhone

On Oct 3, 2013, at 2:44 AM, Joachim Nilsson [email protected] wrote:

Looking at the delete_pim_nbr() function it seems that there
is code in there for what you want to achieve already.

Have you made any analysis of why that code does not run?

(Also, that function looks terrible and is in dire need of a bit
of a refactor and cleanup/simplification.)


Reply to this email directly or view it on GitHub.

@YileKu
Copy link
Author

YileKu commented Oct 6, 2013

We are using pimd in an RF environment where network connections come and
go often and can be half-duplex (receiving packets from another server but
not being able to send them). age_routes crashes when it references an
upstream pointer that has previously been freed and then reallocated by
timer_setTimer. The bpjm pointer is then bad (like the value 3 or 4) and
then pimd page faults.

On Thu, Oct 3, 2013 at 12:53 PM, Yile Ku [email protected] wrote:

It doesn't check for upstream pointers in the mcast grp entries so later
age routes crashes.

Sent from my iPhone

On Oct 3, 2013, at 2:44 AM, Joachim Nilsson [email protected]
wrote:

Looking at the delete_pim_nbr() function it seems that there
is code in there for what you want to achieve already.

Have you made any analysis of why that code does not run?

(Also, that function looks terrible and is in dire need of a bit
of a refactor and cleanup/simplification.)


Reply to this email directly or view it on GitHubhttps://github.com//issues/22#issuecomment-25606741
.

troglobit added a commit that referenced this issue Dec 26, 2013
@troglobit
Copy link
Owner

I believe the fix in 2619b2a fixes the actual root cause for the bug you reported, and also affects
other aspects of the correct operation of pimd. If you have the opportunity to test this I'd be most
grateful.

@YileKu
Copy link
Author

YileKu commented Jan 6, 2014

Joachim,

I am a bit confused due to the differences in the source code. We have
pimd-2.1.8 and the source code at the patch location is listed below which
is different from your patch variable names.

Y-
/* Update the group entries for this RP /
for (rp_grp_entry_ptr = cand_rp_ptr->rp_grp_next;
rp_grp_entry_ptr != (rp_grp_entry_t *)NULL;
rp_grp_entry_ptr = rp_grp_entry_ptr->rp_grp_next) {
for (grpentry_ptr = rp_grp_entry_ptr->grplink;
grpentry_ptr != (grpentry_t *)NULL;
grpentry_ptr = grpentry_ptr->rpnext) {
mrtentry_ptr = grpentry_ptr->grp_route;
if (mrtentry_ptr != (mrtentry_t *)NULL) {
mrtentry_ptr->upstream = rpentry_ptr->upstream;
mrtentry_ptr->metric = rpentry_ptr->metric;
mrtentry_ptr->preference = rpentry_ptr->preference;
change_interfaces(mrtentry_ptr,
rpentry_ptr->incoming,
mrtentry_ptr->joined_oifs,
mrtentry_ptr->pruned_oifs,
mrtentry_ptr->leaves,
mrtentry_ptr->asserted_oifs, 0);
}
/
Update only the (S,G)RPbit entries for this group */
for (mrtentry_srcs = grpentry_ptr->mrtlink;
mrtentry_srcs != (mrtentry_t *)NULL;
mrtentry_srcs = mrtentry_srcs->grpnext) {
if (mrtentry_srcs->flags & MRTF_RP) {
mrtentry_ptr->upstream = rpentry_ptr->upstream;
mrtentry_ptr->metric = rpentry_ptr->metric;
mrtentry_ptr->preference = rpentry_ptr->preference;
change_interfaces(mrtentry_srcs,
rpentry_ptr->incoming,
mrtentry_srcs->joined_oifs,
mrtentry_srcs->pruned_oifs,
mrtentry_srcs->leaves,
mrtentry_srcs->asserted_oifs, 0);
}
}
}
}
}

On Thu, Dec 26, 2013 at 4:09 PM, Joachim Nilsson
[email protected]:

I believe the fix in 2619b2ahttps://github.com/troglobit/pimd/commit/2619b2ais fixes the actual root cause for the bug you reported, and also affects
other aspects of the correct operation of pimd. If you have the
opportunity to test this I'd be most
grateful.


Reply to this email directly or view it on GitHubhttps://github.com//issues/22#issuecomment-31239697
.

@troglobit
Copy link
Owner

Yes, I cleaned it up and renamed the variable names to make the code easier
to read. That's when I found what I believe to be the actual bug:

In the above snippet you posted, ask yourself why the last for-loop iterates using
the variable mrtentry_srcs, but then changes the properties of the mrtentry_ptr ...

For every lap in that last for-loop the same mrtentry_ptr in the outer for-loop
will have its upstream, metric and preference values altered. That not just seems
unlikely but is very likely a cut-and-paste error by the original author and the root
cause of your bug report.

@idismmxiv
Copy link
Contributor

I entered into same problem as YileKu at some point and tried both fixes. First simpler one (2619b2a) but it did not help. Then the one YileKu provided and that helped. Problem seems to occur if pimd is already in SG state and then neighbor is lost.

@troglobit troglobit changed the title pimd crashes when nbr goes away. Crash in (S,G) state when neighbor is lost Oct 6, 2014
troglobit added a commit that referenced this issue Oct 6, 2014
@troglobit
Copy link
Owner

Thank you for the report, @idismmxiv, that is very interesting! Since I've not been able to fully
reproduce this, could you perhaps enlighten double check that f522502, from the issue22-branch
solves the issue? If it does then I can merge it to master before release.

@idismmxiv
Copy link
Contributor

Patch looks correct one. I will test it at some point but unfortunately can not give you any schedule when. Still using old repo and cherry picking patches, so I have to find some time to check this one.

@troglobit
Copy link
Owner

No rush, I'm just grateful to have someone verify the fix since I've not been able to reproduce it myself.

@idismmxiv
Copy link
Contributor

Haven't been able to test with the actual patch troglobit made, as I am still working with old version of pimd. But same changes anyway as in your patch. And after running some more test cases shows still issues. Not so often than earlier, but ... After neigbor is lost pimd crashes and following stack trace is found from core:
#0 0x0805beec in add_jp_entry ()
#1 0x0805739e in age_routes ()
#2 0x0804e670 in timer ()
#3 0x080545d1 in age_callout_queue ()
#4 0x0804ee0f in main ()
Things are moving slowly forward, so will take some time before I can look into this properly.

@troglobit
Copy link
Owner

There have been more fixes to the tree that might be what you run into. It's impossible to know since you write "old version" without even specifying what version that is exactly. I do appreciate you testing this, but it's unfortunately almost useless if we do not share the same reference point ...

troglobit added a commit that referenced this issue Oct 26, 2014
@troglobit
Copy link
Owner

Merged to master, with a few other NULL pointer fixes as well ... this could be a good time to check the current master.

@idismmxiv
Copy link
Contributor

Looks like version 2.2.0 works just fine. Could not reproduce this error any more.

@troglobit
Copy link
Owner

@idismmxiv That is truly great news, then we can close this, thank you! :-)

@mfspeer
Copy link

mfspeer commented Mar 18, 2015

See issue #37

@troglobit
Copy link
Owner

Reopen due to #37.

@troglobit troglobit reopened this Mar 21, 2015
@mfspeer
Copy link

mfspeer commented Mar 23, 2015

In stop_vif(), I patched this:

if (!(v->uv_flags & VIFF_REGISTER)) {
    RESET_TIMER(v->uv_pim_hello_timer);
    RESET_TIMER(v->uv_jp_timer);
    RESET_TIMER(v->uv_gq_timer);

    for (n = v->uv_pim_neighbors; n != NULL; n = next) {
        next = n->next;     /* Free the space for each neighbour */
#if defined(PLURIBUS_BUG_FIX)
        delete_pim_nbr(n);
#else
        free(n);
#endif
    }
    v->uv_pim_neighbors = NULL;
 }

and in delete_pim_nbr(), I added this as well:

if (!v->uv_pim_neighbors) {
    /* This was our last neighbor. */
    v->uv_flags &= ~VIFF_PIM_NBR;
    v->uv_flags |= (VIFF_NONBRS | VIFF_DR);
#if defined(PLURIBUS_BUG_FIX)
    RESET_TIMER(v->uv_jp_timer);
#endif
 } else {
    if (ntohl(v->uv_lcl_addr) > ntohl(v->uv_pim_neighbors->address))
        /* The first address is the new potential remote
         * DR address, but the local address is the winner.
         */
        v->uv_flags |= VIFF_DR;
 }

This for the moment seems to have calmed the storm.

@mfspeer
Copy link

mfspeer commented Mar 23, 2015

I take it back, the change to delete_pim_nbr() is not required.

@mfspeer
Copy link

mfspeer commented Apr 1, 2015

Any update on suggested fix?

@troglobit
Copy link
Owner

Interesting, so all you changed was to replace free(n) with delete_pim_nbr(n)?
I could probably go with that.

@troglobit troglobit added this to the 2.2.1 milestone Apr 13, 2015
troglobit added a commit that referenced this issue Apr 20, 2015
Followup to 3aa829c, according to proposal in (the reopened) issue #22.

Signed-off-by: Joachim Nilsson <[email protected]>
@troglobit
Copy link
Owner

I sincerely hope 69a5e34 fixes this bug, once and for all!

I'm now closing this issue as fixed in v2.2.1, but if it creeps up again, please reopen. Cheers, and thanks for all the help debugging and coming up with this fix, @mfspeer :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants