
Java process terminates saying it could not free a specific address #51

Open
gknuppe opened this issue Oct 16, 2017 · 11 comments


@gknuppe

gknuppe commented Oct 16, 2017

When running Java with the MDS framework for a long period, the Java process terminates with the following message on the stdout of the console that launched the application:

*** Error in `/usr/bin/java': free(): invalid pointer: 0x00007fe6580016c0 ***

There is no log message about this problem.

After further investigation with gdb, we noticed that the free is called from MDS's GC code.

@lokeshgidra
Collaborator

It is happening in the local_snapshot_map::processed function in the include/core/core_globals.h header file. The erase call on the unprocessed map sometimes leads to an invalid pointer being freed.

It looks to me as if 'from' and 'to' may have to be swapped in some cases to avoid the issue.

@EvanKirshenbaum
Collaborator

It doesn't look like from and to need to be swapped. Bear in mind that the maps use std::greater<timestamp_t> as their comparators, so the iterators run from the latest timestamp to the earliest. (I did that so that p++ could be used to go to successively earlier timestamp records.)
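(As a standalone illustration, with plain int standing in for timestamp_t, of what std::greater does to the iteration order and to lower_bound()/upper_bound():)

#include <functional>
#include <iostream>
#include <map>

int main() {
  // With std::greater as the comparator, iteration runs from the largest
  // key to the smallest, so ++it moves to successively earlier keys.
  std::map<int, const char*, std::greater<int>> m{
      {30, "newest"}, {20, "middle"}, {10, "oldest"}};
  for (const auto& [ts, label] : m)
    std::cout << ts << ' ' << label << '\n';  // prints 30, 20, 10
  // The bounds are relative to that ordering:
  auto lb = m.lower_bound(20);  // first key <= 20: the {20, "middle"} entry
  auto ub = m.upper_bound(20);  // first key <  20: the {10, "oldest"} entry
  std::cout << lb->first << ' ' << ub->first << '\n';  // prints 20 10
}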

What doesn't make sense to me is the

if (ts >= later)

test on line 234. That should almost certainly be

if (ts <= later)

or perhaps

if (ts < later)

But I don't see any reason why you should ever get an invalid iterator for the erase() call. from has to be valid, as it's always first_iterator, which is the result of calling either lower_bound() or upper_bound(). And to is also valid, as it's the result of calling upper_bound(). The only question would be whether it would be possible for last_seen to come before from. The only way that could seem to happen is if the timestamp_record's chain doesn't go in descending timestamp order, which would seem to be impossible.
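To spell out the erase() constraint (a standalone sketch, again with int keys): std::map::erase(first, last) requires first to come at or before last in iteration order, which with the descending comparator means from->first >= to->first; a reversed pair is undefined behavior, and in practice glibc often reports that kind of corruption as a free(): invalid pointer abort.

#include <functional>
#include <map>

int main() {
  std::map<int, int, std::greater<int>> unprocessed{{30, 0}, {20, 0}, {10, 0}};
  auto from = unprocessed.find(30);       // first in descending iteration
  auto to = unprocessed.upper_bound(20);  // first key < 20: the {10, 0} entry
  unprocessed.erase(from, to);            // fine: removes keys 30 and 20
  // unprocessed.erase(to, from);         // would be UB: 'to' precedes 'from',
  //                                      // so [first, last) is not a valid range
}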

I can't actually debug this with my current setup. But if changing the successful-termination test doesn't fix things, I'd put some traces in the snapshot_between() and processed() functions and see what timestamps we're dealing with. Post such a trace here, and I'll take a look.

@lokeshgidra
Collaborator

lokeshgidra commented Oct 30, 2017

AFAIK, the crash always happens when the processed() function is called at line 243. So I am not sure whether the test at line 234 has any consequences.

I agree that the iterators 'from' and 'to' are both valid in the processed() function. But what if the unprocessed.erase() call never manages to reach the 'to' iterator starting from the 'from' iterator? It might be reaching the end() iterator, and dereferencing it could be causing the bug. I think that's what is happening in the cases where we get the error, because the change that @gknuppe tried seems to work.

@gknuppe, can you please post the change that you tried in the processed() function to avoid this bug?

@EvanKirshenbaum
Collaborator

First off, "avoiding" the bug is easy. Just have snapshot_between() return true. This won't affect correctness. It will simply prevent pruning of value nodes.

The test at line 234, on the other hand, will affect correctness. It's the only place where snapshot_between() returns that it found a snapshot and that, therefore, pruning should not take place. Without it, every time a new value is asserted (e.g., by roll-up) in a parent context, it will replace the prior value, even if there is an outstanding snapshot that cares about that value.
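To sketch the caller's shape (hypothetical stand-in types; the real callers are in core_typed_msv.cpp):

#include <iostream>

// Hypothetical stand-in: pretend one snapshot is outstanding at timestamp 15.
struct snapshot_map_stub {
  bool snapshot_between(int later, int earlier) const {
    return earlier <= 15 && 15 < later;
  }
};

int main() {
  snapshot_map_stub snapshots;
  int prior_ts = 10, new_ts = 20;
  // Before a newly asserted value replaces the prior one, ask whether any
  // outstanding snapshot still needs the prior value.
  if (!snapshots.snapshot_between(new_ts, prior_ts)) {
    std::cout << "no snapshot in range: prune the prior value node\n";
  } else {
    std::cout << "snapshot in range: keep the prior value node\n";  // this case
  }
  // Hard-wiring snapshot_between() to return true is therefore safe but
  // pessimistic: value nodes are never pruned.
}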

As for the scenario you describe, note that to is obtained within processed(), at line 141. So we know that to is a valid iterator, and the only way we can have a problem is if that iterator is somehow before from (i.e., from's timestamp is before the timestamp passed in).

Which, come to think of it, is exactly what's happening! Since we're not correctly short-circuiting at line 234, we're missing the timestamp in the range, and first_iterator is winding up pointing to an earlier timestamp_record or to the end() of the map. But when we call processed(), its upper_bound() call finds the timestamp_record we should have stopped on, which is before first_iterator, and so from and to wind up out of order and the erase() fails.

@EvanKirshenbaum
Collaborator

Looking at the code, there's one more change I'd like to make. These maps are thread-local, and the only thing that appears to put a timestamp_record in unprocessed is current_value_timestamp(). It's possible that my logic was that without calling current_value_timestamp(), there's no way that any timestamp_record not in unprocessed could be in the range you're looking for, but I'm not sure I believe that.

To be safe, snapshot_between(), on line 175, should probably begin with

note(*_current_timestamp);

If we've already seen it, the test at line 148 will short-circuit it with no work being done. Otherwise, we'll be sure we're up-to-date as of the call to snapshot_between().
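The pattern that makes this cheap, sketched with stand-in types (the field names below are illustrative, not the actual header):

#include <functional>
#include <map>

struct timestamp_record_stub {
  int timestamp;  // stand-in for the real timestamp_record
};

struct local_map_stub {
  int last_ts_seen = 0;
  std::map<int, const timestamp_record_stub*, std::greater<int>> unprocessed;

  // Idempotent: a timestamp we've already seen returns immediately, so
  // calling note() at the top of snapshot_between() is free in the common
  // case but guarantees the map is current when it matters.
  void note(const timestamp_record_stub* tsr) {
    if (tsr == nullptr || tsr->timestamp <= last_ts_seen) return;
    last_ts_seen = tsr->timestamp;
    unprocessed[tsr->timestamp] = tsr;
  }
};

int main() {
  local_map_stub m;
  timestamp_record_stub r{42};
  m.note(&r);  // records timestamp 42
  m.note(&r);  // already seen: short-circuits with no work done
}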

@lokeshgidra
Collaborator

So if I understand you correctly, you are saying to make the following two changes and test again:

  1. change the condition on line 234 to 'if (ts < later)'
  2. add 'note(*_current_timestamp);' at the beginning of the snapshot_between() function.

Is that right?

@EvanKirshenbaum
Collaborator

Yeah. It's possible that it should be ts <= later, but I think the logic is that we only care about snapshots that occur strictly before the value we're asserting, so less-than would appear to be right.

snapshot_between() only gets called in two places, both in core_typed_msv.cpp. At line 932, in ensure_value(), when we explicitly add a value, and at line 771, in receive_rollup(), when a value is rolled up. In both cases, the question is "Can we patch around the most recent value?"
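To see the boundary in miniature (plain ints, illustrative only): a snapshot taken at exactly the assertion timestamp is not strictly before the value being asserted, so less-than excludes it, while less-than-or-equal would keep the prior value alive for it.

#include <iostream>

int main() {
  int later = 100;  // timestamp of the value being asserted
  for (int ts : {99, 100, 101}) {
    // ts < later: only snapshots strictly earlier than the assertion
    // protect the prior value.
    std::cout << "snapshot at " << ts
              << (ts < later ? " protects the prior value\n"
                             : " does not\n");
  }
}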

@lokeshgidra
Collaborator

OK. @gknuppe is going to get these two changes tested and then report back here. Thanks.

@gknuppe
Author

gknuppe commented Oct 31, 2017

OK, I'm testing. I had to move some extern declarations up in order for the "_current_timestamp" variable to be recognized.
So far so good, but since it's pretty hard to reproduce (it takes more than 2 hours), we'll have to wait about a day.
I'm mainly concerned about issue #53: it is more often reproducible, while for this one the "workaround" seems to be working.
BTW, here is the workaround patch @lokeshgidra requested:

diff --git a/include/core/core_globals.h b/include/core/core_globals.h
index d1475a3..6ec063f 100644
--- a/include/core/core_globals.h
+++ b/include/core/core_globals.h
@@ -139,11 +139,7 @@ namespace mds
       std::map<timestamp_t, external_gc_ptr<timestamp_record>, std::greater<timestamp_t>> unprocessed;
       void processed(const auto &from, timestamp_t last_seen) {
         auto to = unprocessed.upper_bound(last_seen);
-        if (to == unprocessed.end() || from->first >= to->first) {
-          unprocessed.erase(from, to);
-        } else {
-          unprocessed.erase(to, from);
-        }
+        unprocessed.erase(from, to);
       }
     public:
       void note(gc_ptr<timestamp_record> tsr) {

@EvanKirshenbaum
Collaborator

That "workaround" will certainly make this crash disappear, but it does it by masking the real problem and erasing the wrong elements from the map. I would recommend removing it. The fix I recommended should (I hope) solve the problem, but if it doesn't the workaround will prevent you from seeing it.

@gknuppe
Author

gknuppe commented Nov 1, 2017

Problem reproduced with the proposed fix. Attached is the backtrace:
backtrace.txt

Here is the patch I used:

diff --git a/include/core/core_globals.h b/include/core/core_globals.h
index 6ec063f..24a9f35 100644
--- a/include/core/core_globals.h
+++ b/include/core/core_globals.h
@@ -129,6 +129,15 @@ namespace mds
       }
     };

+    //    extern external_gc_ptr<control> control_block;
+    extern std::atomic<gc_ptr<timestamp_record>> *_current_timestamp;
+    extern external_gc_ptr<view> top_level_view;
+    extern external_gc_ptr<iso_context> global_context;
+    extern external_gc_ptr<string_table_t> string_table;
+    extern external_gc_ptr<record_type_table_t> record_type_table;
+    extern external_gc_ptr<name_space> global_namespace;
+    extern std::size_t *next_task_number;
+
     class local_snapshot_map {
       timestamp_t last_ts_seen = 0;
       timestamp_t last_snapshot_seen = 0;
@@ -173,6 +182,7 @@ namespace mds


       bool snapshot_between(timestamp_t later, timestamp_t earlier) {
+       note(*_current_timestamp);
         /*
          * We first loop from the first snapshot time we know about
          * that could be the one we're looking for.
@@ -231,7 +241,7 @@ namespace mds
                 return false;
               }
               snapshots[ts] = tsr->snapshot;
-              if (ts >= later) {
+              if (ts < later) {
                 /*
                  * TODO: We found one in the correct range.
                  */
@@ -247,15 +257,6 @@ namespace mds



-    //    extern external_gc_ptr<control> control_block;
-    extern std::atomic<gc_ptr<timestamp_record>> *_current_timestamp;
-    extern external_gc_ptr<view> top_level_view;
-    extern external_gc_ptr<iso_context> global_context;
-    extern external_gc_ptr<string_table_t> string_table;
-    extern external_gc_ptr<record_type_table_t> record_type_table;
-    extern external_gc_ptr<name_space> global_namespace;
-    extern std::size_t *next_task_number;
-
     inline timestamp_t current_value_timestamp() {
       const gc_ptr<timestamp_record> cts = *_current_timestamp;
       local_snapshot_map::for_thread().note(cts);
