Data leak with goleveldb backend? #374

Open
mmindenhall opened this issue May 18, 2016 · 9 comments · May be fixed by #1314

@mmindenhall
Contributor

The application I'm working on runs on IoT gateways with limited resources (~100MB storage free). Therefore, I submitted #373 to be able to proactively reclaim disk space after deleting "expired" documents from indexes.
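
For context, compaction with goleveldb boils down to asking the DB to compact its full key range. A minimal sketch of what such a Compact method can look like (an illustration, not necessarily the exact code in #373; the Store type here is a stand-in for the goleveldb KVStore wrapper):

```go
package goleveldbstore // illustrative package, not the actual one

import (
	"github.com/syndtr/goleveldb/leveldb"
	"github.com/syndtr/goleveldb/leveldb/util"
)

// Store is a minimal stand-in for the goleveldb-backed KVStore wrapper;
// the real struct has more fields.
type Store struct {
	db *leveldb.DB
}

// Compact asks goleveldb to compact the entire key range, which rewrites
// the SSTables, drops tombstoned entries, and reclaims disk space.
func (s *Store) Compact() error {
	// A zero-value util.Range means "from the first key through the last".
	return s.db.CompactRange(util.Range{})
}
```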

I wrote a test where I do the following:

  1. Create and initialize a new index, and take a snapshot of the size of the index folder.
  2. Do the following 5 times (a rough sketch of this loop appears below the list):
     1. Add 1000 documents to the index.
     2. Take a snapshot of the size of the index folder.
     3. Delete all documents from the index.
     4. Call the new Compact method I added in #373 (Add compact method to goleveldb store).
     5. Take a snapshot of the size of the index folder.
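
A rough sketch of that test loop, for reference (the dirSize helper, document contents, and compactIndex hook are assumptions made for illustration; the real test used the goleveldb backend, and store selection is omitted here):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"

	"github.com/blevesearch/bleve"
)

// dirSize totals the file sizes under path -- the "snapshot" step.
func dirSize(path string) (int64, error) {
	var total int64
	err := filepath.Walk(path, func(_ string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if !info.IsDir() {
			total += info.Size()
		}
		return nil
	})
	return total, err
}

// compactIndex is a placeholder for invoking the store-level Compact added
// in #373; bleve's public API does not expose it directly.
func compactIndex(idx bleve.Index) {
	// reach into the underlying KV store and call its Compact method
}

func main() {
	const indexPath = "test.bleve"
	idx, err := bleve.New(indexPath, bleve.NewIndexMapping())
	if err != nil {
		panic(err)
	}
	defer idx.Close()

	start, _ := dirSize(indexPath)
	fmt.Printf("size at start: %d bytes\n", start)

	for iter := 1; iter <= 5; iter++ {
		for i := 0; i < 1000; i++ {
			id := fmt.Sprintf("doc-%d-%d", iter, i)
			if err := idx.Index(id, map[string]interface{}{"seq": i, "id": id}); err != nil {
				panic(err)
			}
		}
		added, _ := dirSize(indexPath)
		fmt.Printf("iteration %d after adding 1K docs: %d bytes\n", iter, added)

		for i := 0; i < 1000; i++ {
			_ = idx.Delete(fmt.Sprintf("doc-%d-%d", iter, i))
		}
		compactIndex(idx)
		compacted, _ := dirSize(indexPath)
		fmt.Printf("iteration %d after delete/compact: %d bytes\n", iter, compacted)
	}
}
```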

Here are the sizes of the index folder:

Size at start: 52K

| Iteration | After adding 1K docs | After delete / compact |
|-----------|----------------------|------------------------|
| 1 | 6.7M | 876K |
| 2 | 7.4M | 1.6M |
| 3 | 8.1M | 2.2M |
| 4 | 8.7M | 2.9M |
| 5 | 9.4M | 3.5M |

Just to be sure the Compact method was actually doing something, I commented out that line and ran again (create 1000, delete 1000, no compact). Without the call to Compact, deleting the 1000 documents actually increases the size of the index:

| Iteration | After adding 1K docs | After delete only |
|-----------|----------------------|-------------------|
| 1 | 6.6M | 8.5M |
| 2 | 7.4M | 9.3M |
| 3 | 8.0M | 9.9M |
| 4 | 8.6M | 11M |
| 5 | 9.4M | 11M |

Is this a bug? I'm wondering whether there is some set of keys created when a document is indexed that is not getting deleted when the document is removed.

@mschoch
Contributor

mschoch commented May 18, 2016

Can you run the bleve_dump utility to see what data (if any) is still there?

If bleve_dump doesn't show anything, you can use the lower-level 'leveldbutil' that comes with leveldb to dump just the contents of the leveldb file.

Running those after one of the iterations should give some insight.

@mschoch
Contributor

mschoch commented May 18, 2016

There are a few things that are expected to be left around.

  1. The index mapping itself is stored. (frequently this is less than 1k)
  2. The field definitions exist independently of any documents being indexed, so there should be one row per field; in practice, even if you had hundreds of these, each is only a few bytes.

@mmindenhall
Contributor Author

I ran the test with just one iteration, and with just 10 documents. After delete and compact, I ran bleve_dump. There are a huge number (1750, to be exact) of Dictionary Term entries:

Dictionary Term: `<^A^T'k^V3` Field: 2 Count: 0
Key:   64 02 00 3c 01 14 27 6b 16 33
Value: 00

Dictionary Term: `cpu2` Field: 4 Count: 0
Key:   64 04 00 63 70 75 32
Value: 00

Some have binary data (like the first), and some have recognizable strings (like the second). Then at the very bottom, I see the definitions for the fields:

Field: 0 Name: seq
Key:   66 00 00
Value: 73 65 71 ff

Field: 1 Name: id
Key:   66 01 00
Value: 69 64 ff

...

And finally I see the InternalStore which appears to contain the document mapping.

I noticed that all of the Dictionary Term entries have a Count of 0, which made me wonder whether every term bleve encounters is kept in the index, even after the documents containing those terms have been removed.

I attached the dump output if that helps.

bleve_dump.txt.gz

@mmindenhall
Contributor Author

@mschoch, any thoughts on this one?

@mschoch
Contributor

mschoch commented May 20, 2016

I don't really have anything to add; what you found is definitely the case.

It has to do with the fact that we are using a "merge operator" to update the dictionary rows. We built our merge operator API to emulate (and fit in with) what RocksDB had, and at the time that didn't seem to allow us to delete a row as the result of a merge. I think there is now a way to get that behavior, but our API doesn't support it.

So, the net effect is that even if a dictionary term count goes to 0, we don't delete it. Even at the time we knew this was undesirable, but we also didn't expect it to be a big deal.
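
To make the constraint concrete, the merge operator API looks roughly like the following: FullMerge can only return new bytes for the row (or signal failure), so even a count that reaches zero has to be written back as a value. The interface shape and the counter encoding below are approximations, not the exact code:

```go
package example

import "encoding/binary"

// MergeOperator is (approximately) the RocksDB-style interface the KV
// stores implement: the only possible outcome of a merge is new bytes for
// the row, or failure. There is no way to return "delete this key".
type MergeOperator interface {
	FullMerge(key, existingValue []byte, operands [][]byte) ([]byte, bool)
	PartialMerge(key, leftOperand, rightOperand []byte) ([]byte, bool)
	Name() string
}

// countMerge is an illustrative operator in the spirit of the dictionary
// row updates: the existing value and each operand are little-endian
// uint64 counts/deltas (the real encoding may differ).
type countMerge struct{}

func (countMerge) Name() string { return "example_count_merge" }

func (countMerge) FullMerge(key, existingValue []byte, operands [][]byte) ([]byte, bool) {
	var count uint64
	if len(existingValue) == 8 {
		count = binary.LittleEndian.Uint64(existingValue)
	}
	for _, op := range operands {
		if len(op) != 8 {
			return nil, false
		}
		count += binary.LittleEndian.Uint64(op) // decrements arrive as wrapping deltas
	}
	// Even when count reaches 0 we must hand back a row value; the API has
	// no way to say "drop this key", so the zero-count row stays behind.
	buf := make([]byte, 8)
	binary.LittleEndian.PutUint64(buf, count)
	return buf, true
}

func (countMerge) PartialMerge(key, left, right []byte) ([]byte, bool) {
	if len(left) != 8 || len(right) != 8 {
		return nil, false
	}
	buf := make([]byte, 8)
	binary.LittleEndian.PutUint64(buf, binary.LittleEndian.Uint64(left)+binary.LittleEndian.Uint64(right))
	return buf, true
}
```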

The use case you're testing, creating docs and then deleting them all, isn't one we optimized for. Is this a particularly interesting case for you, or are you just testing different things out?

My thought is to leave this open so that we eventually circle back to address it, but it's not particularly high priority for me right now.

@mmindenhall
Contributor Author

We're building IoT gateway software that runs on small footprint devices. For example, we're running on a device with 128MB of RAM, and 128MB of flash (of which about 65% is available for our app + data). We're receiving and indexing "reports" from attached "things", and the partition runs out of space pretty quickly.

So I wrote code to monitor how much space is free and delete the oldest reports (followed by a Compact) to avoid disk-full errors. This solution probably extends our ability to run without filling the disk from hours or days to days or weeks. Eventually it would be great to have this fixed, but it's not urgent.
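
Conceptually, the cleanup looks something like the sketch below; the free-space check uses syscall.Statfs, while oldestReportIDs and compactStore are placeholders rather than bleve APIs, and the details differ from our actual code:

```go
package monitor

import (
	"syscall"

	"github.com/blevesearch/bleve"
)

// freeBytes returns the available bytes on the filesystem holding path
// (Linux/macOS; uses syscall.Statfs).
func freeBytes(path string) (uint64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	return st.Bavail * uint64(st.Bsize), nil
}

// oldestReportIDs and compactStore stand in for application-specific code;
// neither is a bleve API.
var (
	oldestReportIDs func(idx bleve.Index, n int) []string
	compactStore    func(idx bleve.Index) error
)

// pruneUntilFree deletes the oldest reports in batches until free space
// climbs back above lowWater, then compacts to actually reclaim the space.
func pruneUntilFree(idx bleve.Index, indexPath string, lowWater uint64) error {
	for {
		free, err := freeBytes(indexPath)
		if err != nil {
			return err
		}
		if free >= lowWater {
			return nil
		}
		for _, id := range oldestReportIDs(idx, 100) {
			if err := idx.Delete(id); err != nil {
				return err
			}
		}
		if err := compactStore(idx); err != nil {
			return err
		}
	}
}
```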

If you can give me some pointers on what needs to be done, I might be able to convince my boss to let me work on fixing this.

@mschoch
Contributor

mschoch commented May 20, 2016

Well, I don't think it's straightforward to fix. Our current design is broken.

I did another quick review and it seems that the result of a RocksDB merge is always another row, not a delete. Even if we defined our merge wrapper to delete rows if the new row is nil, that wouldn't work correctly with RocksDB's native merge operator.

The only alternative I can see is to have some sort of background process cleaning things up, but I'm not excited about that solution either.
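
One shape such a cleanup could take, whether run in the background or hooked into Compact, is a prefix scan that deletes zero-count dictionary rows directly at the goleveldb level. A sketch, with the key/value layout inferred from the dump earlier in this thread treated as an assumption:

```go
package sweep

import (
	"github.com/syndtr/goleveldb/leveldb"
	"github.com/syndtr/goleveldb/leveldb/util"
)

// sweepZeroCountDictionaryRows scans rows under the dictionary prefix and
// deletes the ones whose count is zero. The dump above suggests dictionary
// keys start with 'd' (0x64) and zero-count rows carry a single 0x00 value;
// both are assumptions about the row encoding.
func sweepZeroCountDictionaryRows(db *leveldb.DB) error {
	iter := db.NewIterator(util.BytesPrefix([]byte{'d'}), nil)
	defer iter.Release()

	batch := new(leveldb.Batch)
	for iter.Next() {
		if isZeroCount(iter.Value()) {
			// Copy the key: the iterator may reuse its buffers.
			key := append([]byte(nil), iter.Key()...)
			batch.Delete(key)
		}
	}
	if err := iter.Error(); err != nil {
		return err
	}
	return db.Write(batch, nil)
}

func isZeroCount(v []byte) bool {
	return len(v) == 0 || (len(v) == 1 && v[0] == 0)
}
```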

mschoch added a commit that referenced this issue Jun 1, 2016
Remove DictionaryTerm with count 0 during compact (workaround for #374)
mschoch added a commit that referenced this issue Jun 8, 2016
Compact for boltdb (workaround for #374)
@mschoch mschoch added the bug label Jun 26, 2016
@mschoch mschoch added this to the 1.0 milestone Jun 26, 2016
@salarali

salarali commented Aug 2, 2016

I have a scenario where I add a doc, run queries on it and then delete it again because I do not need it anymore. This is done quite frequently.

I'm using leveldb as the backend with the in-memory option when creating the index.

@tmm1

tmm1 commented Feb 24, 2017

I've observed the behavior noted here (dictionary terms not deleted when the count goes to 0) with the levigo-based leveldb backend as well.
