
feat(rdb_load): add support for loading huge sets #3807

Merged: 3 commits into dragonflydb:main, Sep 29, 2024

Conversation

@andydunstall (Contributor) commented Sep 27, 2024

Adds support for loading huge sets (#3760)

/*
* TODO
*
* I'm quite confused by the existing logic here, where we limit

@romange (Collaborator):

If I remember correctly, it was more efficient to allocate smaller arrays than a single huge array. And I agree with you: let's start with a single "static" rule of up to kMaxBlobLen for streaming.

@andydunstall (Author):

OK, sure.
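
A minimal sketch of the chunked streaming rule agreed in this thread, assuming hypothetical read/flush callbacks; kMaxBlobLen is the real constant discussed above (the value 4096 is the one chosen later in this conversation), everything else is illustrative rather than the actual rdb_load.cc code:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch: stream a set of `total` members in chunks of at most
// kMaxBlobLen instead of allocating one huge array for the whole set.
constexpr size_t kMaxBlobLen = 4096;

void LoadSetChunked(size_t total,
                    const std::function<std::string()>& read_member,
                    const std::function<void(std::vector<std::string>)>& flush) {
  size_t remaining = total;
  while (remaining > 0) {
    size_t chunk = std::min(remaining, kMaxBlobLen);
    std::vector<std::string> blob;
    blob.reserve(chunk);  // small, bounded allocation per chunk
    for (size_t i = 0; i < chunk; ++i)
      blob.push_back(read_member());
    flush(std::move(blob));  // hand the chunk off; nothing stays resident
    remaining -= chunk;
  }
}
```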

// Despite being async, this function can block if the shard queue is full.
FlushShardAsync(sid);
}
out_buf.emplace_back(item);

@romange (Collaborator):

nit: move(item)

@andydunstall (Author):

Thanks, updated.
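
For reference, the nit amounts to this one-line change (out_buf and item are the names from the diff above):

```cpp
out_buf.emplace_back(std::move(item));  // move the item instead of copying it
```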

@romange (Collaborator) left a review:

looks good :)

@romange (Collaborator) commented Sep 28, 2024

You can use rdb_test.cc to add the unit test.

@andydunstall (Author) commented Sep 28, 2024

Sure, I've added a unit test.

Testing a 5GB set manually with 5m entries (debug populate 1 test 1000 rand type set elements 5000000):

RSS is still 10GB (5GB used), since it still does only a single flush (because kMaxBufSize x kMaxBlobLen > 8m).

It's also now ~20% slower, which seems to be due to no longer reserving the full set up front. I could include the full set size in Item to reserve on first load, if you think it's worth doing.

Repeating the test with a flush whenever pending_read.remaining > 0, RSS is < 7GB (5GB used) and it's 20% faster. Not sure if you think that's worth adding, or whether there's another flushing approach we should use?

@romange (Collaborator) commented Sep 28, 2024

I think it's worth reserving capacity if it's easy to add. Also, we can now reduce kMaxBlobLen to a smaller number, since we know large sets will just occupy all the items in ItemsBuf.
Using pending_read.remaining as a flush signal is also possible and could be a good short-term heuristic.
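
A hypothetical sketch of the pending_read.remaining flush signal mentioned above; the struct and helper below are illustrative stand-ins, not the actual rdb_load.cc types:

```cpp
#include <cstddef>

// Illustrative stand-in for the loader's streaming state (an assumption).
struct PendingRead {
  size_t remaining = 0;  // members of the current huge object not yet read
};

// Flush when the items buffer is full, or immediately while mid-way through
// streaming a huge object, so its partial chunks never accumulate.
bool ShouldFlush(const PendingRead& pending_read, size_t buffered,
                 size_t max_buffered) {
  return buffered >= max_buffered || pending_read.remaining > 0;
}
```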

@andydunstall (Author):

OK, sure. I've reduced kMaxBlobLen to 4096, meaning the 5m set is split into ~10 flushes (128 × 4096 ≈ 500k entries per flush).

I've also reserved the full set size up front.

With that, loading the 5GB set takes 2.8s (compared to 3.8s before), with RSS at ~7GiB (~10GiB before).
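
A minimal sketch of reserving the full set size on the first chunk, as described above; the Item fields and the std::unordered_set target are assumptions for illustration (Dragonfly's actual set type and Item layout differ):

```cpp
#include <cstddef>
#include <string>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical chunk payload: the first chunk carries the full cardinality.
struct Item {
  std::vector<std::string> members;  // this chunk's members
  size_t total_size = 0;             // full set size, meaningful on chunk 0
  bool append = false;               // true for every chunk after the first
};

void ApplyChunk(std::unordered_set<std::string>* set, Item item) {
  if (!item.append)
    set->reserve(item.total_size);  // one up-front reserve avoids rehashing
  for (auto& m : item.members)
    set->insert(std::move(m));
}
```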

@andydunstall changed the title from "feat(rdb_load): add support for loading huge sets (WIP)" to "feat(rdb_load): add support for loading huge sets" on Sep 28, 2024
@andydunstall marked this pull request as ready for review on Sep 28, 2024, 09:56
@@ -2505,6 +2557,11 @@ void RdbLoader::LoadItemsBuffer(DbIndex db_ind, const ItemsBuf& ib) {
stop_early_ = true;
break;
}
if (item->load_config.append) {

@romange (Collaborator):

Why did you add this code?

@andydunstall (Author) commented Sep 29, 2024:

A bit of a guess, given it already does ts->TryStash below in the non-append case. Should I remove it?

(Or, if you mean why it continues early in the append case: that's because we don't want to check expiry, add to db_slice, or update flags when appending, right?)

@romange (Collaborator):

Let's remove this code:

  1. Data tiering does not work with non-strings.
  2. We do not want to stash values that are not yet finalized.

@andydunstall (Author):

Sure, removed.
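
For context, a hypothetical sketch of the append handling discussed in this thread (illustrative assumptions throughout; the removed ts->TryStash branch is not shown): appended chunks extend an existing value and skip the expiry, db_slice, and flag handling of a first load.

```cpp
#include <vector>

// Illustrative types, not the actual rdb_load.cc definitions.
struct LoadConfig {
  bool append = false;  // true when this item extends an existing value
};
struct Item {
  LoadConfig load_config;
};

void AppendToExistingValue(Item*) { /* extend the partially loaded value */ }
void CreateNewEntry(Item*) { /* first-load path: expiry, db_slice, flags */ }

// Sketch: items flagged as appends bypass the normal first-load handling.
void ProcessItems(const std::vector<Item*>& ib) {
  for (Item* item : ib) {
    if (item->load_config.append) {
      AppendToExistingValue(item);  // just extend the value...
      continue;                     // ...and skip expiry/flags/stash logic
    }
    CreateNewEntry(item);
  }
}
```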

Resolved (outdated) review threads: src/server/rdb_load.cc, src/server/rdb_load.h (two threads)

@romange (Collaborator) left a review:

LGTM, plus minor comments.

@romange (Collaborator) commented Sep 29, 2024

Good work @andydunstall. If you want, you can follow up with a hash-map + zset PR as well.

@andydunstall (Author):

> If you want you can follow up with hash-map + zset PR as well

Thanks, will do.

@romange merged commit 520dea0 into dragonflydb:main on Sep 29, 2024. 9 checks passed.

romange pushed a commit referencing this pull request on Sep 30, 2024:
* feat(rdb_load): add support for loading huge sets