feat(rdb_load): add support for loading huge streams #3855

Merged (2 commits) on Oct 5, 2024


Contributor

@andydunstall andydunstall commented Oct 3, 2024

Follows #3850 to add support for loading huge streams (#3760).

This loads the stream entries in partial reads, but loads the stream metadata and consumer groups in a single read (assuming consumer groups are relatively small and so don't need partial reads).

As with lists, this loads streams in segments of 512 nodes, since each stream node can contain up to 4kb of elements.
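A minimal sketch of how a read of many stream nodes might be split into fixed-size segments. The constant `kSegmentLen` and the helper `SegmentSizes` are hypothetical names used only for illustration; they are not taken from the Dragonfly source.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical segment length: at most 512 nodes per partial read,
// matching the figure quoted in the description.
constexpr std::size_t kSegmentLen = 512;

// Returns the number of nodes consumed by each partial read when
// loading `total_nodes` stream nodes.
std::vector<std::size_t> SegmentSizes(std::size_t total_nodes) {
  std::vector<std::size_t> sizes;
  std::size_t remaining = total_nodes;
  while (remaining > 0) {
    // Each read takes a full segment, or whatever is left on the last read.
    std::size_t n = std::min(remaining, kSegmentLen);
    assert(n > 0);
    sizes.push_back(n);
    remaining -= n;
  }
  return sizes;
}
```

With this scheme only one segment of nodes needs to be held in memory at a time, rather than the whole stream, which is consistent with the lower RSS reported in this PR.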

Also removes the outer level of the nested Ltrace::arr vector, as we now only use a single array. This makes YieldIfNeeded redundant, so it is removed as well.

Comparing a 5GB stream:

  • main: 4.8s / ~13GB RSS
  • load-huge-streams: 2.6s / ~7GB RSS

ec_ = RdbError(errc::rdb_file_corrupted);
return;
}
// We only load the stream_trace on the final read, so if not read we
Collaborator

I did not understand this comment. Can you explain please?

Contributor Author

I've updated the comment.

What I mean is that ReadStreams is split into two sections:

  • Reading the stream entries (ltrace->arr)
  • Reading the stream metadata and consumer groups (ltrace->stream_trace)

Loading the stream metadata and consumer groups in partial reads would be quite complex, and I'm guessing they aren't expected to be large enough to require partial reads, so I wasn't sure it's worth trying to load consumer groups that way.

The simplest option seems to be to load only the stream entries (ltrace->arr) in partial reads, and then on the final read also read the stream metadata and consumer groups (ltrace->stream_trace).
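The two-phase approach described above can be sketched roughly as follows. All types and names here (`StreamMeta`, `PartialRead`, `StreamReader`) are hypothetical stand-ins, not Dragonfly's actual classes; the sketch only illustrates that entries arrive in segments while the metadata is attached to the final read.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for stream metadata and consumer groups.
struct StreamMeta {
  std::size_t num_groups = 0;
};

// Result of one partial read.
struct PartialRead {
  std::vector<std::string> entries;       // plays the role of ltrace->arr
  std::optional<StreamMeta> stream_meta;  // plays the role of ltrace->stream_trace
};

class StreamReader {
 public:
  // segment_len must be greater than zero.
  StreamReader(std::vector<std::string> nodes, StreamMeta meta, std::size_t segment_len)
      : nodes_(std::move(nodes)), meta_(meta), segment_len_(segment_len) {
    assert(segment_len_ > 0);
  }

  bool Done() const { return done_; }

  // Reads up to segment_len_ entries. The metadata is set only on the
  // read that consumes the last entry.
  PartialRead ReadNext() {
    PartialRead out;
    std::size_t n = std::min(segment_len_, nodes_.size() - pos_);
    for (std::size_t i = 0; i < n; ++i) {
      out.entries.push_back(nodes_[pos_ + i]);
    }
    pos_ += n;
    if (pos_ == nodes_.size()) {
      out.stream_meta = meta_;  // final read: load metadata in one go
      done_ = true;
    }
    return out;
  }

 private:
  std::vector<std::string> nodes_;
  StreamMeta meta_;
  std::size_t segment_len_;
  std::size_t pos_ = 0;
  bool done_ = false;
};
```

A consumer of such a reader can treat a missing `stream_meta` as a signal that more entries are still to come, and finalize the stream only once the metadata is present.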

@@ -124,7 +124,15 @@ tuple<const CommandId*, absl::InlinedVector<string, 5>> GeneratePopulateCommand(
}
json[json.size() - 1] = '}'; // Replace last ',' with '}'
args.push_back(json);
} else if (type == "STREAM") {
Contributor Author

@andydunstall andydunstall Oct 3, 2024

Added this as it's useful for testing, though it differs a bit from the other populate commands: XADD adds a single stream entry containing multiple elements, so each key still has only one entry, which is why the test calls populate 2000 times.

I can remove it if preferred and just move this logic into the test, though it sped up the test and is useful for manual testing.
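For illustration, a rough sketch of how a single XADD command with several field/value pairs could be assembled. The helper `MakeXAddArgs` and the `fieldN`/`valueN` naming are hypothetical and not the code from this PR; the only thing relied on is the standard `XADD key * field value [field value ...]` syntax, where `*` asks the server to generate the entry ID.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Builds the tokens of one XADD command that adds a single stream entry
// holding `num_elements` field/value pairs (hypothetical helper).
std::vector<std::string> MakeXAddArgs(const std::string& key, std::size_t num_elements) {
  assert(num_elements > 0);  // XADD requires at least one field/value pair
  std::vector<std::string> args{"XADD", key, "*"};
  for (std::size_t i = 0; i < num_elements; ++i) {
    args.push_back("field" + std::to_string(i));
    args.push_back("value" + std::to_string(i));
  }
  return args;
}
```

Each call produces one entry per key, so building a stream with many entries means issuing the command repeatedly against the same key, as the test does.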

@andydunstall andydunstall marked this pull request as ready for review October 4, 2024 04:09
@romange romange merged commit 4dbed3f into dragonflydb:main Oct 5, 2024
9 checks passed
kostasrim pushed a commit that referenced this pull request Oct 7, 2024
* chore: remove RdbLoad Ltrace::arr nested vector

* feat(rdb_load): add support for loading huge streams