Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(server): rdb loader catch bad alloc #1748

Merged
merged 2 commits into from
Aug 27, 2023
Merged

Conversation

adiholden
Copy link
Collaborator

@adiholden adiholden commented Aug 27, 2023

While loading rdb snapshot, if oom is reached a bad alloc exception is thrown. Now we catch it and write warning to log.
fixes #1708

While loading rdb snapshot, if oom is reached a bad alloc exception is thrown. Now we
catch it and write warning to log.

Signed-off-by: adi_holden <[email protected]>
if (!added) {
LOG(WARNING) << "RDB has duplicated key '" << item->key << "' in DB " << db_ind;
try {
auto [it, added] = db_slice.AddOrUpdate(db_cntx, item->key, std::move(pv), item->expire_ms);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@romange I did not fail the loading if we are OOM, just writing the warning to log. Do you think we should fail loading?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I prefer to fail

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, we are missing a bigger picture here. And I am not saying we should not fix this specific point but the big issue is that both master and replica are configured the same way but the replication fails because we are near the capacity at the master. The master does not fail so why replica fails?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And to @royjacobson point I agree we should fail here: unrolling the operation is better than proceeding with an inconsistent state, but we also should not get to this point at all in this specific scenario.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I updated the changes so that rdb loader load will return error on oom. Note that today if we have error on replica we fail the replica load and on master when using loading from disc we dont fail to start the service if we have errors in rdb loader. If you think we should fail to start the master service on load fail I can do this in a separate PR.
Also I will continue to investigate why replica fails assuming master and replica has the same memory limit

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what do you mean we do not fail when master has errors during load? what happens then? Do we load partial data?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Signed-off-by: adi_holden <[email protected]>
@adiholden adiholden merged commit 901d3ff into main Aug 27, 2023
@adiholden adiholden deleted the rdb_load_catch_bad_alloc branch August 27, 2023 10:48
kostasrim pushed a commit that referenced this pull request Aug 29, 2023
While loading rdb snapshot, if oom is reached a bad alloc exception is thrown. Now we
catch it and write warning to log and fali loader.

Signed-off-by: adi_holden <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

bad_alloc on replica when master near maxmemory (cache_mode=true)
3 participants