Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storegateway: Detect OOM During Eagerly Loading Index Header #5644

Closed

Conversation

lamida
Copy link
Contributor

@lamida lamida commented Aug 1, 2023

Note: This PR is purposely targeting lamida/sg-eager-load-headers (#5596) for easier diffs, later the target will be changed to lamida/sg-lazy-load branch once #5596 is merged. Eventually, all lazy loading PRs will be merged to main branch in #5606.

What this PR does

Detect OOM in store gateway during eagerly loading lazy loaded index headers in startup. We are writing a marker file when store gateway starting to do eager loading and removing the marker once eager loading runs successfully. If there is a marker file at the beginning of eager loading, that could indicate previous eager loading failed, probably by OOM, and we skip eager loading the index header for such cases.

As part of this PR, I also moved Exists method from shutdownmarker package to atomicfs package because the later package seems more appropriate for the general use of the method.

Which issue(s) this PR fixes or relates to

Fixes #4762

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@lamida lamida changed the title Detect OOM During Eagerly Loading Index Header storegateway: Detect OOM During Eagerly Loading Index Header Aug 1, 2023
@lamida lamida marked this pull request as ready for review August 1, 2023 10:43
@lamida lamida requested a review from a team as a code owner August 1, 2023 10:43
@lamida lamida requested a review from charleskorn August 1, 2023 10:43
lamida added 6 commits August 1, 2023 19:18
The OOM marker should indicate when store gateway crashes before
and in the existence of the marker, we will skip eager loading.
Signed-off-by: Jon Kartago Lamida <[email protected]>
@lamida lamida force-pushed the lamida/sg-eager-load-oom-handler branch from 3151e5a to 84f7f1b Compare August 1, 2023 11:25
@lamida lamida requested a review from dimitarvdimitrov August 1, 2023 11:39
Signed-off-by: Jon Kartago Lamida <[email protected]>

// Exists returns true if a file existed in the path p.
// TODO: move shutdown marker usage to use this instead
func Exists(p string) (bool, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with moving this out of shutdownmarker, but I'm not sure this belongs in atomicfs, as it's not an atomic file operation. Perhaps this can go in a general filesystem-related package?

@@ -26,6 +27,7 @@ import (
)

var lazyLoadedHeadersListFile = "lazy-loaded.json"
var eagerLoadOOMMarkerFile = "eager-load-oom-marker.txt"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should call this something else - while OOMing is one reason why the store-gateway could crash while eagerly loading index headers, it's not the only reason.

@@ -214,6 +216,34 @@ func (p *ReaderPool) NewBinaryReader(ctx context.Context, logger log.Logger, bkt
return reader, err
}

func (p *ReaderPool) tryEagerLoadIndexHeadersSnapshot(lazyBinaryReader *LazyBinaryReader, dir string, blockID ulid.ULID) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to handle failing to eagerly load index headers at a higher level.

If I understand correctly, this will create and delete the marker file once per block, and that marker is scoped per-tenant.

However, if store-gateway crashes while eagerly loading index headers (eg. because we OOMed), I'm thinking it's best to not eagerly load any index headers from any tenant when the store-gateway restarts.

With the current setup, we'll only skip the tenants that were being loaded at the time of the crash, and still try to eagerly load all other tenants. However, imagine a scenario where the first tenant eagerly loads X GB, but the next tenant loads just enough to push the store-gateway over its memory limit, causing it to OOM and be killed. On the next run, we'll still load the first tenant's index headers, and only skip the second tenant.

@lamida
Copy link
Contributor Author

lamida commented Aug 8, 2023

@charleskorn before I continue to address your comment in this PR, last week, I had discussion with @dimitarvdimitrov on the need to have this store gateway crash handler.

TLDR: We might not need this PR.

It is because in #5596 we will remove the snapshot file after loading the block list from the json file. Only after that we will eager load all blocks listed in the json file. If one of block eager loading caused store gateway to crash, after restart, the json file is already gone, hence no more eager loading.

@charleskorn
Copy link
Contributor

@charleskorn before I continue to address your comment in this PR, last week, I had discussion with @dimitarvdimitrov on the need to have this store gateway crash handler.

TLDR: We might not need this PR.

It is because in #5596 we will remove the snapshot file after loading the block list from the json file. Only after that we will eager load all blocks listed in the json file. If one of block eager loading caused store gateway to crash, after restart, the json file is already gone, hence no more eager loading.

That sounds great to me, much simpler 🙂

@lamida
Copy link
Contributor Author

lamida commented Aug 10, 2023

Closing this as the discussion above.

@lamida lamida closed this Aug 10, 2023
@lamida lamida deleted the lamida/sg-eager-load-oom-handler branch October 22, 2024 01:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants