-
Notifications
You must be signed in to change notification settings - Fork 569
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storegateway: Detect OOM During Eagerly Loading Index Header #5644
storegateway: Detect OOM During Eagerly Loading Index Header #5644
Conversation
The OOM marker should indicate when store gateway crashes before and in the existence of the marker, we will skip eager loading.
Signed-off-by: Jon Kartago Lamida <[email protected]>
Signed-off-by: Jon Kartago Lamida <[email protected]>
Signed-off-by: Jon Kartago Lamida <[email protected]>
Signed-off-by: Jon Kartago Lamida <[email protected]>
Signed-off-by: Jon Kartago Lamida <[email protected]>
3151e5a
to
84f7f1b
Compare
Signed-off-by: Jon Kartago Lamida <[email protected]>
|
||
// Exists returns true if a file existed in the path p. | ||
// TODO: move shutdown marker usage to use this instead | ||
func Exists(p string) (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with moving this out of shutdownmarker
, but I'm not sure this belongs in atomicfs
, as it's not an atomic file operation. Perhaps this can go in a general filesystem-related package?
@@ -26,6 +27,7 @@ import ( | |||
) | |||
|
|||
var lazyLoadedHeadersListFile = "lazy-loaded.json" | |||
var eagerLoadOOMMarkerFile = "eager-load-oom-marker.txt" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should call this something else - while OOMing is one reason why the store-gateway could crash while eagerly loading index headers, it's not the only reason.
@@ -214,6 +216,34 @@ func (p *ReaderPool) NewBinaryReader(ctx context.Context, logger log.Logger, bkt | |||
return reader, err | |||
} | |||
|
|||
func (p *ReaderPool) tryEagerLoadIndexHeadersSnapshot(lazyBinaryReader *LazyBinaryReader, dir string, blockID ulid.ULID) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to handle failing to eagerly load index headers at a higher level.
If I understand correctly, this will create and delete the marker file once per block, and that marker is scoped per-tenant.
However, if store-gateway crashes while eagerly loading index headers (eg. because we OOMed), I'm thinking it's best to not eagerly load any index headers from any tenant when the store-gateway restarts.
With the current setup, we'll only skip the tenants that were being loaded at the time of the crash, and still try to eagerly load all other tenants. However, imagine a scenario where the first tenant eagerly loads X GB, but the next tenant loads just enough to push the store-gateway over its memory limit, causing it to OOM and be killed. On the next run, we'll still load the first tenant's index headers, and only skip the second tenant.
@charleskorn before I continue to address your comment in this PR, last week, I had discussion with @dimitarvdimitrov on the need to have this store gateway crash handler. TLDR: We might not need this PR. It is because in #5596 we will remove the snapshot file after loading the block list from the json file. Only after that we will eager load all blocks listed in the json file. If one of block eager loading caused store gateway to crash, after restart, the json file is already gone, hence no more eager loading. |
That sounds great to me, much simpler 🙂 |
Closing this as the discussion above. |
Note: This PR is purposely targeting
lamida/sg-eager-load-headers
(#5596) for easier diffs, later the target will be changed tolamida/sg-lazy-load
branch once #5596 is merged. Eventually, all lazy loading PRs will be merged tomain
branch in #5606.What this PR does
Detect OOM in store gateway during eagerly loading lazy loaded index headers in startup. We are writing a marker file when store gateway starting to do eager loading and removing the marker once eager loading runs successfully. If there is a marker file at the beginning of eager loading, that could indicate previous eager loading failed, probably by OOM, and we skip eager loading the index header for such cases.
As part of this PR, I also moved
Exists
method fromshutdownmarker
package toatomicfs
package because the later package seems more appropriate for the general use of the method.Which issue(s) this PR fixes or relates to
Fixes #4762
Checklist
CHANGELOG.md
updated - the order of entries should be[CHANGE]
,[FEATURE]
,[ENHANCEMENT]
,[BUGFIX]