Disable JetStream on disk errors #6292
We should probably have better detection of a wider array of filesystem errors like permissions etc, agreed, but I'm not entirely sure why this would be a configurable option and I don't particularly like it either.
The disk being pulled out from underneath JetStream is effectively a catastrophic operational issue and for JS to continue to try to operate in those circumstances feels problematic, i.e. with the potential for data loss that no one may even notice. Even more so in a clustered scenario where the metalayer and Raft logs are being written out to disk.
Thank you for taking the time to review this.
Reopening PR, as it was closed mistakenly.
Thank you for the suggestion @neilalexander. I have implemented the recommended change and removed the flag.
Hello @neilalexander, Happy New Year.
Signed-off-by: Sourabh Agrawal <[email protected]>
I think having the filestore.go call DisableJetStream directly may not be the best approach here. In a number of places we already surface errors up to stream.go, i.e. for isOutOfSpaceErr.
It feels like the more natural thing to do here would be to extend that same pattern, such that the stream examines surfaced errors and reacts, similar to how we do with the disk out-of-space warnings. Otherwise we end up with different types of disk errors being handled differently and that makes the code difficult to trace.
server/filestore.go (outdated)
- fs.expireMsgsOnRecover()
+ err := fs.expireMsgsOnRecover()
+ if err != nil && err == errFileSystemPermissionDenied {
+ 	fs.srv.Warnf("file system permission denied while expiring msgs, disabling jetstream: %v", err)
Please make sure any log lines are capitalised/cased correctly.
server/filestore.go (outdated)
- os.Remove(mb.mfn)
+ err := os.Remove(mb.mfn)
+ if err != nil && os.IsPermission(err){
+ 	return errFileSystemPermissionDenied
Is there a reason to do this rather than just check os.IsPermission further up the callstack?
Makes sense, I will change it.
It may even make sense to define a new error-detect function for this case, similar to how we already have this in store.go for out-of-space:
func isOutOfSpaceErr(err error) bool {
	return err != nil && (strings.Contains(err.Error(), "no space left"))
}
I agree; it makes sense to handle this within the same pattern. I have implemented the change as you suggested.
Currently, there are scenarios where NATS JetStream may encounter permission errors when the file system goes into read-only mode, which can lead to an inconsistent state. In such cases, the system resets the stream state and continues to accept published messages, leaving the consumer and stream sequences misaligned.
This PR introduces changes to gracefully handle these permission errors and prevent NATS from continuing in an inconsistent state when:
After this PR, if NATS is running in non-clustered mode, the user will be unable to issue write requests until the issue is resolved. In clustered mode, only the affected node will stop accepting requests, while the system will continue to function as long as the required quorum remains healthy.
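To make the clustered case concrete: Raft-based replication needs a majority of replicas to make progress, so one disabled node is tolerable in a three-node group but not in a single-node setup. The helper below is an illustrative calculation with hypothetical names (quorum, canAcceptWrites), not server code.

```go
package main

// quorum returns the majority size for a replica group of n nodes.
func quorum(n int) int {
	return n/2 + 1
}

// canAcceptWrites reports whether a group of n replicas still has a
// majority when `disabled` of them have JetStream turned off.
func canAcceptWrites(n, disabled int) bool {
	if n <= 0 || disabled < 0 || disabled > n {
		return false
	}
	return n-disabled >= quorum(n)
}
```

For example, canAcceptWrites(1, 1) is false (the non-clustered case), while canAcceptWrites(3, 1) is true and canAcceptWrites(3, 2) is false.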
This PR potentially fixes #6211, in which the consumer sequence can end up higher than the stream sequence.
Signed-off-by: Sourabh Agrawal [email protected]