Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

#7514 Fix channel deadlock in meta sync fetcher #7933

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

mfoldenyi
Copy link

@mfoldenyi mfoldenyi commented Nov 22, 2024

Worker threads no longer exit on error, instead they collect them into a multiError and keep working.

Fix proposal for #7514

  • I added CHANGELOG entry for this change due to it fixing a deadlock that potentially affect all components
  • Change is not relevant to the end user.

Changes

Changed concurrent worker behavior in case of error in Exists call, they no longer exit on error, instead collect errors in a multiError and continue working.

Verification

Added unit test simulating errors in the objstore. It is sending a sufficient amount of ULIDs to check and simulates failure on all of them. In the old scenario this would have resulted in a deadlock.

@mfoldenyi mfoldenyi marked this pull request as ready for review November 22, 2024 08:58
@mfoldenyi
Copy link
Author

@fpetkovski Could you please take a look when you have some time?

Copy link
Member

@GiedriusS GiedriusS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any way we could add a unit test here? I'm afraid that otherwise we will have this problem again

@mfoldenyi
Copy link
Author

@GiedriusS Sure, I will try to implement a test which simulates store errors. it should be running into the deadlock with the original code, but not with the patch. I hopefully can have something by the end of the week.

@pull-request-size pull-request-size bot added size/L and removed size/M labels Dec 7, 2024
@mfoldenyi mfoldenyi force-pushed the issue-7514 branch 2 times, most recently from e9b250d to 3f7b3aa Compare December 7, 2024 16:16
@mfoldenyi
Copy link
Author

@GiedriusS I created a PR with just the test case to show the error by itself with the code on main: https://github.com/thanos-io/thanos/pull/7966/files

The same test case is passing with the provided fix.

pkg/block/fetcher.go Outdated Show resolved Hide resolved
multiError = multierror.Append(eg.Wait(), multiError)

if multiError != nil {
if multiErrorUnwrap, ok := err.(interface{ Unwrap() []error }); ok {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should always succeed unless we change the library, right? Perhaps in the !ok clause we could put a panic("BUG: multierr doesn't implement unwrap") to make sure that in the future we would notice this part if suddenly err wouldn't implement this interface?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the panic with a comment. Seeing that Unwrap is basically the core feature that go-multierror adds to the mix this should never happen. The panic therefore would be absolutely warranted if it still happens. :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran into a bunch of issues mostly due to the panic. I then realised that this could be so much simpler, and I refactored it so that the panic is no longer necessary, it would just give a compile error if something major changed in the imported package.

I also realised that my test case's success condition was incorrect. I cannot assert on no errors occuring, since the whole point is that simulated errors are occurring. What I should (and now I am) testing for is that I get exactly the expected number of errors. If there was a deadlock, it would be missing some of them.

I ended up squashing the entire change-set as I forgot to sign the commits. ( :/ )

@mfoldenyi
Copy link
Author

Squashed and rebased the reviewed version, added the new changes in a new commit.

@mfoldenyi mfoldenyi force-pushed the issue-7514 branch 3 times, most recently from 5c3fe50 to 2a31948 Compare December 23, 2024 18:03
Errors no longer take out the thread with them, instead are collected into a multierror.

Signed-off-by: Miklós Földényi <[email protected]>
@mfoldenyi
Copy link
Author

Rebased on main. Please let me know if there is anything else for this to be merged!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants