Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

query: fix deadlock in endpointset #4795

Merged
merged 1 commit into from
Oct 21, 2021
Merged

Conversation

GiedriusS
Copy link
Member

@GiedriusS GiedriusS commented Oct 21, 2021

Avoid RLock()ing twice as described here:
#4766 (comment)
(due to
https://stackoverflow.com/questions/30547916/goroutine-blocks-when-calling-rwmutex-rlock-twice-after-an-rwmutex-unlock/30549188).
Fix it by removing HasClients() and simply changing it with er.clients != nil.

Closes #4766.

Signed-off-by: Giedrius Statkevičius [email protected]

Avoid RLock()ing twice as described here:
thanos-io#4766 (comment)
(due to
https://stackoverflow.com/questions/30547916/goroutine-blocks-when-calling-rwmutex-rlock-twice-after-an-rwmutex-unlock/30549188).
Fix it by removing HasClients() and simply changing it with `er.clients != nil`.

Signed-off-by: Giedrius Statkevičius <[email protected]>
Copy link
Contributor

@yeya24 yeya24 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick fix!

@GiedriusS GiedriusS merged commit 0429764 into thanos-io:main Oct 21, 2021
@GiedriusS GiedriusS deleted the fix_deadlock branch October 21, 2021 18:51
GiedriusS added a commit to GiedriusS/thanos that referenced this pull request Oct 22, 2021
Avoid RLock()ing twice as described here:
thanos-io#4766 (comment)
(due to
https://stackoverflow.com/questions/30547916/goroutine-blocks-when-calling-rwmutex-rlock-twice-after-an-rwmutex-unlock/30549188).
Fix it by removing HasClients() and simply changing it with `er.clients != nil`.

Signed-off-by: Giedrius Statkevičius <[email protected]>
@hanjm
Copy link
Member

hanjm commented Dec 1, 2021

Hi @GiedriusS @squat , I found at least 4 people report this bug, it critical because query will halt . Could we released a miro version like 0.23.2, just cherry pick this import pr. Thanks very much.

@squat
Copy link
Member

squat commented Dec 1, 2021

This PR is in the release candidate for 0.24.0. i think that's enough. Or do you think it's so severe that we need to back port the fix?

@hanjm
Copy link
Member

hanjm commented Dec 1, 2021

@squat I think it's severe to do that. Someone use 0.23.x always face this deadlock issue, the query api will halt and process will not auto reboot, It will be nice to have 0.23.2 for 0.23.x series. 😄

I am also face this issue on production upgrade, other case like https://cloud-native.slack.com/archives/CK5RSSC10/p1637585247351700 https://cloud-native.slack.com/archives/CK5RSSC10/p1637265253335900 .

aymericDD pushed a commit to aymericDD/thanos that referenced this pull request Dec 6, 2021
Avoid RLock()ing twice as described here:
thanos-io#4766 (comment)
(due to
https://stackoverflow.com/questions/30547916/goroutine-blocks-when-calling-rwmutex-rlock-twice-after-an-rwmutex-unlock/30549188).
Fix it by removing HasClients() and simply changing it with `er.clients != nil`.

Signed-off-by: Giedrius Statkevičius <[email protected]>
aymericDD pushed a commit to aymericDD/thanos that referenced this pull request Dec 7, 2021
Avoid RLock()ing twice as described here:
thanos-io#4766 (comment)
(due to
https://stackoverflow.com/questions/30547916/goroutine-blocks-when-calling-rwmutex-rlock-twice-after-an-rwmutex-unlock/30549188).
Fix it by removing HasClients() and simply changing it with `er.clients != nil`.

Signed-off-by: Giedrius Statkevičius <[email protected]>
Signed-off-by: Aymeric <[email protected]>
bwplotka pushed a commit that referenced this pull request Dec 7, 2021
Avoid RLock()ing twice as described here:
#4766 (comment)
(due to
https://stackoverflow.com/questions/30547916/goroutine-blocks-when-calling-rwmutex-rlock-twice-after-an-rwmutex-unlock/30549188).
Fix it by removing HasClients() and simply changing it with `er.clients != nil`.

Signed-off-by: Giedrius Statkevičius <[email protected]>
Signed-off-by: Aymeric <[email protected]>

Co-authored-by: Giedrius Statkevičius <[email protected]>
george-angel added a commit to utilitywarehouse/thanos-manifests that referenced this pull request Dec 22, 2021
Mainly to address thanos-io/thanos#4795 - which
is happening frequently in prod-aws/sys-mon
george-angel added a commit to utilitywarehouse/thanos-manifests that referenced this pull request Dec 22, 2021
Mainly to address thanos-io/thanos#4795 - which
is happening frequently in prod-aws/sys-mon
JoaoBraveCoding added a commit to JoaoBraveCoding/cluster-monitoring-operator that referenced this pull request Nov 4, 2022
Issue: https://issues.redhat.com/browse/OCPBUGS-2037

Problem: thanos-io/thanos#4766

Solution: update to 0.23.2 that contains the patch thanos-io/thanos#4795

Signed-off-by: JoaoBraveCoding <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

query: Query API is unresponsive but health/ready APIs report healthy/ready
4 participants