
Fixing read race condition during pubsub #1737

Merged - 11 commits merged into redis:master on Dec 23, 2021

Conversation

@barshaul (Contributor) commented Nov 22, 2021

Pull Request check-list

Please make sure to review and check all of these items:

  • [x] Does $ tox pass with this change (including linting)?
  • [x] Do the CI tests pass with this change (enable it first in your forked repo and wait for the github action build to finish)?
  • [x] Is the new or changed code fully tested?
  • [x] Is a documentation update included (if this change modifies existing APIs, or introduces new ones)?

NOTE: these things are not required to open a PR and can be done
afterwards / while the PR is open.

Description of change

closes #1720
closes #1740
closes #1733

Another implementation to #1720 (first impl: #1733)
In this PR I'm adding an option to call pubsub's get_message() without subscribing first.
If get_message() is called while no channel/pattern is subscribed, the method returns None without trying to read from the connection.
If a timeout is passed and no channels are subscribed yet, get_message() waits until either the first subscription is made or the timeout expires.
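The behaviour described above can be sketched with a stdlib threading.Event. This is a toy model with hypothetical names, not the actual redis-py implementation:

```python
import threading


class SubscriptionGate:
    """Toy model of the wait-for-subscription gate described above (not redis-py code)."""

    def __init__(self):
        self.subscribed_event = threading.Event()

    def subscribe(self):
        # Called by the main thread once a SUBSCRIBE command is issued.
        self.subscribed_event.set()

    def get_message(self, timeout=0.0):
        if not self.subscribed_event.is_set():
            if timeout > 0:
                # Block until a subscription is made or the timeout expires.
                if not self.subscribed_event.wait(timeout):
                    return None
            else:
                # No subscriptions and no timeout: return immediately.
                return None
        return "message"  # placeholder for reading from the connection


gate = SubscriptionGate()
print(gate.get_message())           # None: nothing subscribed, returns without reading
gate.subscribe()
print(gate.get_message(timeout=1))  # "message": subscribed, so the read proceeds
```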

@bmerry (Contributor) commented Nov 22, 2021

Being able to start polling without making a subscription would be nice - I ran into this limitation recently (although in aioredis).

An immediate problem I can see with this is that get_message can take up to double the timeout: it may first wait just under timeout for a subscription, then another full timeout for a message (plus up to another 0.25s because of the polling in wait_for_subscription).
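One way to avoid the doubled wait (a sketch under assumed names, not the PR's code) is to derive both waits from a single deadline. `subscribed_event` and `_read_message` below are hypothetical stand-ins for the pubsub internals:

```python
import time


def get_message_with_deadline(pubsub, timeout):
    """Sketch: cap the total wait (subscription wait + socket read) at one timeout.

    `pubsub` is assumed to expose `subscribed_event` (a threading.Event)
    and a low-level `_read_message(timeout)`; both names are hypothetical.
    """
    deadline = time.monotonic() + timeout
    # Wait for a subscription, but only up to the shared deadline.
    if not pubsub.subscribed_event.wait(max(0.0, deadline - time.monotonic())):
        return None
    # Spend only the remaining time reading from the connection.
    remaining = max(0.0, deadline - time.monotonic())
    return pubsub._read_message(timeout=remaining)
```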

I feel like this probably still has a race somewhere, because by the time wait_for_subscription returns, the main thread might already have asked to unsubscribe. It may be that works as long as some timing assumptions hold (e.g. round-trip time to the server is less than the health check interval) but breaks if the server suffers high latency. I'll poke at it some more and see if I can produce an explicit example.

@bmerry (Contributor) commented Nov 22, 2021

Looking at it again, I think the race I was worried about can't happen - provided that there is just one thread calling (un)subscription functions ("main thread") and one thread using get_message/listen (poller thread). Here's my logic, in case anyone wants to double-check. I consider three possible states:

  1. Subscribed: there are subscriptions. self.subscribed is true.
  2. Semi-subscribed: we've issued UNSUBSCRIBE commands for all subscriptions, but not yet processed the responses. self.subscribed is true.
  3. Unsubscribed: we've issued UNSUBSCRIBE commands for all subscriptions and processed the responses. self.subscribed is false.

When get_message passes the wait_for_subscribed check, we're in either the subscribed or semi-subscribed state. The main thread can trigger oscillations between these two states, but cannot cause a transition to unsubscribed on its own (the poller thread does that in handle_message by processing the unsubscription response). So when get_message is reading the socket, we are guaranteed not to be in the unsubscribed state. On the other hand, execute_command only runs the health check in the unsubscribed state, and remains in that state for the duration of the health check.
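The three states and the per-thread transitions in the argument above can be modelled as a small table (illustrative only; these names do not exist in redis-py):

```python
from enum import Enum, auto


class PubSubState(Enum):
    SUBSCRIBED = auto()       # there are subscriptions; self.subscribed is True
    SEMI_SUBSCRIBED = auto()  # UNSUBSCRIBE sent, replies not yet processed
    UNSUBSCRIBED = auto()     # UNSUBSCRIBE replies processed; self.subscribed is False


# Which thread may drive each transition, per the reasoning above:
TRANSITIONS = {
    ("main", PubSubState.SUBSCRIBED): PubSubState.SEMI_SUBSCRIBED,      # sends UNSUBSCRIBE
    ("main", PubSubState.SEMI_SUBSCRIBED): PubSubState.SUBSCRIBED,      # re-subscribes
    ("main", PubSubState.UNSUBSCRIBED): PubSubState.SUBSCRIBED,         # subscribes again
    ("poller", PubSubState.SEMI_SUBSCRIBED): PubSubState.UNSUBSCRIBED,  # processes reply
}

# The key invariant: no ("main", ...) entry maps into UNSUBSCRIBED, so the
# main thread alone can never move the connection into the unsubscribed state
# while the poller thread is reading the socket.
main_targets = {v for (who, _), v in TRANSITIONS.items() if who == "main"}
assert PubSubState.UNSUBSCRIBED not in main_targets
```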

I think there might still be a bug where PubSub.check_health can run in semi-subscribed state, but I'll file that separately if I manage to reproduce it.

@codecov-commenter commented Nov 23, 2021

Codecov Report

Merging #1737 (3cf820f) into master (748c8d1) will decrease coverage by 0.03%.
The diff coverage is 92.06%.


@@            Coverage Diff             @@
##           master    #1737      +/-   ##
==========================================
- Coverage   94.29%   94.26%   -0.04%     
==========================================
  Files          74       75       +1     
  Lines       15696    15942     +246     
==========================================
+ Hits        14801    15028     +227     
- Misses        895      914      +19     
Impacted Files Coverage Δ
redis/client.py 89.46% <85.29%> (-0.38%) ⬇️
tests/test_pubsub.py 99.75% <100.00%> (+0.01%) ⬆️
redis/__init__.py 90.47% <0.00%> (-9.53%) ⬇️
redis/commands/json/commands.py 88.88% <0.00%> (-6.07%) ⬇️
tests/conftest.py 90.14% <0.00%> (-2.75%) ⬇️
redis/commands/core.py 89.92% <0.00%> (-0.10%) ⬇️
redis/cluster.py 90.23% <0.00%> (-0.07%) ⬇️
setup.py 0.00% <0.00%> (ø)
tests/test_json.py 100.00% <0.00%> (ø)
... and 7 more


@barshaul (Contributor, Author) commented:
Using issue #1740 I found a bug in this fix:

#!/usr/bin/env python3

import threading
import time

from redis import Redis


def poll(ps, event=None):
    print(ps.get_message(timeout=5))
    event.wait()
    while True:
        message = ps.get_message(timeout=5)
        if message is not None:
            print(message)
        else:
            break

def main():
    r = Redis.from_url("redis://localhost", health_check_interval=1)
    ps = r.pubsub()
    ps.subscribe("foo")

    event = threading.Event()
    poller = threading.Thread(target=poll, args=(ps, event))
    poller.start()

    time.sleep(2)
    event.set()
    ps.unsubscribe("foo")
    time.sleep(1)
    ps.subscribe("foo")
    poller.join()

while True:
    main()

If the UNSUBSCRIBE response is received before the PING response: get_message will read the unsubscribe response, set the subscribed_event flag to False, and from then on return None until a new subscription is made. However, at this point the b"redis-py-health-check" response is still queued on the socket. So if we call subscribe again, execute_command runs a health check first (since now self.subscribed == False) and an error occurs:
redis.exceptions.ConnectionError: Bad response from PING health check
because the response it reads is b"redis-py-health-check".
This issue can be fixed by running the health check from execute_command only on the first command execution.

I will work on publishing a fix for both.

@barshaul (Contributor, Author) commented:
@bmerry A fix was added for #1740 and to the bug I mentioned in the comment above.

@bmerry (Contributor) left a comment

I think the overall approach looks safe now, and I'm unable to crash it with my tests. I've made a suggestion that will fix a new timing-dependent crash.

It is unfortunate that the sequence [subscribe, get_message*, unsubscribe, sleep, subscribe] will no longer benefit from a health check on the later subscribe. In the application where I run into these issues, I have a wrapper class that keeps one permanent PubSub object around and uses it when it wants to wait for some message to be published.

@barshaul (Contributor, Author) commented Nov 30, 2021

Added a new solution so we can run more than one health check from execute_command.

clean_health_check_responses will be called from execute_command, before sending the command or initiating a health check, only if not self.subscribed.
not self.subscribed can be true in two cases:

  1. The first time we subscribe: we know for sure that the socket is clean, so clean_health_check_responses will return immediately.
  2. In a subscribe -> unsubscribe -> subscribe sequence. Just after 'unsubscribe' was executed, there are two options:
    • get_message() didn't execute a health check before pulling the 'unsubscribe' response: it only had the 'unsubscribe' response to pull, and then it set subscribed to False. If 'subscribe' is called before the 'unsubscribe' response is processed, self.subscribed is still True, so execute_command will not try to clean the socket. If 'subscribe' is called after subscribed was set to False, execute_command will call clean_health_check_responses, which returns immediately since the socket is empty (get_message() hasn't done a health check). Either way, execute_command can perform a health check with no problems.
    • get_message() did execute a health check before pulling the 'unsubscribe' response. In this case there are two possible orders for the responses on the socket:
      1. The health check response is queued before the unsubscribe response. The health check response will be processed first and PubSub.parse_response will ignore it, then retrieve the unsubscribe response. 'subscribe' can now execute a health check with no issue.
      2. The unsubscribe response is queued before the health check response. get_message() will set the subscribed event to False and stop pulling messages, leaving the health check response unread on the socket. At this stage, if subscribe is executed, self.subscribed is False, so clean_health_check_responses will pull the health check response and return. Then a health check can be issued by execute_command.

I couldn't find a scenario in which we'd clean a response that isn't a health check response. However, I added exception throwing in case it does happen, to make debugging easier.
@bmerry, please verify and let me know what you think.
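A rough sketch of the cleaning logic described above, under assumptions: the real clean_health_check_responses in redis-py differs, and `can_read`/`read_response` here are simplified stand-ins for the connection interface:

```python
HEALTH_CHECK_MESSAGE = b"redis-py-health-check"


def clean_health_check_responses(conn, pending, ttl=10):
    """Drain up to `pending` leftover health-check replies from the socket.

    `conn.can_read(timeout)` and `conn.read_response()` are hypothetical
    stand-ins for the connection API. Anything other than a health-check
    reply raises, to make unexpected states easy to debug.
    """
    while pending > 0 and ttl > 0:
        if conn.can_read(timeout=conn.socket_timeout):
            response = conn.read_response()
            if response != HEALTH_CHECK_MESSAGE:
                raise RuntimeError(
                    "A non health check response was cleaned by "
                    f"execute_command: {response!r}"
                )
            pending -= 1
        ttl -= 1
    # Any replies still outstanding (e.g. delayed by the server) stay pending.
    return pending
```

With this shape, the subscribe -> unsubscribe -> subscribe sequence drains the stale health-check reply before execute_command issues its own PING.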

@barshaul (Contributor, Author) commented Dec 2, 2021

> (quoting the Nov 30 comment above)

@bmerry, @chayim I would appreciate a look if you have time.

@bmerry (Contributor) commented Dec 6, 2021

@bmerry, @chayim I would appreciate a look if you have time.

I'll take a look now.

@bmerry (Contributor) left a comment

I think the design looks good now. Unfortunately I need to get my own work all sorted out before I go on holiday later this week, so I won't have time to thoroughly test this or to do further reviewing.

# Set the subscribed_event flag to True
self.subscribed_event.set()
# Clear the health check counter
self.health_check_response_counter = 0
Contributor:

Should this be done at the end of clean_health_check_response?

@barshaul (Contributor, Author) replied:

I don't want to put it there because the following scenario is possible:

  1. p.subscribe("foo")
  2. a health check is performed
  3. p.unsubscribe("foo")
  4. the health check response still hasn't been received
  5. p.unsubscribe("foo")
  6. clean_health_check_responses is called by the unsubscribe command; the health check response hasn't arrived yet, so it exits the loop when the TTL runs out
  7. the health check response is only now received
  8. p.subscribe() is called - self.subscribed is still False, so a health check will be performed, and we should clean the existing health check response before we continue
  9. If we reset the counter at the end of clean_health_check_responses, we would clear it in step 6, so we won't be able to clean the response from the socket in step 8

@chayim (Contributor) left a comment

Using the existing socket timeout throughout?

@chayim chayim added the bug Bug label Dec 23, 2021
@chayim chayim changed the title Resolving read race condition between pubsub's get_message() and execute_command() Fixing read race condition during pubsub Dec 23, 2021
@chayim chayim merged commit d6cb997 into redis:master Dec 23, 2021
@chayim chayim deleted the health_check_new branch December 23, 2021 09:42
Andrew-Chen-Wang added a commit to aio-libs-abandoned/aioredis-py that referenced this pull request Dec 24, 2021
Signed-off-by: Andrew-Chen-Wang <[email protected]>
Successfully merging this pull request may close these issues: "Another race condition in health checks and pubsub" and "Race condition in handling health checks for pub-sub".