Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Serve] Avoid looping over all snapshot ids for each long poll request #45881

Conversation

JoshKarpel
Copy link
Contributor

@JoshKarpel JoshKarpel commented Jun 11, 2024

Why are these changes needed?

See #45872 for background - TLDR we are trying to improve the scalability of the Serve Controller and I'm hunting down optimizations in the controller code.

In this case, it looks like there's an expensive operation happening on each long poll client request - see comments for the details.

Before, 1800 apps with one deployment each and maybe ~5k handles in the system, probably more:

image

After, 1800 apps with one deployment each, averaging ~7.5k active listen_for_change tasks on the controller, with both #45872 and #45878 as well:

image

controller-1800-apps-after-snapshot-fix-with-handles

Related issue number

Discovered while looking at Serve Controller flamegraphs for #45777

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

except KeyError:
# Initial snapshot id must be >= 0, so that the long poll client
# can send a negative initial snapshot id to get a fast update.
self.snapshot_ids[object_key] = 0
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was previously a random number https://github.com/ray-project/ray/pull/45881/files#diff-f138b21f7ddcd7d61c0b2704c8b828b9bbe7eb5021531e2c7fabeb20ec322e1aL195 but I couldn't figure out why - I can of course use a random number here but this seems simpler if that's not necessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shrekris-anyscale do you know why this was a random number / is that necessary?

Copy link
Contributor

@shrekris-anyscale shrekris-anyscale Jun 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We set it to a random number to handle an edge case when the controller crashes and recovers. Suppose the controller always starts with ID 0, and it's currently at ID 1. Suppose all the clients are also at ID 1. Now suppose:

  1. The ServeController (which runs the LongPollHost) crashes and restarts. The LongPollHost's snapshot IDs are reset to 0.
  2. All clients– except 1 slow client– reconnect to the LongPollHost on the ServeController. Their snapshot IDs are still 1, so the controller propagates an update, and all the connected clients' snapshot IDs are set back to 0.
  3. One of the connected clients sends an update to the LongPollHost using notify_changed. The controller bumps the snapshot ID to 1 and updates the connected clients.
  4. The slow client finally connects to the controller. Its ID is also 1, so it doesn't receive the update.

To correctly avoid this edge case, we should cache the snapshot IDs and restore them when the controller recovers. That's pretty complex though, so instead we initialize the snapshot IDs to a random number between 0 and 1,000,000. That makes this edge case very unlikely.

Could we switch back to using a default dict with the random number generator as the factory function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! That totally makes sense, and I will restore that behavior.

Could we switch back to using a default dict with the random number generator as the factory function?

Let me play around with this, I might be able to make it work

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I just remembered why I changed it do a plain dict - the problem is that snapshot_ids is a defaultdict, but object_snapshots isn't, so I was trying to get items from one that didn't exist in the other. Now I'm wondering if those two mappings can be combined, since they have the same keys 🤔

python/ray/serve/_private/long_poll.py Show resolved Hide resolved
Comment on lines -401 to +409
self.snapshot_ids[object_key] += 1
try:
self.snapshot_ids[object_key] += 1
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not a defaultdict anymore :(

@JoshKarpel JoshKarpel marked this pull request as ready for review June 12, 2024 21:03
Copy link
Contributor

@shrekris-anyscale shrekris-anyscale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for contributing! I left some comments.

python/ray/serve/_private/long_poll.py Show resolved Hide resolved
python/ray/serve/_private/long_poll.py Show resolved Hide resolved
Comment on lines 256 to 266
try:
existing_id = self.snapshot_ids[key]
except KeyError:
# The caller may ask for keys that we don't know about (yet),
# just ignore them.
continue

if existing_id != snapshot_id:
updated_objects[key] = UpdatedObject(
self.object_snapshots[key], existing_id
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can avoid raising and catching an error by using a conditional block.

Suggested change
try:
existing_id = self.snapshot_ids[key]
except KeyError:
# The caller may ask for keys that we don't know about (yet),
# just ignore them.
continue
if existing_id != snapshot_id:
updated_objects[key] = UpdatedObject(
self.object_snapshots[key], existing_id
)
# The caller may ask for keys that we don't know about (yet),
# just ignore them.
if key in self.snapshot_ids:
latest_id = self.snapshot_ids[key]
if latest_id != snapshot_id:
updated_objects[key] = UpdatedObject(
self.object_snapshots[key], latest_id
)

python/ray/serve/_private/long_poll.py Show resolved Hide resolved
python/ray/serve/_private/long_poll.py Outdated Show resolved Hide resolved
except KeyError:
# Initial snapshot id must be >= 0, so that the long poll client
# can send a negative initial snapshot id to get a fast update.
self.snapshot_ids[object_key] = 0
Copy link
Contributor

@shrekris-anyscale shrekris-anyscale Jun 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We set it to a random number to handle an edge case when the controller crashes and recovers. Suppose the controller always starts with ID 0, and it's currently at ID 1. Suppose all the clients are also at ID 1. Now suppose:

  1. The ServeController (which runs the LongPollHost) crashes and restarts. The LongPollHost's snapshot IDs are reset to 0.
  2. All clients– except 1 slow client– reconnect to the LongPollHost on the ServeController. Their snapshot IDs are still 1, so the controller propagates an update, and all the connected clients' snapshot IDs are set back to 0.
  3. One of the connected clients sends an update to the LongPollHost using notify_changed. The controller bumps the snapshot ID to 1 and updates the connected clients.
  4. The slow client finally connects to the controller. Its ID is also 1, so it doesn't receive the update.

To correctly avoid this edge case, we should cache the snapshot IDs and restore them when the controller recovers. That's pretty complex though, so instead we initialize the snapshot IDs to a random number between 0 and 1,000,000. That makes this edge case very unlikely.

Could we switch back to using a default dict with the random number generator as the factory function?

JoshKarpel and others added 3 commits June 21, 2024 14:19
Co-authored-by: shrekris-anyscale <[email protected]>
Signed-off-by: Josh Karpel <[email protected]>
Signed-off-by: Josh Karpel <[email protected]>
@shrekris-anyscale shrekris-anyscale added the go add ONLY when ready to merge, run all tests label Jul 1, 2024
Comment on lines 416 to 419
# They should also be randomized to try to avoid situations where,
# if the controller restarts and a client has a now-invalid snapshot id
# that happens to match what the controller restarts with,
# the client wouldn't receive an update.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of explaining the edge case here, could you link my comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

# if the controller restarts and a client has a now-invalid snapshot id
# that happens to match what the controller restarts with,
# the client wouldn't receive an update.
self.snapshot_ids[object_key] = random.randint(0, 1_000_000)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that this isn't a defaultdict anymore, could you double check whether any other places we access self.snapshot_ids needs this KeyError check?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see four uses:

The first two are in

try:
self.snapshot_ids[object_key] += 1
except KeyError:
# Initial snapshot id must be >= 0, so that the long poll client
# can send a negative initial snapshot id to get a fast update.
# They should also be randomized to try to avoid situations where,
# if the controller restarts and a client has a now-invalid snapshot id
# that happens to match what the controller restarts with,
, which are this block that's protected by the KeyError check.

The next is

try:
existing_id = self.snapshot_ids[key]
except KeyError:
# The caller may ask for keys that we don't know about (yet),
# just ignore them.
# This can happen when, for example,
# a deployment handle is manually created for an app
# that hasn't been deployed yet (by bypassing the safety checks).
continue
, which is also protected by a KeyError check.

The last is

else:
updated_object_key: str = async_task_to_watched_keys[done.pop()]
updated_object = {
updated_object_key: UpdatedObject(
self.object_snapshots[updated_object_key],
self.snapshot_ids[updated_object_key],
)
}
self._count_send(updated_object)
return updated_object
. That block runs when a key changes, which is triggered by
logger.debug(f"LongPollHost: Notify change for key {object_key}.")
, which is below that first block which ensure the snapshot id is present in the mapping, so I think we're good on that one too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks for being thorough here!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My pleasure!

Signed-off-by: Josh Karpel <[email protected]>
Copy link
Contributor

@shrekris-anyscale shrekris-anyscale left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Premerge is failing. Could you merge master and retry?

@JoshKarpel
Copy link
Contributor Author

JoshKarpel commented Jul 1, 2024

Premerge is failing. Could you merge master and retry?

Hmmm, I'm seeing the same failure on my unrelated PR #45939 - looks like maybe CI is busted?

@shrekris-anyscale
Copy link
Contributor

shrekris-anyscale commented Jul 1, 2024

I'm also seeing a similar failure on master (link). I can't merge this PR until it passes premerge. Someone is likely fixing the issue now. Let me check in with the team, and see if there's an ETA.

@shrekris-anyscale
Copy link
Contributor

We're working on a fix here: #46361.

@JoshKarpel
Copy link
Contributor Author

@shrekris-anyscale looks like CI is passing again - mind taking another look at this PR? Thanks in advance!

@shrekris-anyscale shrekris-anyscale merged commit 6beeacb into ray-project:master Jul 8, 2024
6 checks passed
@JoshKarpel
Copy link
Contributor Author

Thanks!

@JoshKarpel JoshKarpel deleted the optimize-outdated-snapshot-id-handling branch July 8, 2024 18:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants