
Client now retries .map() failures #2734

Open · wants to merge 13 commits into main
Conversation

@rculbertson (Contributor) commented Jan 7, 2025

Describe your changes

Closes SVC-180.

Adds client retries to .map(). We already did this for .remote() in #2403.

For now, retries are enabled only if the MODAL_CLIENT_RETRIES flag is set to true.
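Feature gates like this are typically just an environment-variable check. A minimal sketch, assuming the flag is read from the environment (the helper name and accepted spellings are illustrative, not the PR's actual code):

```python
import os

def client_retries_enabled() -> bool:
    # Treat common truthy spellings as enabling the feature flag.
    value = os.environ.get("MODAL_CLIENT_RETRIES", "")
    return value.strip().lower() in ("1", "true", "yes")
```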


Check these boxes or delete any item (or this section) if not relevant for this PR.

  • Client+Server: this change is compatible with old servers
  • Client forward compatibility: this change ensures client can accept data intended for later versions of itself

Note on protobuf: protobuf message changes in one place may impact
multiple entities (client, server, worker, database). See points above.


Changelog

rohansingh and others added 10 commits January 8, 2025 18:42
When two items had the same timestamp, we would try to sort by the
actual item value, which breaks for types that don't support comparison.

Instead use a nonce when inserting an item, to ensure that we never have
to compare the item value itself.
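The nonce approach described in this commit can be sketched with a monotonically increasing counter as the tie-breaker, so the heap never compares the payload itself (hypothetical class and names, not the PR's implementation):

```python
import heapq
import itertools

class TimestampQueue:
    """Min-heap keyed on (timestamp, nonce); the item itself is never compared."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # monotonically increasing nonce

    def put(self, timestamp: float, item) -> None:
        # The nonce breaks ties between equal timestamps, so heapq never
        # falls through to comparing `item` (which may not support `<`).
        heapq.heappush(self._heap, (timestamp, next(self._counter), item))

    def pop(self):
        timestamp, _nonce, item = heapq.heappop(self._heap)
        return timestamp, item
```

Without the nonce, two entries with equal timestamps would force `heapq` to compare the items, which raises `TypeError` for types like dicts.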
Though very unlikely outside of unit tests, it's possible to have an output
returned before the corresponding retry context has been put into the
`pending_outputs` dict.
Once the input queue filled up, we had no more room to put pending
retries. And since we had no more room to put retries, we stopped
fetching new outputs. And since we stopped fetching new outputs, the
server stopped accepting new inputs.

As a result, the input queue would never burn down.

Instead, use a semaphore to ensure we never have more than 1000 items
outstanding.
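A minimal sketch of the semaphore approach, assuming the limit and function names are illustrative (the PR caps in-flight inputs at 1000):

```python
import asyncio

async def pump_inputs(inputs, send_one, max_outstanding: int = 1000):
    """Send inputs concurrently, but never more than max_outstanding in flight."""
    semaphore = asyncio.Semaphore(max_outstanding)

    async def send_one_bounded(item):
        # The semaphore slot is released only once the input completes,
        # so output fetching never stalls and the queue can burn down.
        async with semaphore:
            await send_one(item)

    await asyncio.gather(*(send_one_bounded(item) for item in inputs))
```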
Instead of using a priority queue, just use the event loop to schedule
retries in the future. This significantly simplifies the implementation
and makes it much more like the original.

Note that we still do have a semaphore that ensures that no more than 1K
inputs are in flight (i.e., sent to the server but not completed).
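Scheduling retries directly on the event loop, rather than through a priority queue, can be sketched as below (function names and the backoff schedule are assumptions, not the PR's actual code):

```python
import asyncio

async def submit_with_retries(attempt_fn, delays=(1.0, 2.0, 4.0)):
    """Retry a coroutine by sleeping on the event loop between attempts."""
    try:
        return await attempt_fn()
    except Exception:
        for delay in delays:
            # The event loop itself handles the scheduling; no queue needed.
            await asyncio.sleep(delay)
            try:
                return await attempt_fn()
            except Exception:
                continue
        raise  # re-raise the original failure after exhausting retries
```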
@rculbertson rculbertson marked this pull request as ready for review January 8, 2025 19:54
@rculbertson rculbertson changed the title WIP Client retries for .map() Client now retries .map() failures Jan 8, 2025
@rohansingh (Contributor) commented:
There were some unit tests on the priority queue that could be restored:
2c1f62c#diff-80ca24dfa2bbe913c31c13fd90c8c0a2cef7cb4ba98e66ca0c5eca38d94c7732

if timestamp_seconds == self._MAX_PRIORITY:
    return None
await self._queue.put((timestamp_seconds, idx))
await asyncio.sleep(1)
@rohansingh (Contributor) commented Jan 8, 2025
I think using an asyncio.Condition like the previous implementation is better than asyncio.sleep(1) since the latter becomes a busy wait.

# wait until either the timeout or a new item is added
try:
    await asyncio.wait_for(self.condition.wait(), timeout=sleep_time)
except asyncio.TimeoutError:
    continue
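A fuller sketch of the Condition-based wait (a hypothetical wrapper class, not the PR's code; note that `Condition.wait()` must be called while the condition's lock is held):

```python
import asyncio

class RetryWaiter:
    """Wait out a retry delay, but wake early when a new item arrives."""

    def __init__(self):
        self.condition = asyncio.Condition()

    async def notify_new_item(self):
        async with self.condition:
            self.condition.notify_all()

    async def wait_until(self, sleep_time: float) -> bool:
        # Condition.wait() requires the underlying lock to be held; it
        # releases the lock while waiting and reacquires it on wakeup.
        async with self.condition:
            try:
                await asyncio.wait_for(self.condition.wait(), timeout=sleep_time)
                return True   # woken early by a new item
            except asyncio.TimeoutError:
                return False  # delay elapsed; process the retry
```

Unlike a `sleep(1)` polling loop, nothing runs between wakeups here: the task blocks until either the timeout fires or another task notifies the condition.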

@rculbertson (Contributor, Author) replied:
ah good point. I'll bring back that original implementation.
