Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CHORE] Refactor RayRunner so that we can add tracing #3163

Merged
merged 3 commits into from
Nov 5, 2024

Conversation

jaychia
Copy link
Contributor

@jaychia jaychia commented Nov 1, 2024

This PR refactors the RayRunner so that it is easier to add tracing.

I also added more docstrings to make it clearer about what is happening.

Code Changes

I highlight changes that were made in the code for easier review.

  1. I removed the next_step state, and instead expose a new has_next return variable from self._construct_dispatch_batch. This cleans up the code because iteration on the physical plan now ONLY happens inside of self._construct_dispatch_batch instead of being scattered across the scheduling loop.
  2. I cleaned up the logic in self._await_tasks by explicitly waiting on one task (with timeout=None) to first wait for any task to complete, and then perform an actual wait on all tasks (with timeout=0.01) to actually retrieve tasks that are ready.
  3. I pulled out self._is_active and self._place_in_queue into methods, instead of them being locally-defined functions

@github-actions github-actions bot added the chore label Nov 1, 2024
Copy link

codspeed-hq bot commented Nov 1, 2024

CodSpeed Performance Report

Merging #3163 will not alter performance

Comparing jay/rayrunner-refactor (850f3db) with main (3cef614)

Summary

✅ 17 untouched benchmarks

if dispatches_allowed == 0 or next_step is None:
# Break the dispatch batching/dispatch loop if no more dispatches allowed, or physical plan
# needs work for forward progress
if dispatches_allowed == 0 or not has_next:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: behavior change here. We now use a has_next variable that is returned from self._construct_dispatch_batch to figure out whether or now we should break the DispatchBatching -> Dispatch loop.

Previously the scheduling loop had to keep track of next_step which was super weird and unwieldy, and caused us to call next(tasks) in a bunch of random places scattered throughout the loop.

num_returns=1,
timeout=None,
fetch_local=False,
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NOTE: slight behavior change here compared to previous awaiting logic

I call a ray.wait here but discard the outputs to just wait on one item to be ready with timeout=None.

Then I subsequently call ray.wait again with a 0.01 timeout to actually retrieve a batch of ready tasks.

I think this logic is a little easier to follow, and gets rid of the weird loop over ("next_one", "next_batch") that we had earlier. Also shouldn't have too much of a performance impact.

self.results_by_df[result_uuid].put(item, timeout=0.1)
break
except Full:
pass
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got rid of these weird locally-defined callbacks by making them methods, so that I can call them elsewhere without having to pass them around.

@jaychia jaychia requested review from kevinzwang and colin-ho and removed request for colin-ho November 1, 2024 22:36
Copy link

codecov bot commented Nov 1, 2024

Codecov Report

Attention: Patch coverage is 91.42857% with 6 lines in your changes missing coverage. Please review.

Project coverage is 78.61%. Comparing base (3cef614) to head (850f3db).
Report is 5 commits behind head on main.

Files with missing lines Patch % Lines
daft/runners/ray_runner.py 91.42% 6 Missing ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3163      +/-   ##
==========================================
- Coverage   79.00%   78.61%   -0.40%     
==========================================
  Files         634      634              
  Lines       76943    77961    +1018     
==========================================
+ Hits        60789    61289     +500     
- Misses      16154    16672     +518     
Files with missing lines Coverage Δ
daft/runners/ray_runner.py 80.95% <91.42%> (-0.31%) ⬇️

... and 7 files with indirect coverage changes

Copy link
Member

@kevinzwang kevinzwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jaychia jaychia merged commit 64e35f8 into main Nov 5, 2024
42 checks passed
@jaychia jaychia deleted the jay/rayrunner-refactor branch November 5, 2024 22:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants