[Ray] Reconstruct worker #2413
Besides all these comments, I'm wondering who the caller of `reconstruct_worker` is. I don't think it is meant for end users, since we cannot expect an end user to intervene in an individual worker in the cluster.
    event.set()
except asyncio.CancelledError:
    raise
except Exception:
What type of exceptions are ignored intentionally here?
The original logic ignored the exceptions and stopped monitoring sub pools, because no one awaits the task (the monitor task fails silently). This change logs the exceptions and continues monitoring sub pools, which I think is better than failing silently.
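The behavior described above can be sketched roughly as follows. This is a minimal illustration and not the actual Mars implementation: the `check_once` callback, the `interval` parameter, and the `demo` helper are hypothetical names introduced only for this example. The loop re-raises `asyncio.CancelledError` so the task can still be stopped, and logs any other exception before continuing.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def monitor_sub_pools(check_once, interval=0.01):
    """Hypothetical monitor loop: run check_once() repeatedly."""
    while True:
        try:
            await check_once()
        except asyncio.CancelledError:
            # Cancellation must propagate so the task can be stopped.
            raise
        except Exception:
            # Log and keep monitoring instead of letting the
            # un-awaited task die without anyone noticing.
            logger.exception("Monitoring sub pools failed, continuing")
        await asyncio.sleep(interval)


async def demo():
    calls = []

    async def flaky_check():
        # Assumed stand-in for a real health check: fails on the first call.
        calls.append(len(calls))
        if len(calls) == 1:
            raise RuntimeError("simulated failure")

    task = asyncio.create_task(monitor_sub_pools(flaky_check))
    await asyncio.sleep(0.1)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return len(calls), task.cancelled()
```

Running `asyncio.run(demo())` shows that the loop keeps calling the check after the first failure and that cancelling the task still works.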
Users need to be informed in time when unexpected errors occur. More work is needed; a simple error log is far from complete.
The caller of |
LGTM
LGTM
Merge branch merge_github_2524 of [email protected]:ray-project/mars.git into master
https://code.alipay.com/ray-project/mars/pull_requests/58?tab=diff
Signed-off-by: 捕牛 <[email protected]>
* [Ray] Support reconstructing worker (mars-project#2413)
* Make cmdline support third party modules (mars-project#2454)
  Co-authored-by: hanguang <[email protected]>
* Support visualizing subtask graphs on Mars Web (mars-project#2426)
* Fix timeout error when waiting for a submitted task (mars-project#2457)
* Print the error message when error happens in `TaskProcessor` (mars-project#2458)
* Add nightly builds for docker images (mars-project#2456)
* Fix misuse of `name` parameter in DataFrame align (mars-project#2469)
* Fix hang when start sub pool fails (mars-project#2468)
* Refine and unify subtask detail APIs (mars-project#2465)
* Fix coverage for Azure pipeline (mars-project#2474)
* Split tileable information and subtask graph into two tabs (mars-project#2480)
* Support specified vineyard socket and skip the launching vineyardd process (mars-project#2481)
* Basic reschedule subtask (mars-project#2467)
* Compatible with scikit-learn 1.0 (mars-project#2486)
  Co-authored-by: hekaisheng <[email protected]>
* Fix wrong translation in cluster deployment. (mars-project#2489)
* Fix bug that failed to execute query when there are multiple arguments (mars-project#2490)
* Include tileable property in detail api (mars-project#2493)
* Fix version of statsmodels to pass CI (mars-project#2497)
* Implements `glm.LogisticRegression` (mars-project#2466)
* Implements bagging sampling (mars-project#2496)
* Refine MarsDMatrix & support more parameters for XGB classifier and regressor (mars-project#2498)
* Fix output of df.groupby(as_index=False).size() (mars-project#2507)
* Add preliminary implementations for ufunc methods (mars-project#2510)
* Add doc for reading csv in oss (mars-project#2514)
* [Ray] Fix serializing lambdas in web (mars-project#2512)
* Add `make_regression` support for learn module (mars-project#2515)
* Fix reduction result on empty series (mars-project#2520)
* Fix df.loc when df is empty (mars-project#2524)
* fix start subpool
* fix test_kill_and_wait_timeout
* fix autoscale timeout
* fix ray larger clsuter fixture
* Update ci ray package to 1.2.2
* remove python3.6 3.8 .39 ut and upgrade ray 3.7 image
* echo python path
* fix json decode error
* fix bundle release timeout
* fix remove placement group timeout
* fix no_restart
* fix ci
* fix autoscale
What do these changes do?
* `ClusterAPI.reconstruct_worker` API.
* `ClusterAPI.reconstruct_worker` for ray.
* `FaultInjectionSubtaskProcessor` to `fault_injection_patch.py`
* `test_rerun_subtask_describe` case.
* `monitor_sub_pools` logs exceptions instead of stopping monitoring silently.
* `register_slot` API for `BandSlotControlActor`.

Related issue number