[Ray] Reconstruct worker #2413

Merged 14 commits on Sep 14, 2021

@fyrestone (Contributor) commented on Sep 2, 2021

What do these changes do?

  • Add the ClusterAPI.reconstruct_worker API.
  • Implement ClusterAPI.reconstruct_worker for Ray.
  • Add new Ray actor pool states: INIT, POOL_READY, SERVICE_READY.
  • Move FaultInjectionSubtaskProcessor to fault_injection_patch.py.
  • Make the kill and wait logic for Ray more reliable.
  • Add the test_rerun_subtask_describe case.
  • Log exceptions in monitor_sub_pools instead of silently stopping monitoring.
  • Add the register_slot API for BandSlotControlActor.
  • Fix a bug where one slot could be assigned more than one subtask when a subtask is rerun.
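
The three pool states listed above read like a startup sequence. A minimal sketch of that idea, assuming the states are reached strictly in the order listed: the `PoolState` enum, the `advance` helper, and the ordering rule are hypothetical illustrations, not code from this PR.

```python
from enum import IntEnum


class PoolState(IntEnum):
    # State names come from the PR description; the numeric order is an assumption.
    INIT = 0
    POOL_READY = 1
    SERVICE_READY = 2


def advance(current: PoolState, target: PoolState) -> PoolState:
    """Move to `target`, allowing only the next state in the assumed order."""
    if target != current + 1:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Under this assumption, a caller that must wait for services can compare states directly, for example `state >= PoolState.SERVICE_READY`.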

Related issue number

@fyrestone self-assigned this on Sep 2, 2021
@fyrestone marked this pull request as ready for review on September 8, 2021 07:46
@wjsi (Member) left a comment

Besides all these comments, I'm wondering who the caller of reconstruct_worker is. I don't think it is meant for end users, since we cannot expect an end user to intervene in an individual worker in the cluster.

mars/deploy/oscar/ray.py (outdated, resolved)
mars/deploy/oscar/ray.py (resolved)
        event.set()
    except asyncio.CancelledError:
        raise
    except Exception:
Member

What type of exceptions are ignored intentionally here?

Contributor Author

The original logic ignored exceptions and stopped monitoring the sub pools, because nothing awaits the task (the monitor task failed silently). This change logs the exceptions and keeps monitoring the sub pools, which I think is better than failing silently.
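
To illustrate the behavior described here, a minimal sketch of a log-and-continue monitor loop. The names `monitor_loop`, `check`, and `interval` are hypothetical and this is not the actual monitor_sub_pools code; it only mirrors the except clauses quoted in the diff above.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


async def monitor_loop(check, interval: float = 0.01) -> None:
    """Call `check()` forever; log failures instead of letting the task die."""
    while True:
        try:
            await check()
        except asyncio.CancelledError:
            # Cancellation must propagate so the task can still be stopped.
            raise
        except Exception:
            # Without this branch, a task that nobody awaits would end here
            # and the failure would go unnoticed.
            logger.exception("monitor check failed; continuing")
        await asyncio.sleep(interval)
```

The loop only exits through cancellation, so a single failing check no longer ends monitoring for the remaining sub pools.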

Member

Users need to be informed promptly when unexpected errors occur. More work is needed; a simple error log is far from complete.

mars/services/scheduling/worker/execution.py (resolved)
mars/services/scheduling/worker/workerslot.py (outdated, resolved)
mars/deploy/oscar/tests/test_fault_injection.py (outdated, resolved)
@fyrestone (Contributor, Author) commented

> Besides all these comments, I'm wondering who the caller of reconstruct_worker is. I don't think it is meant for end users, since we cannot expect an end user to intervene in an individual worker in the cluster.

The caller of reconstruct_worker should be monitoring logic in the supervisor. It's an API for operating cluster workers, so I think it belongs next to request_worker and release_worker.
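
A rough sketch of that placement, with assumed signatures. `ClusterAPISketch`, the `address` parameter, and `recover_dead_workers` are hypothetical names for illustration; the real ClusterAPI methods may take different arguments.

```python
import abc
import asyncio
from typing import List


class ClusterAPISketch(abc.ABC):
    """Hypothetical stand-in; the real ClusterAPI signatures may differ."""

    @abc.abstractmethod
    async def request_worker(self) -> str:
        """Ask the cluster backend for a new worker and return its address."""

    @abc.abstractmethod
    async def release_worker(self, address: str) -> None:
        """Give a worker back to the cluster backend."""

    @abc.abstractmethod
    async def reconstruct_worker(self, address: str) -> None:
        """Rebuild a failed worker in place, keeping its address."""


async def recover_dead_workers(api: ClusterAPISketch, dead: List[str]) -> List[str]:
    """Supervisor-side helper: reconstruct every worker reported as dead."""
    reconstructed = []
    for address in dead:
        await api.reconstruct_worker(address)
        reconstructed.append(address)
    return reconstructed
```

Keeping the three operations on one interface means the supervisor's monitoring code depends only on the abstract API, not on the Ray-specific implementation.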

@wjsi (Member) left a comment

LGTM

@qinxuye (Collaborator) left a comment

LGTM

@qinxuye merged commit c0969a1 into mars-project:master on Sep 14, 2021
chaokunyang added a commit to chaokunyang/mars that referenced this pull request May 31, 2022
Merge branch merge_github_2524 of [email protected]:ray-project/mars.git into master
https://code.alipay.com/ray-project/mars/pull_requests/58?tab=diff

Signed-off-by: 捕牛 <[email protected]>


* [Ray] Support reconstructing worker (mars-project#2413)

* Make cmdline support third party modules (mars-project#2454)

Co-authored-by: hanguang <[email protected]>
* Support visualizing subtask graphs on Mars Web (mars-project#2426)

* Fix timeout error when waiting for a submitted task (mars-project#2457)

* Print the error message when error happens in `TaskProcessor` (mars-project#2458)

* Add nightly builds for docker images (mars-project#2456)

* Fix misuse of `name` parameter in DataFrame align (mars-project#2469)

* Fix hang when start sub pool fails (mars-project#2468)

* Refine and unify subtask detail APIs (mars-project#2465)

* Fix coverage for Azure pipeline (mars-project#2474)

* Split tileable information and subtask graph into two tabs (mars-project#2480)

* Support specified vineyard socket and skip the launching vineyardd process (mars-project#2481)

* Basic reschedule subtask (mars-project#2467)

* Compatible with scikit-learn 1.0 (mars-project#2486)

Co-authored-by: hekaisheng <[email protected]>
* Fix wrong translation in cluster deployment. (mars-project#2489)

* Fix bug that failed to execute query when there are multiple arguments (mars-project#2490)

* Include tileable property in detail api (mars-project#2493)

* Fix version of statsmodels to pass CI (mars-project#2497)

* Implements `glm.LogisticRegression` (mars-project#2466)

* Implements bagging sampling (mars-project#2496)

* Refine MarsDMatrix & support more parameters for XGB classifier and regressor (mars-project#2498)

* Fix output of df.groupby(as_index=False).size() (mars-project#2507)

* Add preliminary implementations for ufunc methods (mars-project#2510)

* Add doc for reading csv in oss (mars-project#2514)

* [Ray] Fix serializing lambdas in web (mars-project#2512)

* Add `make_regression` support for learn module (mars-project#2515)

* Fix reduction result on empty series (mars-project#2520)

* Fix df.loc when df is empty (mars-project#2524)

* fix start subpool

* fix test_kill_and_wait_timeout

* fix autoscale timeout

* fix ray larger cluster fixture

* Update ci ray package to 1.2.2

* remove python3.6 3.8 3.9 ut and upgrade ray 3.7 image

* echo python path

* fix json decode error

* fix bundle release timeout

* fix remove placement group timeout

* fix no_restart

* fix ci

* fix autoscale

3 participants