Use ray start block in Pod's entrypoint #77

Merged
merged 7 commits into from
Dec 2, 2021

Conversation

chenk008
Contributor

Signed-off-by: chenk008 [email protected]

Why are these changes needed?

When a Ray process (e.g. raylet, GCS) exits, the Pod should restart so that the Ray process can fail over.

Generate the Pod args in the form ray start --block. This PR also removes the test script ray-code, because when Ray starts with --block, ray-code is never executed.
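
As a rough illustration of the change, the sketch below assembles a ray start command line from a dictionary of start parameters, with block treated as a bare flag. It is written in Python for brevity. The function name build_ray_start_command and the convention that a value of "true" or None yields a bare flag are assumptions made for this example; the operator's actual implementation is not shown in this thread.

```python
def build_ray_start_command(params, head=False):
    """Assemble a `ray start` command string from a dict of start parameters.

    Hypothetical helper for illustration only. A value of None or "true"
    produces a bare flag such as --block; anything else becomes --key=value.
    """
    parts = ["ray", "start"]
    if head:
        parts.append("--head")
    for key, value in params.items():
        if value is None or value == "true":
            parts.append(f"--{key}")
        else:
            parts.append(f"--{key}={value}")
    return " ".join(parts)


# Parameter names and values mirror the worker settings quoted in this thread.
worker_cmd = build_ray_start_command(
    {
        "block": "true",
        "node-ip-address": "$MY_POD_IP",
        "address": "raycluster-complete-head-svc:6379",
    }
)
print(worker_cmd)
```

Dropping the block entry from the dictionary gives the non-blocking form, which then needs some other long-running command to keep the container alive.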

Related issue number

Close #62

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

Signed-off-by: chenk008 <[email protected]>
@chenk008 chenk008 changed the title use ray start block Use ray start block in Pod's entrypoint Oct 18, 2021
@chenk008 chenk008 requested review from akanso and Jeffwan and removed request for akanso October 18, 2021 14:07
@chenk008
Contributor Author

@akanso @Jeffwan ray-project/ray#19546 is merged. I think this PR is a workaround to support raylet failover until we get liveness exec probes.

@Jeffwan
Collaborator

Jeffwan commented Nov 26, 2021

@chenk008 Let's rebase the change and move this forward. As @chenk008 mentioned, here's another issue that ran into the same problem: #104

@chenk008
Contributor Author

chenk008 commented Nov 29, 2021

I have removed sample_code from the raycluster CRD samples. I think we can move the job submission to the raycluster CRD. Here is the related issue: #106

@Jeffwan
Collaborator

Jeffwan commented Nov 29, 2021

Sounds good. One last comment: since Ali raised the question, can we keep at least one example file that uses code.py (using a custom command to override the default)? Users will then know how to use the current solution to submit jobs, etc. What's more, if some users do not like the block approach, they will know how to change back to the sleep infinity approach. @chenk008

@akanso
Collaborator

akanso commented Nov 29, 2021

yes, that is a good idea, to have one example using the --block, and the others without it

@Jeffwan
Collaborator

Jeffwan commented Nov 30, 2021

@chenk008 Did you get a chance to verify the changes? I used the following steps to verify the restarts.

  1. Create a cluster using --block, and verify from the dashboard that there is 1 head and 1 worker
  2. Delete the head node and wait for the worker node to join the Ray cluster
  3. However, it seems that even though the connection is broken, the worker does not exit; it is still up.
k logs -f raycluster-complete-worker-small-group-q9bcs
[2021-11-30 06:45:41,315 W 8 8] global_state_accessor.cc:365: Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?


(base) ray@raycluster-complete-worker-small-group-q9bcs:~$ ls -al /tmp/ray/session_latest/logs/raylet.*
-rw-r--r-- 1 ray users    0 Nov 30 06:45 /tmp/ray/session_latest/logs/raylet.err
-rw-r--r-- 1 ray users 1691 Nov 30 06:45 /tmp/ray/session_latest/logs/raylet.out

raylet.out logs

k exec -it raycluster-complete-worker-small-group-q9bcs  bash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl exec [POD] -- [COMMAND] instead.
Defaulted container "machine-learning" out of: machine-learning, init-myservice (init)
(base) ray@raycluster-complete-worker-small-group-q9bcs:~$ tail -f /tmp/ray/session_latest/logs/raylet.out
[2021-11-30 06:45:41,320 I 17 17] store_runner.cc:46: Starting object store with directory /dev/shm, fallback /tmp/ray, and huge page support disabled
[2021-11-30 06:45:41,320 I 17 41] dlmalloc.cc:146: create_and_mmap_buffer(306053128, /dev/shm/plasmaXXXXXX)
[2021-11-30 06:45:41,321 I 17 17] grpc_server.cc:71: ObjectManager server started, listening on port 41749.
[2021-11-30 06:45:41,323 I 17 17] node_manager.cc:285: Initializing NodeManager with ID fcdf106611513fb2597f0d2ea55e12550f7cefb2f518004539d27272
[2021-11-30 06:45:41,323 I 17 17] grpc_server.cc:71: NodeManager server started, listening on port 36855.
[2021-11-30 06:45:41,326 I 17 50] agent_manager.cc:78: Monitor agent process with pid 49, register timeout 30000ms.
[2021-11-30 06:45:41,328 I 17 17] raylet.cc:100: Raylet of id, fcdf106611513fb2597f0d2ea55e12550f7cefb2f518004539d27272 started. Raylet consists of node_manager and object_manager. node_manager address: 10.244.0.21:36855 object_manager address: 10.244.0.21:41749 hostname: 10.244.0.21
[2021-11-30 06:45:41,334 I 17 17] service_based_accessor.cc:610: Received notification for node id = 79612b988e8802e35b4c2ab179b71c6d065c0be48cdbd27035d8b88a, IsAlive = 1
[2021-11-30 06:45:41,334 I 17 17] service_based_accessor.cc:610: Received notification for node id = fcdf106611513fb2597f0d2ea55e12550f7cefb2f518004539d27272, IsAlive = 1
[2021-11-30 06:45:42,664 I 17 17] agent_manager.cc:34: HandleRegisterAgent, ip: 10.244.0.21, port: 44559, pid: 49

settings

## Head
    Command:
      /bin/bash
      -c
      --
    Args:
      ulimit -n 65536; ray start --head  --redis-password=LetMeInRay  --object-store-memory=100000000  --port=6379  --node-manager-port=12346  --object-manager-port=12345  --dashboard-host=0.0.0.0  --node-ip-address=$MY_POD_IP  --num-cpus=1  --block

## Worker
    Command:
      /bin/bash
      -c
      --
    Args:
      ulimit -n 65536; ray start  --block  --node-ip-address=$MY_POD_IP  --redis-password=LetMeInRay  --address=raycluster-complete-head-svc:6379

Am I missing something here?
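
One way to check whether the worker container actually restarted during the steps above is to read status.containerStatuses[].restartCount from the Pod object. The sketch below is a minimal Python example of that lookup. The helper name container_restart_counts is hypothetical, and the sample Pod is hand-written to resemble kubectl get pod -o json output; its values are illustrative and not taken from the cluster in this thread.

```python
import json


def container_restart_counts(pod):
    """Map container name -> restartCount for a Pod object (parsed JSON)."""
    statuses = pod.get("status", {}).get("containerStatuses", [])
    return {status["name"]: status["restartCount"] for status in statuses}


# Hand-written sample; only the fields the helper reads are included.
sample_pod = json.loads(
    """
    {
      "metadata": {"name": "example-worker"},
      "status": {
        "containerStatuses": [
          {"name": "machine-learning", "restartCount": 0}
        ]
      }
    }
    """
)

print(container_restart_counts(sample_pod))
```

A count that stays at zero after the head is deleted would match the observation that the worker's main process never exited.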

@chenk008
Contributor Author

chenk008 commented Dec 2, 2021

yes, that is a good idea, to have one example using the --block, and the others without it

I think --block should be the default config. Without the --block flag, the Pod is useless once the raylet exits. The ability to fail over is a basic requirement.
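
To make the argument concrete: with a blocking entrypoint, the container's main process ends when a monitored subprocess dies, so Kubernetes can restart the container according to the Pod's restart policy. With sleep infinity, the main process never ends, whatever happens to the raylet. The sketch below mimics that kind of supervision in Python using stand-in child processes. It is not Ray's implementation, and block_until_child_exits is a hypothetical name chosen for this example.

```python
import subprocess
import sys
import time


def block_until_child_exits(commands, poll_interval=0.1):
    """Start each command, then block until any one child exits.

    Returns the exit code of the first child found to have stopped,
    after terminating the children that are still running.
    """
    children = [subprocess.Popen(cmd) for cmd in commands]
    try:
        while True:
            for child in children:
                code = child.poll()
                if code is not None:
                    return code
            time.sleep(poll_interval)
    finally:
        # Clean up: stop whatever is still running, then reap every child.
        for child in children:
            if child.poll() is None:
                child.terminate()
        for child in children:
            child.wait()


# Stand-ins for long-lived daemons: one keeps running, one fails quickly.
long_running = [sys.executable, "-c", "import time; time.sleep(60)"]
failing = [sys.executable, "-c", "import sys, time; time.sleep(0.2); sys.exit(1)"]

exit_code = block_until_child_exits([long_running, failing])
print("first child exit code:", exit_code)
```

An entrypoint built this way would pass the returned code to sys.exit, so the container terminates and the failure becomes visible to the kubelet.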

@akanso
Collaborator

akanso commented Dec 2, 2021

I am good with the PR.

Can we add a YAML comment in the file, e.g. # Without the --block flag ...

just to explain to the user of the example the impact of the --block

@chenk008
Contributor Author

chenk008 commented Dec 2, 2021

Yeah, I will add a yaml comment.

@chenk008
Contributor Author

chenk008 commented Dec 2, 2021

@Jeffwan I think there is some issue in ray core.

I ran the same test with ray:1.8. When the head node exited and restarted, the other raylet stayed alive but did not reconnect to the head node; only the raylet on the head node itself did. On the dashboard we can see only one host, which is the head.

The other raylet exits 14 minutes after the head exited. The log is shown below:

[2021-12-01 17:51:39,367 W 8 8] global_state_accessor.cc:427: Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
[2021-12-01 17:51:40,369 W 8 8] global_state_accessor.cc:427: Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?
2021-12-01 17:51:37,194	INFO scripts.py:740 -- Local node IP: 10.244.0.19
2021-12-01 17:51:41,372	SUCC scripts.py:748 -- --------------------
2021-12-01 17:51:41,372	SUCC scripts.py:749 -- Ray runtime started.
2021-12-01 17:51:41,372	SUCC scripts.py:750 -- --------------------
2021-12-01 17:51:41,372	INFO scripts.py:752 -- To terminate the Ray runtime, run
2021-12-01 17:51:41,372	INFO scripts.py:753 --   ray stop
2021-12-01 17:51:41,372	INFO scripts.py:757 -- --block
2021-12-01 17:51:41,372	INFO scripts.py:759 -- This command will now block until terminated by a signal.
2021-12-01 17:51:41,372	INFO scripts.py:761 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly.
2021-12-01 18:05:16,547	ERR scripts.py:769 -- Some Ray subprcesses exited unexpectedly:
2021-12-01 18:05:16,548	ERR scripts.py:776 -- raylet [exit code=1]
2021-12-01 18:05:16,548	ERR scripts.py:780 -- Remaining processes will be killed.

@chenk008 chenk008 merged commit 3102c53 into ray-project:master Dec 2, 2021
Jeffwan added a commit that referenced this pull request Mar 14, 2022
lowang-bh pushed a commit to lowang-bh/kuberay that referenced this pull request Sep 24, 2023
* use ray start block

Signed-off-by: chenk008 <[email protected]>

* add block into rayStartParams

* fix ut

* add block in sample config

* add sample without block

Co-authored-by: wuhua.ck <[email protected]>
Development

Successfully merging this pull request may close these issues.

[Feature] Restart worker pod when raylet exited
3 participants