Trouble scaling cudf merge benchmark with UCX at ~64 nodes on NERSC Perlmutter #930
Comments
Thanks @lastephey for the very detailed report.
The error seems to be coming from the mlx5 interface, therefore I think the RMM pool size should be ok (assuming all GPUs in the cluster have more than 35GB).
Based on the link you posted and on our experience, memlock has always been problematic; on every system where we use UCX we had to ask IT to set it to unlimited. Would you be able to confirm the state of memlock on Perlmutter, and perhaps set it to unlimited? If you don't have permission to set it yourself, it may be worth checking with IT whether it can be set to unlimited. I have also asked internally whether people have other ideas about what this could be, but NVIDIA folks are OOO today and tomorrow, so I may not have further details until Monday. |
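For reference, a minimal way to check the current memlock limit from inside a Python process on a compute node (a convenience sketch assuming a Linux node; not something taken from this thread):

```python
# Report the locked-memory (memlock) limit seen by this process.
# RLIM_INFINITY corresponds to "ulimit -l unlimited", which is what UCX
# generally wants on InfiniBand systems.
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

def fmt(value):
    return "unlimited" if value == resource.RLIM_INFINITY else f"{value} bytes"

print(f"memlock soft limit: {fmt(soft)}")
print(f"memlock hard limit: {fmt(hard)}")
```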
Hi @pentschev, Thanks for your quick reply. We don't have a
I'm not sure if the Yes, all of our GPUs are A100s with 40 GB each, so it sounds like we should be fine in terms of GPU memory. Thanks for the information. |
I clarified with our systems team that this means memlock is already set to unlimited. |
I checked internally and it seems like the usual UCX workaround for that would be to set
It seems another alternative would be increasing
Could you check whether it is possible to increase that? |
I was working on #834 to get around the fork/spawn issue. I can probably pick this back up next week. |
I think that multiprocessing's forkserver might help here. A small testcase suggests this may work. Consider:

```python
import sys
from multiprocessing import forkserver

from distributed.utils import mp_context

if __name__ == "__main__":
    if len(sys.argv) > 1:
        # Start the fork server before anything touches UCX/InfiniBand.
        forkserver.ensure_running()
    print(mp_context)
    from dask_cuda import LocalCUDACluster

    cluster = LocalCUDACluster(protocol="ucx", CUDA_VISIBLE_DEVICES=[0])
```

If I do:
Which is at least minimal progress. It looks like there are a few other places where a multiprocessing context is created:
|
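As an aside, the mechanics the testcase above relies on can be sketched in isolation (a hedged illustration, not code from the thread): pin the global start method to forkserver and start the fork server while the process is still "clean", i.e. before any UCX/InfiniBand state exists.

```python
# Inspect and pin the global multiprocessing start method; this snippet only
# demonstrates the mechanics behind the forkserver approach discussed above.
import multiprocessing as mp
from multiprocessing import forkserver

if __name__ == "__main__":
    print(mp.get_start_method(allow_none=True))  # None until a method is chosen
    mp.set_start_method("forkserver")
    forkserver.ensure_running()                  # start the fork server early, while "clean"
    print(mp.get_start_method())                 # -> "forkserver"
```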
@wence- would you be willing to open a PR in Distributed to add a similar config for the hardcoded value in |
Presume s/Distributed/UCX-Py/, but yes, can do. |
Sorry, for a moment I confused |
If that ends up working for @lastephey we should figure out a more principled way of doing this. The import-order dance seems sufficiently delicate that I couldn't figure out how to set the dask.distributed config via code (rather than environment variables) in a way that worked. |
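For what it's worth, one sketch of the environment-variable route (the ordering here is an assumption on my part, not something verified in the thread): set the `DASK_*` variable in `os.environ` before dask/distributed are first imported, since dask's config layer reads the environment when it is imported.

```python
# Set the env var before the first dask/distributed import so the value is
# picked up by dask.config without relying on delicate import-order tricks.
import os

os.environ["DASK_DISTRIBUTED__WORKER__MULTIPROCESSING_METHOD"] = "forkserver"

import dask         # noqa: E402  (imported only after the environment is prepared)
import distributed  # noqa: E402,F401  (registers distributed's config defaults)

print(dask.config.get("distributed.worker.multiprocessing-method"))  # -> "forkserver"
```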
I can give @wence-'s example a shot. Since we are not using |
Yes, the |
We've seen issues with forking pop up in an MPI context too (more info here). Our suggestion has been to set
Do you think that would be helpful here as well? Re: @wence-'s solution, it seems like we'll have to launch the cluster with the API rather than the CLI. Is that a fair assessment? |
Sorry, maybe I spoke too soon. Looks like I can go in and modify |
I am not totally sure those variables will be necessary. The link mentions problems with Python subprocessing in mpi4py, and subprocessing also applies to Dask, so it is possible they will help. However, there aren't many more details on what the problems are exactly, so it is difficult to say for sure. Perhaps, if resources and time permit, try both and see what happens? There's also the following comment on that link:
Judging by that, it seems we will need a more robust solution that does not lead to Dask processes forking/spawning. @quasiben's solution from #834 may come in handy there, but it may only be an initial step and we may still need more work. |
I am not sure if the scheduler forks/spawns, so maybe it isn't needed there. Also note that the Dask-CUDA benchmarks use |
FWIW, the forkserver approach is robust (it isolates the forker from the IB stack, and I have existence proofs running on big machines); one just has to track down all the sequencing to ensure that all relevant processes set things up correctly. |
My small 4 node tests with the forkserver completed, so no problems there. Just have to wait in the queue for my larger jobs to complete. |
I didn't mean to say that the forkserver is not robust enough for this, I only meant to say that we may be able to come up with a solution where we don't need to spawn new processes at all, which would hopefully make things clearer. |
My 32 node test completed successfully, but my 64 node test failed (job log here). I think the important part of the log is around line 4083490:
I did add the multiprocessing forkserver to
I'll try testing with those two additional environment variables just in case they are helpful, since the UCX error is still related to forking. |
In the case with
and the previous forkserver configuration, the 64 node job still failed, but with different errors than with the forkserver alone (job log here). I couldn't find any instance of
Some of the more serious-looking errors are:
on line 3250600 of the job log
on line 3250647 of the job log and
on line 3252406 of the job log. |
|
Actually, scratch that. In the logs I notice that we have |
That looked right; unfortunately we also needed to ensure that dask/distributed use the forkserver multiprocessing context. Can you add the following?

```diff
diff -u nersc-ben-benchmark.sh.orig nersc-ben-benchmark.sh
--- nersc-ben-benchmark.sh.orig 2022-06-15 12:58:51.000000000 +0100
+++ nersc-ben-benchmark.sh 2022-06-15 12:59:28.000000000 +0100
@@ -37,6 +37,7 @@
UCX_TCP_MAX_CONN_RETRIES=255 \
DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=3600s \
DASK_DISTRIBUTED__COMM__TIMEOUTS__TCP=3600s \
+DASK_DISTRIBUTED__WORKER__MULTIPROCESSING_METHOD=forkserver \
python -m distributed.cli.dask_scheduler \
--protocol $protocol \
--interface hsn0 \
@@ -66,6 +67,7 @@
UCX_TCP_MAX_CONN_RETRIES=255 \
DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=3600s \
DASK_DISTRIBUTED__COMM__TIMEOUTS__TCP=3600s \
+ DASK_DISTRIBUTED__WORKER__MULTIPROCESSING_METHOD=forkserver \
UCX_MAX_RNDV_RAILS=1 \
UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda \
UCX_MEMTYPE_CACHE=n \
@@ -79,6 +81,7 @@
UCX_TCP_MAX_CONN_RETRIES=255 \
DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=3600s \
DASK_DISTRIBUTED__COMM__TIMEOUTS__TCP=3600s \
+ DASK_DISTRIBUTED__WORKER__MULTIPROCESSING_METHOD=forkserver \
UCX_MAX_RNDV_RAILS=1 \
UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda \
UCX_MEMTYPE_CACHE=n \
@@ -100,6 +103,7 @@
echo "Client start: $(date +%s)"
DASK_DISTRIBUTED__COMM__TIMEOUTS__CONNECT=3600s \
DASK_DISTRIBUTED__COMM__TIMEOUTS__TCP=3600s \
+ DASK_DISTRIBUTED__WORKER__MULTIPROCESSING_METHOD=forkserver \
UCX_MAX_RNDV_RAILS=1 \
UCX_TCP_MAX_CONN_RETRIES=255 \
UCX_MEMTYPE_REG_WHOLE_ALLOC_TYPES=cuda \

Diff finished. Wed Jun 15 13:02:52 2022
```
|
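Once the scheduler and workers are up, something like the following (a sketch, not from the thread; the `scheduler.json` path is hypothetical) can confirm from the client which multiprocessing method each process actually picked up, which makes a missing environment variable easy to spot:

```python
# Query the multiprocessing-method config value on the scheduler and on every
# worker of a running cluster.
from distributed import Client

def report_mp_method():
    import dask
    return dask.config.get("distributed.worker.multiprocessing-method")

client = Client(scheduler_file="scheduler.json")   # hypothetical scheduler-file path
print(client.run_on_scheduler(report_mp_method))   # scheduler's value
print(client.run(report_mp_method))                # {worker address: value, ...}
```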
With dask/distributed#6580 and #933 you should be able to run any of the dask-cuda benchmarks with |
Ok, I reran with the
The 64 node job failed. The output looks similar to the failure I saw last night (line 4171401):
Maybe UCX/UCX-py needs to be instructed not to fork as well? |
It seems like in your jobscript you're not setting |
That's correct @pentschev. I misunderstood your earlier comment. I'll re-run with that addition. |
Thanks for the patience and persistence @lastephey , and apologies if my comment was unclear. |
No problem-- I appreciate all of your help @pentschev @wence- and @quasiben. So far it seems like |
My 64 node test hit timeout at 90 minutes. I think maybe
I do see things like this appearing in the job log around line 2123218:
|
I found one further place where a fork happens. As a temporary fix, can you
But yes, that env var may help, although that slowdown is unacceptable. |
Thanks @wence-. I'm still a bit unclear on what exactly needs to be done though: do we need both the |
Ah sorry, my hope is that if we track down all the instances of forking we do not need `UCX_IB_FORK_INIT=n`. But I am not fully sure that this will be enough. |
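For illustration only (this is not how the thread ends up configuring it): UCX reads its `UCX_*` variables from the process environment at initialization time, so in practice the temporary workaround named above would be exported in the job script before any worker initializes UCX. A minimal in-process equivalent looks like this:

```python
# Temporary workaround discussed above: disable UCX's InfiniBand fork handling.
# It must be in the environment before any process initializes UCX; the thread
# ultimately avoids it by eliminating the forks themselves.
import os

os.environ.setdefault("UCX_IB_FORK_INIT", "n")
```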
Thanks for clarifying. Ok, I'll disable |
My latest 64 node test failed (job log here), unfortunately still with the same types of error messages:
Confirming that I did remove
My assessment is that we must still be missing some instances of forking. What do you all think? |
It is possible; I think we may need to take a second look at all the involved bits. @wence- should be out for some time on PTO, but I will pick this up with @quasiben next week. I think he also has access to Perlmutter, which should give us another point of support for debugging. This is anyway gonna be exciting when it is resolved! |
Ok sounds good. FYI I'll also be out on vacation all next week. Please feel free to keep posting updates. If you get stuck with Slurm/queues/anything NERSC-related, hopefully @rcthomas can help you out. I'll be in the wilderness for a few days, but I'll try to respond when I can. Again, I really appreciate all of your help troubleshooting this. |
Allows selection of the method multiprocessing uses to start child processes. Additionally, in the forkserver case, ensure the fork server is up and running before any computation happens. Potentially fixes #930. Needs dask/distributed#6580. cc: @pentschev, @quasiben
Authors:
- Lawrence Mitchell (https://github.com/wence-)
Approvers:
- Peter Andreas Entschev (https://github.com/pentschev)
URL: #933
I'm pretty sure that merge does let things progress further, but I still have some tests outstanding |
To update: I think that we managed to chase down all the places where the dask-cuda toolchain calls `fork`. In particular, one needs
To run, we must set
Run scripts for a Slurm-based submission system can be seen at https://github.com/wence-/dask-cuda-benchmarks/tree/main/perlmutter, and here are some unprocessed results. Some aspects of the dask-cuda benchmark data aggregation are not currently scalable; I will address this in #940. |
Thanks for all your work on this. I've been adapting your test scripts to run with a container (which at NERSC means Shifter) to try to get better and more uniform performance at scale (and also to avoid filesystem quota issues). You can see my work here. Shifter images are squashed and, as a result, read-only. I think the dask nannies are attempting to create directories in places like
Reading the nanny source code, it looks like we should be able to control this location via the
and also
but neither seems to have any effect. Is there a way to do this, preferably via an environment variable? For reference, here is a more complete traceback:
|
I think you can control this via the
Note that if you're pointing at a shared filesystem you should also say |
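A sketch of the kind of setting being referred to (the exact option is elided above, so both the `temporary-directory` config key and the `local_directory` keyword below are my assumptions rather than a quote from this thread): point dask's scratch space at writable node-local storage instead of a read-only path inside the container image.

```python
# Point dask's temporary/scratch directory at node-local, writable storage.
import dask
from dask_cuda import LocalCUDACluster

dask.config.set({"temporary-directory": "/tmp"})   # assumed config key
cluster = LocalCUDACluster(protocol="ucx",
                           local_directory="/tmp")  # assumed kwarg
```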
Thank you! Actually the problem was that we had set
These edits to job.sh:
seem to do what I need. If I understand the docs, it sounds like
I think this fixes my issue. I have some minor environment issues to fix (still need
Hi @lastephey, great. Just to note that we've merged the benchmark updates in #940, so no need to be on any strange branches. I've updated the job scripts to match. We now have to pass
Let us know if the runs work for you, and if we can go ahead and close this particular issue. |
Great, thank you. Let me rebuild the image with your updates in #940 and verify that it scales. Then yes I am happy to close. |
Ok, I think everything seems to be running up to at least 64 nodes now! You can find some early results here. I'll keep adding as I go. If you ever want to test yourself, running in the container should require only a handful of path changes. I see some errors in the
but I think it's just the cluster complaining a bit while it shuts down. If those messages look normal/expected to you, please feel free to close this. Thank you very much for all your help and work on this @wence- and @quasiben! |
Yes, these are expected (although it would be nice if they didn't appear at all) |
Closing, thanks for your patience and help @lastephey |
Dear dask-cuda devs,
We've been working with @quasiben to do some benchmarking with Dask and UCX on NERSC's Perlmutter system. (You can see some background in @rcthomas' issue here.) So far I've been able to get the TCP-based cudf merge benchmark to scale to about 256 nodes (it hit the timeout, but I think it would have finished), but I cannot get the UCX-based benchmark to run successfully at or above 64 nodes. The good news is that when it runs, UCX is quite a bit faster than TCP, but of course it would be nice to get it to scale up. You can view a summary of the benchmarking data I've gathered so far.
More details:
A few months ago @quasiben kindly provided his benchmark script, which I have lightly modified for my recent benchmarking efforts. You can view my modified version here.
I am using UCX 1.12.1 and the 22.06 rapids-nightly build. You can view my build script here. Note that this is all being done outside a container, using the NERSC shared filesystems.
I have set

```shell
export UCXPY_LOG_LEVEL=DEBUG
```

and included the logs of all of my trials here. Probably the most useful log is the UCX 64 node run. It's really huge and hard to navigate (sorry), but around line 4103066 you start to see errors.

Seeing these, I shrank the problem a bit compared to what @quasiben had done in his benchmark (40k vs 50k). I also made the RMM pool a bit smaller (`--rmm-pool-size 35GB`), and I bumped up the ulimit as far as it can go on Perlmutter, although I don't know if this matters.

Do you think this is really a memory issue? Based on this it sounds like it could be an RDMA config issue, but I am way out of my depth here, so I'd appreciate any advice you have.
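For context, a minimal sketch of how an RMM pool size like the one above is applied when creating a cluster (not the benchmark script itself; `rmm_pool_size` is assumed here to be the API counterpart of the CLI flag), where the pool has to fit comfortably within each A100's 40 GB:

```python
# Create a UCX-enabled cluster with a 35 GB RMM pool per GPU, mirroring the
# --rmm-pool-size 35GB setting used for the benchmark runs.
from dask_cuda import LocalCUDACluster

cluster = LocalCUDACluster(protocol="ucx", rmm_pool_size="35GB")
```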
Thank you very much,
Laurie