Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unexplained slowness in @everywhere remotecalls with imports #44645

Closed
exaexa opened this issue Mar 16, 2022 · 2 comments · Fixed by #44671
Closed

Unexplained slowness in @everywhere remotecalls with imports #44645

exaexa opened this issue Mar 16, 2022 · 2 comments · Fixed by #44671
Labels
parallelism Parallel or distributed computation

Comments

@exaexa
Copy link
Contributor

exaexa commented Mar 16, 2022

Hello there!

So I spent some time debugging an issue similar to #42156 and perhaps #39291 and arrived at 2 different possible sources of slowness:

  • @everywhere uses @sync_add from remotecall_eval to add Futures into a queue that gets wait()ed in the end. Unfortunately, waiting for a future seems to cause some additional processing on each remote worker, which is run serially. I managed to get fully parallel and fast imports by doing this manually (code below).
  • For reasonable reasons, @everywhere loads all imports locally first. Which sounds like a good idea, but for unidentified reason (possibly this one) this causes a behavior where it "goes slowly through all workers" for me again.

To test it, I'm starting with:

using Distributed
addprocs(32)

If the program continues with

@time @everywhere workers() using Symbolics

I usually get something like 45 seconds. (tested on 16c dell xps, lots of RAM, the thing is not specific to Symbolics) The main problem is the way this uses CPU: First, Symbolics are imported locally (several seconds 1CPU), then there's the parallel part (~15s of all CPUs working) with actual loading, then there's the part when the results are being wait()ed for serially (another 15s, only 1 CPU works at a time), and finally there's a bit of extra parallel processing (roughly 10 seconds, for which I don't have any explanation).

This code manages to import the package in around 17 seconds, and the CPU use profile looks correctly "full" in this case, no gaps!

@time asyncmap(fetch, [remotecall(()->Base.eval(Main, :(using Symbolics)), pid) for pid in workers()])

This doesn't import Symbolics locally, which should take like 3 seconds (see below), but it's still not a fair comparison.

So I tried removing the line with auto-imports from @everywhere at

$(isempty(imps) ? nothing : Expr(:toplevel, imps...)) # run imports locally first
, to make the "actions" comparable. That shrunk the time from around 45 to 33 seconds. Looking at the difference, I really wanted to understand what's so bad on this (it should be 3s not 12s!), so tried adding the import using the asyncmap code.

Unfortunately, importing the package first throws the time back to ~45 seconds

@time begin
    using Symbolics
    asyncmap(fetch, [remotecall(()->Base.eval(Main, :(using Symbolics)), pid) for pid in workers()])
end

...despite the time for importing Symbolics and time for running the asyncmap should add to under 20 seconds:

julia> @time @eval using Symbolics
  3.233303 seconds (9.33 M allocations: 634.171 MiB, 1.71% gc time, 21.58% compilation time)

Also happens on other packages (e.g. JuMP), but for Symbolics it behaves particularly weird.

Questions:

  • is there any explanation of what's happening?
  • If there's an explanation for why the import slows the thing down, would a patch to @everywhere be acceptable that replaces the @sync_add with asynchronous waiting for the futures?
  • EDIT: extra thought since yesterday: Is there any good way to debug/profile/log what's actually happening and getting processed at the workers? Obviously, throwing a @profile into the command doesn't reveal much.

Notably, fixing this would allow happy julia parallel computing on like 1000s of worker nodes with just basic Distributed and no dirty tricks. I managed to test the asyncmap on HPC with 1024 workers, loading our packages there shrinks from ~350 seconds with @everywhere to nice 10 seconds.

Thanks!
-mk

@giordano giordano added the parallelism Parallel or distributed computation label Mar 16, 2022
@exaexa
Copy link
Contributor Author

exaexa commented Mar 17, 2022

Extra observation: Simpler packages do not cause the weird problem of increasing the time when using is around, looks like it's only the ones that overload stuff from other packages. So likely very related to the precompile generation. (Need to measure precisely though.)

Problem with serial-ish @everywhere is still there (e.g., 9 vs 21 seconds with UnicodePlots).

@exaexa
Copy link
Contributor Author

exaexa commented Mar 18, 2022

OK turns out that importing stuff after the workers are added is actually causes some communication with workers, which is visibly unoptimized (and in turn slow)...

This is a transcript from julia with debug print-outs in Distributed handle_msg:

julia> using Distributed

julia> addprocs(1)
1-element Vector{Int64}:
 2

julia>       From worker 2:	2 got header: Distributed.MsgHeader(Distributed.RRID(0, 0), Distributed.RRID(0, 0))
      From worker 2:	2 got msg T: Distributed.RemoteDoMsg
      From worker 2:	2 got msg: Distributed.RemoteDoMsg(Distributed.set_valid_processes, ([2],), Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}())
julia> 

julia> using UnicodePlots
[ Info: Precompiling UnicodePlots [b8865327-cd53-5732-bb35-84acbb429228]
      From worker 2:	2 got header: Distributed.MsgHeader(Distributed.RRID(1, 3), Distributed.RRID(0, 0))
      From worker 2:	2 got msg T: Distributed.CallMsg{:call}
      From worker 2:	2 got msg: Distributed.CallMsg{:call}(Distributed.var"#1#2"{Base.PkgId}(UnicodePlots [b8865327-cd53-5732-bb35-84acbb429228]), (), Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}())
      From worker 2:	2 got header: Distributed.MsgHeader(Distributed.RRID(0, 0), Distributed.RRID(1, 4))
      From worker 2:	2 got msg T: Distributed.CallMsg{:call_fetch}
      From worker 2:	2 got msg: Distributed.CallMsg{:call_fetch}(Distributed.wait_ref, (Distributed.RRID(1, 3), 1), Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}())
1 got header: Distributed.MsgHeader(Distributed.RRID(1, 4), Distributed.RRID(0, 0))
1 got msg T: Distributed.ResultMsg
1 got msg: Distributed.ResultMsg(nothing)

julia> 
      From worker 2:	2 got header: Distributed.MsgHeader(Distributed.RRID(0, 0), Distributed.RRID(0, 0))
      From worker 2:	2 got msg T: Distributed.RemoteDoMsg
      From worker 2:	2 got msg: Distributed.RemoteDoMsg(exit, (), Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}())

I was not able to find the source of that call in order to optimize it. I guess a similar fix as for the @everywhere could help there.

EDIT: The workers do not seem to get any information about UnicodePlots when the package loaded before adding workers, which looks a bit fishy. (Unless it gets there using a different channel)

EDIT2: Ah, found it, expect a patch. :D

KristofferC pushed a commit that referenced this issue Mar 23, 2022
* avoid using `@sync_add` on remotecalls

It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which
in turn calls wait() for all the futures synchronously. Not only that is
slightly detrimental for network operations (latencies add up), but in case of
Distributed the call to wait() may actually cause some compilation on remote
processes, which is also wait()ed for. In result, some operations took a great
amount of "serial" processing time if executed on many workers at once.

For me, this closes #44645.

The major change can be illustrated as follows: First add some workers:

```
using Distributed
addprocs(10)
```

and then trigger something that, for example, causes package imports on the
workers:

```
using SomeTinyPackage
```

In my case (importing UnicodePlots on 10 workers), this improves the loading
time over 10 workers from ~11s to ~5.5s.

This is a far bigger issue when worker count gets high. The time of the
processing on each worker is usually around 0.3s, so triggering this problem
even on a relatively small cluster (64 workers) causes a really annoying delay,
and running `@everywhere` for the first time on reasonable clusters (I tested
with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks.

Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s,
and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't
bother to measure that precisely now, sorry) to ~11s.

Related issues:
- Probably fixes #39291.
- #42156 is a kinda complementary -- it removes the most painful source of
  slowness (the 0.3s precompilation on the workers), but the fact that the
  wait()ing is serial remains a problem if the network latencies are high.

May help with #38931

Co-authored-by: Valentin Churavy <[email protected]>
KristofferC pushed a commit that referenced this issue Mar 25, 2022
* avoid using `@sync_add` on remotecalls

It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which
in turn calls wait() for all the futures synchronously. Not only that is
slightly detrimental for network operations (latencies add up), but in case of
Distributed the call to wait() may actually cause some compilation on remote
processes, which is also wait()ed for. In result, some operations took a great
amount of "serial" processing time if executed on many workers at once.

For me, this closes #44645.

The major change can be illustrated as follows: First add some workers:

```
using Distributed
addprocs(10)
```

and then trigger something that, for example, causes package imports on the
workers:

```
using SomeTinyPackage
```

In my case (importing UnicodePlots on 10 workers), this improves the loading
time over 10 workers from ~11s to ~5.5s.

This is a far bigger issue when worker count gets high. The time of the
processing on each worker is usually around 0.3s, so triggering this problem
even on a relatively small cluster (64 workers) causes a really annoying delay,
and running `@everywhere` for the first time on reasonable clusters (I tested
with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks.

Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s,
and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't
bother to measure that precisely now, sorry) to ~11s.

Related issues:
- Probably fixes #39291.
- #42156 is a kinda complementary -- it removes the most painful source of
  slowness (the 0.3s precompilation on the workers), but the fact that the
  wait()ing is serial remains a problem if the network latencies are high.

May help with #38931

Co-authored-by: Valentin Churavy <[email protected]>
(cherry picked from commit 62e0729)
KristofferC pushed a commit that referenced this issue Apr 20, 2022
* avoid using `@sync_add` on remotecalls

It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which
in turn calls wait() for all the futures synchronously. Not only that is
slightly detrimental for network operations (latencies add up), but in case of
Distributed the call to wait() may actually cause some compilation on remote
processes, which is also wait()ed for. In result, some operations took a great
amount of "serial" processing time if executed on many workers at once.

For me, this closes #44645.

The major change can be illustrated as follows: First add some workers:

```
using Distributed
addprocs(10)
```

and then trigger something that, for example, causes package imports on the
workers:

```
using SomeTinyPackage
```

In my case (importing UnicodePlots on 10 workers), this improves the loading
time over 10 workers from ~11s to ~5.5s.

This is a far bigger issue when worker count gets high. The time of the
processing on each worker is usually around 0.3s, so triggering this problem
even on a relatively small cluster (64 workers) causes a really annoying delay,
and running `@everywhere` for the first time on reasonable clusters (I tested
with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks.

Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s,
and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't
bother to measure that precisely now, sorry) to ~11s.

Related issues:
- Probably fixes #39291.
- #42156 is a kinda complementary -- it removes the most painful source of
  slowness (the 0.3s precompilation on the workers), but the fact that the
  wait()ing is serial remains a problem if the network latencies are high.

May help with #38931

Co-authored-by: Valentin Churavy <[email protected]>
(cherry picked from commit 62e0729)
KristofferC pushed a commit that referenced this issue May 23, 2022
* avoid using `@sync_add` on remotecalls

It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which
in turn calls wait() for all the futures synchronously. Not only that is
slightly detrimental for network operations (latencies add up), but in case of
Distributed the call to wait() may actually cause some compilation on remote
processes, which is also wait()ed for. In result, some operations took a great
amount of "serial" processing time if executed on many workers at once.

For me, this closes #44645.

The major change can be illustrated as follows: First add some workers:

```
using Distributed
addprocs(10)
```

and then trigger something that, for example, causes package imports on the
workers:

```
using SomeTinyPackage
```

In my case (importing UnicodePlots on 10 workers), this improves the loading
time over 10 workers from ~11s to ~5.5s.

This is a far bigger issue when worker count gets high. The time of the
processing on each worker is usually around 0.3s, so triggering this problem
even on a relatively small cluster (64 workers) causes a really annoying delay,
and running `@everywhere` for the first time on reasonable clusters (I tested
with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks.

Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s,
and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't
bother to measure that precisely now, sorry) to ~11s.

Related issues:
- Probably fixes #39291.
- #42156 is a kinda complementary -- it removes the most painful source of
  slowness (the 0.3s precompilation on the workers), but the fact that the
  wait()ing is serial remains a problem if the network latencies are high.

May help with #38931

Co-authored-by: Valentin Churavy <[email protected]>
(cherry picked from commit 62e0729)
KristofferC pushed a commit that referenced this issue May 23, 2022
* avoid using `@sync_add` on remotecalls

It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which
in turn calls wait() for all the futures synchronously. Not only that is
slightly detrimental for network operations (latencies add up), but in case of
Distributed the call to wait() may actually cause some compilation on remote
processes, which is also wait()ed for. In result, some operations took a great
amount of "serial" processing time if executed on many workers at once.

For me, this closes #44645.

The major change can be illustrated as follows: First add some workers:

```
using Distributed
addprocs(10)
```

and then trigger something that, for example, causes package imports on the
workers:

```
using SomeTinyPackage
```

In my case (importing UnicodePlots on 10 workers), this improves the loading
time over 10 workers from ~11s to ~5.5s.

This is a far bigger issue when worker count gets high. The time of the
processing on each worker is usually around 0.3s, so triggering this problem
even on a relatively small cluster (64 workers) causes a really annoying delay,
and running `@everywhere` for the first time on reasonable clusters (I tested
with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks.

Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s,
and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't
bother to measure that precisely now, sorry) to ~11s.

Related issues:
- Probably fixes #39291.
- #42156 is a kinda complementary -- it removes the most painful source of
  slowness (the 0.3s precompilation on the workers), but the fact that the
  wait()ing is serial remains a problem if the network latencies are high.

May help with #38931

Co-authored-by: Valentin Churavy <[email protected]>
(cherry picked from commit 62e0729)
KristofferC pushed a commit that referenced this issue Jul 4, 2022
* avoid using `@sync_add` on remotecalls

It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which
in turn calls wait() for all the futures synchronously. Not only that is
slightly detrimental for network operations (latencies add up), but in case of
Distributed the call to wait() may actually cause some compilation on remote
processes, which is also wait()ed for. In result, some operations took a great
amount of "serial" processing time if executed on many workers at once.

For me, this closes #44645.

The major change can be illustrated as follows: First add some workers:

```
using Distributed
addprocs(10)
```

and then trigger something that, for example, causes package imports on the
workers:

```
using SomeTinyPackage
```

In my case (importing UnicodePlots on 10 workers), this improves the loading
time over 10 workers from ~11s to ~5.5s.

This is a far bigger issue when worker count gets high. The time of the
processing on each worker is usually around 0.3s, so triggering this problem
even on a relatively small cluster (64 workers) causes a really annoying delay,
and running `@everywhere` for the first time on reasonable clusters (I tested
with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks.

Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s,
and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't
bother to measure that precisely now, sorry) to ~11s.

Related issues:
- Probably fixes #39291.
- #42156 is a kinda complementary -- it removes the most painful source of
  slowness (the 0.3s precompilation on the workers), but the fact that the
  wait()ing is serial remains a problem if the network latencies are high.

May help with #38931

Co-authored-by: Valentin Churavy <[email protected]>
(cherry picked from commit 62e0729)
KristofferC pushed a commit that referenced this issue Dec 21, 2022
* avoid using `@sync_add` on remotecalls

It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which
in turn calls wait() for all the futures synchronously. Not only that is
slightly detrimental for network operations (latencies add up), but in case of
Distributed the call to wait() may actually cause some compilation on remote
processes, which is also wait()ed for. In result, some operations took a great
amount of "serial" processing time if executed on many workers at once.

For me, this closes #44645.

The major change can be illustrated as follows: First add some workers:

```
using Distributed
addprocs(10)
```

and then trigger something that, for example, causes package imports on the
workers:

```
using SomeTinyPackage
```

In my case (importing UnicodePlots on 10 workers), this improves the loading
time over 10 workers from ~11s to ~5.5s.

This is a far bigger issue when worker count gets high. The time of the
processing on each worker is usually around 0.3s, so triggering this problem
even on a relatively small cluster (64 workers) causes a really annoying delay,
and running `@everywhere` for the first time on reasonable clusters (I tested
with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks.

Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s,
and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't
bother to measure that precisely now, sorry) to ~11s.

Related issues:
- Probably fixes #39291.
- #42156 is a kinda complementary -- it removes the most painful source of
  slowness (the 0.3s precompilation on the workers), but the fact that the
  wait()ing is serial remains a problem if the network latencies are high.

May help with #38931

Co-authored-by: Valentin Churavy <[email protected]>
(cherry picked from commit 62e0729)
staticfloat pushed a commit that referenced this issue Dec 23, 2022
* avoid using `@sync_add` on remotecalls

It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which
in turn calls wait() for all the futures synchronously. Not only that is
slightly detrimental for network operations (latencies add up), but in case of
Distributed the call to wait() may actually cause some compilation on remote
processes, which is also wait()ed for. In result, some operations took a great
amount of "serial" processing time if executed on many workers at once.

For me, this closes #44645.

The major change can be illustrated as follows: First add some workers:

```
using Distributed
addprocs(10)
```

and then trigger something that, for example, causes package imports on the
workers:

```
using SomeTinyPackage
```

In my case (importing UnicodePlots on 10 workers), this improves the loading
time over 10 workers from ~11s to ~5.5s.

This is a far bigger issue when worker count gets high. The time of the
processing on each worker is usually around 0.3s, so triggering this problem
even on a relatively small cluster (64 workers) causes a really annoying delay,
and running `@everywhere` for the first time on reasonable clusters (I tested
with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks.

Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s,
and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't
bother to measure that precisely now, sorry) to ~11s.

Related issues:
- Probably fixes #39291.
- #42156 is a kinda complementary -- it removes the most painful source of
  slowness (the 0.3s precompilation on the workers), but the fact that the
  wait()ing is serial remains a problem if the network latencies are high.

May help with #38931

Co-authored-by: Valentin Churavy <[email protected]>
(cherry picked from commit 62e0729)
vchuravy pushed a commit to JuliaLang/Distributed.jl that referenced this issue Oct 6, 2023
* avoid using `@sync_add` on remotecalls

It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which
in turn calls wait() for all the futures synchronously. Not only that is
slightly detrimental for network operations (latencies add up), but in case of
Distributed the call to wait() may actually cause some compilation on remote
processes, which is also wait()ed for. In result, some operations took a great
amount of "serial" processing time if executed on many workers at once.

For me, this closes JuliaLang/julia#44645.

The major change can be illustrated as follows: First add some workers:

```
using Distributed
addprocs(10)
```

and then trigger something that, for example, causes package imports on the
workers:

```
using SomeTinyPackage
```

In my case (importing UnicodePlots on 10 workers), this improves the loading
time over 10 workers from ~11s to ~5.5s.

This is a far bigger issue when worker count gets high. The time of the
processing on each worker is usually around 0.3s, so triggering this problem
even on a relatively small cluster (64 workers) causes a really annoying delay,
and running `@everywhere` for the first time on reasonable clusters (I tested
with 1024 workers, see JuliaLang/julia#44645) usually takes more than 5 minutes. Which sucks.

Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s,
and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't
bother to measure that precisely now, sorry) to ~11s.

Related issues:
- Probably fixes JuliaLang/julia#39291.
- JuliaLang/julia#42156 is a kinda complementary -- it removes the most painful source of
  slowness (the 0.3s precompilation on the workers), but the fact that the
  wait()ing is serial remains a problem if the network latencies are high.

May help with JuliaLang/julia#38931

Co-authored-by: Valentin Churavy <[email protected]>
(cherry picked from commit 3b57a49)
Keno pushed a commit that referenced this issue Jun 5, 2024
* avoid using `@sync_add` on remotecalls

It seems like @sync_add adds the Futures to a queue (Channel) for @sync, which
in turn calls wait() for all the futures synchronously. Not only that is
slightly detrimental for network operations (latencies add up), but in case of
Distributed the call to wait() may actually cause some compilation on remote
processes, which is also wait()ed for. In result, some operations took a great
amount of "serial" processing time if executed on many workers at once.

For me, this closes #44645.

The major change can be illustrated as follows: First add some workers:

```
using Distributed
addprocs(10)
```

and then trigger something that, for example, causes package imports on the
workers:

```
using SomeTinyPackage
```

In my case (importing UnicodePlots on 10 workers), this improves the loading
time over 10 workers from ~11s to ~5.5s.

This is a far bigger issue when worker count gets high. The time of the
processing on each worker is usually around 0.3s, so triggering this problem
even on a relatively small cluster (64 workers) causes a really annoying delay,
and running `@everywhere` for the first time on reasonable clusters (I tested
with 1024 workers, see #44645) usually takes more than 5 minutes. Which sucks.

Anyway, on 64 workers this reduces the "first import" time from ~30s to ~6s,
and on 1024 workers this seems to reduce the time from over 5 minutes (I didn't
bother to measure that precisely now, sorry) to ~11s.

Related issues:
- Probably fixes #39291.
- #42156 is a kinda complementary -- it removes the most painful source of
  slowness (the 0.3s precompilation on the workers), but the fact that the
  wait()ing is serial remains a problem if the network latencies are high.

May help with #38931

Co-authored-by: Valentin Churavy <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parallelism Parallel or distributed computation
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants