bench_SSSP Failing in bench_algos.py due to Dask Error #4511

Closed
2 tasks done
nv-rliu opened this issue Jun 28, 2024 · 0 comments · Fixed by #4541

nv-rliu commented Jun 28, 2024

Version

24.08

Which installation method(s) does this occur on?

Source

Describe the bug.

The SSSP algorithm run in cugraph/benchmarks/cugraph/pytest-based/bench_algos.py is failing due to what we suspect is a Dask error; a sketch of the suspected failure mode is included below, followed by the full log output.
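For context, the traceback further down shows the benchmark pulling the source vertex straight out of the distributed edgelist and handing it to SSSP, and the worker-side args show a Dask Series reaching _call_plc_sssp, which then fails in sssp.pyx with "TypeError('an integer is required')". The snippet below is a minimal sketch of that call pattern plus a hypothetical workaround that forces the source to a plain host integer first; the helper name and the int(...) conversion are assumptions for illustration, not the confirmed root cause or fix.

```python
# Sketch only: mirrors the call pattern in bench_sssp and adds a hypothetical
# workaround. Not the confirmed root cause or the actual fix.
import cugraph.dask as dask_cugraph


def run_mg_sssp(graph):
    # bench_sssp picks the source vertex by indexing the distributed edgelist.
    # With dask-expr-backed dask-cudf this can hand back a lazy/device object
    # instead of a plain Python int, which pylibcugraph's SSSP rejects.
    start = graph.edgelist.edgelist_df["src"][0]

    # Hypothetical workaround: materialize the value and convert it to a
    # host-side int before calling the dask wrapper.
    if hasattr(start, "compute"):
        start = start.compute()
    if hasattr(start, "iloc"):
        start = start.iloc[0]
    return dask_cugraph.sssp(graph, int(start))
```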

Minimum reproducible example

pytest -v --import-mode=append bench_algos.py
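In case a reproducer outside of pytest is useful, the rough shape of the same MG SSSP code path is sketched below; the CSV path, dtypes, and cluster setup are placeholders and are not taken from the benchmark harness.

```python
# Hypothetical standalone reproducer for the MG SSSP path; the input file,
# column dtypes, and cluster sizing are placeholders.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf
import cugraph
import cugraph.dask as dask_cugraph
from cugraph.dask.comms import comms as Comms

if __name__ == "__main__":
    cluster = LocalCUDACluster()
    client = Client(cluster)
    Comms.initialize(p2p=True)

    # Placeholder edge list with src, dst, value columns.
    ddf = dask_cudf.read_csv(
        "edges.csv",
        names=["src", "dst", "value"],
        dtype=["int32", "int32", "float32"],
        header=None,
    )

    G = cugraph.Graph(directed=True)
    G.from_dask_cudf_edgelist(ddf, source="src", destination="dst", edge_attr="value")

    # Same source-vertex selection pattern as bench_sssp.
    start = G.edgelist.edgelist_df["src"][0]
    result = dask_cugraph.sssp(G, start)
    print(result.compute().head())

    Comms.destroy()
    client.close()
    cluster.close()
```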

Relevant log output

06/28/24-11:18:07.934870578_UTC>>>> NODE 0: ******** STARTING BENCHMARK FROM: ./bench_algos.py::bench_sssp, using 8 GPUs
============================= test session starts ==============================
platform linux -- Python 3.10.14, pytest-8.2.2, pluggy-1.5.0 -- /opt/conda/bin/python3.10
cachedir: .pytest_cache
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rapids_pytest_benchmark: 0.0.15
rootdir: /root/cugraph/benchmarks
configfile: pytest.ini
plugins: cov-5.0.0, benchmark-4.0.0, rapids-pytest-benchmark-0.0.15
collecting ... collected 720 items / 719 deselected / 1 selected

bench_algos.py::bench_sssp[ds:rmat_mg_20_16-mm:False-pa:True] /opt/conda/lib/python3.10/contextlib.py:142: UserWarning: Creating scratch directories is taking a surprisingly long time. (1.02s) This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
  next(self.gen)
/opt/conda/lib/python3.10/contextlib.py:142: UserWarning: Creating scratch directories is taking a surprisingly long time. (1.08s) This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
  next(self.gen)
/opt/conda/lib/python3.10/contextlib.py:142: UserWarning: Creating scratch directories is taking a surprisingly long time. (1.02s) This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
  next(self.gen)
/opt/conda/lib/python3.10/contextlib.py:142: UserWarning: Creating scratch directories is taking a surprisingly long time. (1.08s) This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
  next(self.gen)
/opt/conda/lib/python3.10/contextlib.py:142: UserWarning: Creating scratch directories is taking a surprisingly long time. (1.02s) This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
  next(self.gen)
/opt/conda/lib/python3.10/contextlib.py:142: UserWarning: Creating scratch directories is taking a surprisingly long time. (1.03s) This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
  next(self.gen)
/opt/conda/lib/python3.10/contextlib.py:142: UserWarning: Creating scratch directories is taking a surprisingly long time. (1.05s) This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
  next(self.gen)
/opt/conda/lib/python3.10/contextlib.py:142: UserWarning: Creating scratch directories is taking a surprisingly long time. (1.03s) This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
  next(self.gen)
[1719573501.816108] [rno1-m02-b07-dgx1-012:2539539:0]          parser.c:2036 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573501.816108] [rno1-m02-b07-dgx1-012:2539539:0]          parser.c:2036 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719573501.818897] [rno1-m02-b07-dgx1-012:2539535:0]          parser.c:2036 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573501.818897] [rno1-m02-b07-dgx1-012:2539535:0]          parser.c:2036 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719573501.818432] [rno1-m02-b07-dgx1-012:2539544:0]          parser.c:2036 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573501.818432] [rno1-m02-b07-dgx1-012:2539544:0]          parser.c:2036 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719573501.817143] [rno1-m02-b07-dgx1-012:2539548:0]          parser.c:2036 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573501.817143] [rno1-m02-b07-dgx1-012:2539548:0]          parser.c:2036 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719573501.819079] [rno1-m02-b07-dgx1-012:2539560:0]          parser.c:2036 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573501.819079] [rno1-m02-b07-dgx1-012:2539560:0]          parser.c:2036 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719573501.815591] [rno1-m02-b07-dgx1-012:2539552:0]          parser.c:2036 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573501.815591] [rno1-m02-b07-dgx1-012:2539552:0]          parser.c:2036 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719573501.819155] [rno1-m02-b07-dgx1-012:2539556:0]          parser.c:2036 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573501.819155] [rno1-m02-b07-dgx1-012:2539556:0]          parser.c:2036 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719573501.812650] [rno1-m02-b07-dgx1-012:2539563:0]          parser.c:2036 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573501.812650] [rno1-m02-b07-dgx1-012:2539563:0]          parser.c:2036 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2024-06-28 04:18:41,346 - distributed.worker - WARNING - Compute Failed
Key:       _call_plc_sssp-34c6b819885cefbf2523014a144811a9
Function:  _call_plc_sssp
args:      (b'K\xe15$\xb2\xc3KF\x9e\xa5\xcc2\xc0B\x1f\xf3', <pylibcugraph.graphs.MGGraph object at 0x147929a789f0>, Dask Series Structure:
npartitions=16
    int32
      ...
    ...  
      ...
      ...
Dask Name: try_loc, 26 expressions
Expr=LocUnknown(frame=(Concat(frames=[ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['src'], name='src'), ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['dst'], name='dst'), ((Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))[Index(['value'], dtype='object')])[['value']]], axis=1))['src'], iindexer=slice(0, 0, None)), inf, True, False)
kwargs:    {}
Exception: "TypeError('an integer is required')"

2024-06-28 04:18:41,359 - distributed.worker - WARNING - Compute Failed
Key:       _call_plc_sssp-22dc6dd0859756e2a49bab5f0d7a888e
Function:  _call_plc_sssp
args:      (b'K\xe15$\xb2\xc3KF\x9e\xa5\xcc2\xc0B\x1f\xf3', <pylibcugraph.graphs.MGGraph object at 0x15283f56ae30>, Dask Series Structure:
npartitions=16
    int32
      ...
    ...  
      ...
      ...
Dask Name: try_loc, 26 expressions
Expr=LocUnknown(frame=(Concat(frames=[ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['src'], name='src'), ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['dst'], name='dst'), ((Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))[Index(['value'], dtype='object')])[['value']]], axis=1))['src'], iindexer=slice(0, 0, None)), inf, True, False)
kwargs:    {}
Exception: "TypeError('an integer is required')"

2024-06-28 04:18:41,366 - distributed.worker - WARNING - Compute Failed
Key:       _call_plc_sssp-80268bf8174714b457bb71d2484b69a7
Function:  _call_plc_sssp
args:      (b'K\xe15$\xb2\xc3KF\x9e\xa5\xcc2\xc0B\x1f\xf3', <pylibcugraph.graphs.MGGraph object at 0x14545cf5f530>, Dask Series Structure:
npartitions=16
    int32
      ...
    ...  
      ...
      ...
Dask Name: try_loc, 26 expressions
Expr=LocUnknown(frame=(Concat(frames=[ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['src'], name='src'), ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['dst'], name='dst'), ((Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))[Index(['value'], dtype='object')])[['value']]], axis=1))['src'], iindexer=slice(0, 0, None)), inf, True, False)
kwargs:    {}
Exception: "TypeError('an integer is required')"

2024-06-28 04:18:41,370 - distributed.worker - WARNING - Compute Failed
Key:       _call_plc_sssp-d624b0c0c32a2102cc419aa0596c6d19
Function:  _call_plc_sssp
args:      (b'K\xe15$\xb2\xc3KF\x9e\xa5\xcc2\xc0B\x1f\xf3', <pylibcugraph.graphs.MGGraph object at 0x15347217dab0>, Dask Series Structure:
npartitions=16
    int32
      ...
    ...  
      ...
      ...
Dask Name: try_loc, 26 expressions
Expr=LocUnknown(frame=(Concat(frames=[ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['src'], name='src'), ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['dst'], name='dst'), ((Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))[Index(['value'], dtype='object')])[['value']]], axis=1))['src'], iindexer=slice(0, 0, None)), inf, True, False)
kwargs:    {}
Exception: "TypeError('an integer is required')"

2024-06-28 04:18:41,380 - distributed.worker - WARNING - Compute Failed
Key:       _call_plc_sssp-9a92f2aaf91ec752f6b47fe090f53526
Function:  _call_plc_sssp
args:      (b'K\xe15$\xb2\xc3KF\x9e\xa5\xcc2\xc0B\x1f\xf3', <pylibcugraph.graphs.MGGraph object at 0x155237245430>, Dask Series Structure:
npartitions=16
    int32
      ...
    ...  
      ...
      ...
Dask Name: try_loc, 26 expressions
Expr=LocUnknown(frame=(Concat(frames=[ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['src'], name='src'), ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['dst'], name='dst'), ((Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))[Index(['value'], dtype='object')])[['value']]], axis=1))['src'], iindexer=slice(0, 0, None)), inf, True, False)
kwargs:    {}
Exception: "TypeError('an integer is required')"

2024-06-28 04:18:41,391 - distributed.worker - WARNING - Compute Failed
Key:       _call_plc_sssp-6f8df8539f1a639b0af6f1a8794da1d6
Function:  _call_plc_sssp
args:      (b'K\xe15$\xb2\xc3KF\x9e\xa5\xcc2\xc0B\x1f\xf3', <pylibcugraph.graphs.MGGraph object at 0x1534122f30b0>, Dask Series Structure:
npartitions=16
    int32
      ...
    ...  
      ...
      ...
Dask Name: try_loc, 26 expressions
Expr=LocUnknown(frame=(Concat(frames=[ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['src'], name='src'), ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['dst'], name='dst'), ((Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))[Index(['value'], dtype='object')])[['value']]], axis=1))['src'], iindexer=slice(0, 0, None)), inf, True, False)
kwargs:    {}
Exception: "TypeError('an integer is required')"

2024-06-28 04:18:41,404 - distributed.worker - WARNING - Compute Failed
Key:       _call_plc_sssp-478197bce70a075fd0bc6ffe2c1384a6
Function:  _call_plc_sssp
args:      (b'K\xe15$\xb2\xc3KF\x9e\xa5\xcc2\xc0B\x1f\xf3', <pylibcugraph.graphs.MGGraph object at 0x155001c9d6f0>, Dask Series Structure:
npartitions=16
    int32
      ...
    ...  
      ...
      ...
Dask Name: try_loc, 26 expressions
Expr=LocUnknown(frame=(Concat(frames=[ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['src'], name='src'), ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['dst'], name='dst'), ((Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))[Index(['value'], dtype='object')])[['value']]], axis=1))['src'], iindexer=slice(0, 0, None)), inf, True, False)
kwargs:    {}
Exception: "TypeError('an integer is required')"

2024-06-28 04:18:41,406 - distributed.worker - WARNING - Compute Failed
Key:       _call_plc_sssp-f603628522884f48855d7ee4076384a5
Function:  _call_plc_sssp
args:      (b'K\xe15$\xb2\xc3KF\x9e\xa5\xcc2\xc0B\x1f\xf3', <pylibcugraph.graphs.MGGraph object at 0x14f3a1e58370>, Dask Series Structure:
npartitions=16
    int32
      ...
    ...  
      ...
      ...
Dask Name: try_loc, 26 expressions
Expr=LocUnknown(frame=(Concat(frames=[ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['src'], name='src'), ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['dst'], name='dst'), ((Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))[Index(['value'], dtype='object')])[['value']]], axis=1))['src'], iindexer=slice(0, 0, None)), inf, True, False)
kwargs:    {}
Exception: "TypeError('an integer is required')"


Dask client/cluster created using LocalCUDACluster
FAILED/opt/conda/lib/python3.10/site-packages/pytest_benchmark/logger.py:46: PytestBenchmarkWarning: Not saving anything, no benchmarks have been run!
  warner(PytestBenchmarkWarning(text))

Dask client closed.


=================================== FAILURES ===================================
________________ bench_sssp[ds:rmat_mg_20_16-mm:False-pa:True] _________________

gpubenchmark = <rapids_pytest_benchmark.plugin.GPUBenchmarkFixture object at 0x147f6673b730>
graph = <cugraph.structure.graph_classes.Graph object at 0x147f67277070>

    def bench_sssp(gpubenchmark, graph):
        sssp = dask_cugraph.sssp if is_graph_distributed(graph) else cugraph.sssp
        start = graph.edgelist.edgelist_df["src"][0]
>       gpubenchmark(sssp, graph, start)

bench_algos.py:327: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/conda/lib/python3.10/site-packages/pytest_benchmark/fixture.py:125: in __call__
    return self._raw(function_to_benchmark, *args, **kwargs)
/opt/conda/lib/python3.10/site-packages/rapids_pytest_benchmark/plugin.py:322: in _raw
    function_result = super()._raw(function_to_benchmark, *args, **kwargs)
/opt/conda/lib/python3.10/site-packages/pytest_benchmark/fixture.py:147: in _raw
    duration, iterations, loops_range = self._calibrate_timer(runner)
/opt/conda/lib/python3.10/site-packages/pytest_benchmark/fixture.py:275: in _calibrate_timer
    duration = runner(loops_range)
/opt/conda/lib/python3.10/site-packages/pytest_benchmark/fixture.py:90: in runner
    function_to_benchmark(*args, **kwargs)
/opt/conda/lib/python3.10/site-packages/cugraph/dask/traversal/sssp.py:150: in sssp
    ddf = dask_cudf.from_delayed(result).persist()
/opt/conda/lib/python3.10/site-packages/dask_expr/io/_delayed.py:104: in from_delayed
    meta = delayed(make_meta)(dfs[0]).compute()
/opt/conda/lib/python3.10/site-packages/dask/base.py:375: in compute
    (result,) = compute(self, traverse=False, **kwargs)
/opt/conda/lib/python3.10/site-packages/dask/base.py:661: in compute
    results = schedule(dsk, keys, **kwargs)
/opt/conda/lib/python3.10/site-packages/cugraph/dask/traversal/sssp.py:28: in _call_plc_sssp
    vertices, distances, predecessors = pylibcugraph_sssp(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   TypeError: an integer is required

sssp.pyx:54: TypeError
------------------------------ Captured log setup ------------------------------
INFO     distributed.http.proxy:proxy.py:85 To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
INFO     distributed.scheduler:scheduler.py:1711 State start
INFO     distributed.scheduler:scheduler.py:4072   Scheduler at:     tcp://127.0.0.1:44321
INFO     distributed.scheduler:scheduler.py:4087   dashboard at:  http://127.0.0.1:8787/status
INFO     distributed.scheduler:scheduler.py:7874 Registering Worker plugin shuffle
INFO     distributed.nanny:nanny.py:368         Start Nanny at: 'tcp://127.0.0.1:46459'
INFO     distributed.nanny:nanny.py:368         Start Nanny at: 'tcp://127.0.0.1:44059'
INFO     distributed.nanny:nanny.py:368         Start Nanny at: 'tcp://127.0.0.1:40211'
INFO     distributed.nanny:nanny.py:368         Start Nanny at: 'tcp://127.0.0.1:38331'
INFO     distributed.nanny:nanny.py:368         Start Nanny at: 'tcp://127.0.0.1:45065'
INFO     distributed.nanny:nanny.py:368         Start Nanny at: 'tcp://127.0.0.1:33121'
INFO     distributed.nanny:nanny.py:368         Start Nanny at: 'tcp://127.0.0.1:44841'
INFO     distributed.nanny:nanny.py:368         Start Nanny at: 'tcp://127.0.0.1:40349'
INFO     distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:38153', name: 2, status: init, memory: 0, processing: 0>
INFO     distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:38153
INFO     distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:59152
INFO     distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:46845', name: 5, status: init, memory: 0, processing: 0>
INFO     distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:46845
INFO     distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:59154
INFO     distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:39813', name: 1, status: init, memory: 0, processing: 0>
INFO     distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:39813
INFO     distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:59156
INFO     distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:42007', name: 3, status: init, memory: 0, processing: 0>
INFO     distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:42007
INFO     distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:59148
INFO     distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:45179', name: 6, status: init, memory: 0, processing: 0>
INFO     distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:45179
INFO     distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:59160
INFO     distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:41977', name: 0, status: init, memory: 0, processing: 0>
INFO     distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:41977
INFO     distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:59150
INFO     distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:43437', name: 4, status: init, memory: 0, processing: 0>
INFO     distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:43437
INFO     distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:59158
INFO     distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:39211', name: 7, status: init, memory: 0, processing: 0>
INFO     distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:39211
INFO     distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:59162
INFO     distributed.scheduler:scheduler.py:5686 Receive client connection: Client-209aa2f0-3540-11ef-bff4-a81e84c35e35
INFO     distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:59164
INFO     distributed.worker:worker.py:3178 Run out-of-band function '_func_set_scheduler_as_nccl_root'
INFO     distributed.protocol.core:core.py:111 Failed to serialize (can not serialize 'dict_keys' object); falling back to pickle. Be aware that this may degrade performance.
INFO     distributed.protocol.core:core.py:111 Failed to serialize (can not serialize 'dict_keys' object); falling back to pickle. Be aware that this may degrade performance.
---------------------------- Captured log teardown -----------------------------
INFO     distributed.worker:worker.py:3178 Run out-of-band function '_func_destroy_scheduler_session'
INFO     distributed.scheduler:scheduler.py:5730 Remove client Client-209aa2f0-3540-11ef-bff4-a81e84c35e35
INFO     distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:59164; closing.
INFO     distributed.scheduler:scheduler.py:5730 Remove client Client-209aa2f0-3540-11ef-bff4-a81e84c35e35
INFO     distributed.scheduler:scheduler.py:5722 Close client connection: Client-209aa2f0-3540-11ef-bff4-a81e84c35e35
INFO     distributed.scheduler:scheduler.py:7317 Retire worker addresses (0, 1, 2, 3, 4, 5, 6, 7)
INFO     distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:46459'. Reason: nanny-close
INFO     distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO     distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:44059'. Reason: nanny-close
INFO     distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO     distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:40211'. Reason: nanny-close
INFO     distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO     distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:38331'. Reason: nanny-close
INFO     distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO     distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:45065'. Reason: nanny-close
INFO     distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO     distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:33121'. Reason: nanny-close
INFO     distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO     distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:44841'. Reason: nanny-close
INFO     distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO     distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:40349'. Reason: nanny-close
INFO     distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO     distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:59162; closing.
INFO     distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:59150; closing.
INFO     distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:59148; closing.
INFO     distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:59158; closing.
INFO     distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:39211', name: 7, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573522.2782512')
INFO     distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:59156; closing.
INFO     distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:59152; closing.
INFO     distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:59160; closing.
INFO     distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:41977', name: 0, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573522.2792294')
INFO     distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:59154; closing.
INFO     distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:42007', name: 3, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573522.2796857')
INFO     distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:43437', name: 4, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573522.27991')
INFO     distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:39813', name: 1, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573522.2805834')
INFO     distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:38153', name: 2, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573522.2808607')
INFO     distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:45179', name: 6, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573522.2810738')
INFO     distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:46845', name: 5, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573522.2812953')
INFO     distributed.scheduler:scheduler.py:5331 Lost all workers
INFO     distributed.batched:batched.py:122 Batched Comm Closed <TCP (closed) Scheduler connection to worker local=tcp://127.0.0.1:44321 remote=tcp://127.0.0.1:59152>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/distributed/batched.py", line 115, in _background_send
    nbytes = yield coro
  File "/opt/conda/lib/python3.10/site-packages/tornado/gen.py", line 766, in run
    value = future.result()
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 262, in write
    raise CommClosedError()
distributed.comm.core.CommClosedError
INFO     distributed.batched:batched.py:122 Batched Comm Closed <TCP (closed) Scheduler connection to worker local=tcp://127.0.0.1:44321 remote=tcp://127.0.0.1:59156>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/distributed/batched.py", line 115, in _background_send
    nbytes = yield coro
  File "/opt/conda/lib/python3.10/site-packages/tornado/gen.py", line 766, in run
    value = future.result()
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 262, in write
    raise CommClosedError()
distributed.comm.core.CommClosedError
INFO     distributed.batched:batched.py:122 Batched Comm Closed <TCP (closed) Scheduler connection to worker local=tcp://127.0.0.1:44321 remote=tcp://127.0.0.1:59160>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/distributed/batched.py", line 115, in _background_send
    nbytes = yield coro
  File "/opt/conda/lib/python3.10/site-packages/tornado/gen.py", line 766, in run
    value = future.result()
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 262, in write
    raise CommClosedError()
distributed.comm.core.CommClosedError
INFO     distributed.batched:batched.py:122 Batched Comm Closed <TCP (closed) Scheduler connection to worker local=tcp://127.0.0.1:44321 remote=tcp://127.0.0.1:59154>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/distributed/batched.py", line 115, in _background_send
    nbytes = yield coro
  File "/opt/conda/lib/python3.10/site-packages/tornado/gen.py", line 766, in run
    value = future.result()
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 262, in write
    raise CommClosedError()
distributed.comm.core.CommClosedError
INFO     distributed.scheduler:scheduler.py:4146 Scheduler closing due to unknown reason...
INFO     distributed.scheduler:scheduler.py:4164 Scheduler closing all comms
=============================== warnings summary ===============================
cugraph/pytest-based/bench_algos.py::bench_sssp[ds:rmat_mg_20_16-mm:False-pa:True]
cugraph/pytest-based/bench_algos.py::bench_sssp[ds:rmat_mg_20_16-mm:False-pa:True]
  /opt/conda/lib/python3.10/site-packages/cudf/core/reshape.py:350: FutureWarning: The behavior of array concatenation with empty entries is deprecated. In a future version, this will no longer exclude empty items when determining the result dtype. To retain the old behavior, exclude the empty entries before the concat operation.
    warnings.warn(

cugraph/pytest-based/bench_algos.py::bench_sssp[ds:rmat_mg_20_16-mm:False-pa:True]
  cugraph/pytest-based/bench_algos.py:324: PytestBenchmarkWarning: Benchmark fixture was not used at all in this test!

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
- generated xml file: /gpfs/fs1/projects/sw_rapids/users/rratzel/cugraph-results/latest/benchmarks/8-GPU/bench_sssp/test_results.json -
=========================== short test summary info ============================
FAILED bench_algos.py::bench_sssp[ds:rmat_mg_20_16-mm:False-pa:True] - TypeEr...
================ 1 failed, 719 deselected, 3 warnings in 36.32s ================
06/28/24-11:18:45.978123558_UTC>>>> NODE 0: pytest exited with code: 1, run-py-tests.sh overall exit code is: 1
06/28/24-11:18:46.081545937_UTC>>>> NODE 0: remaining python processes: [ 2535211 /usr/bin/python2 /usr/local/dcgm-nvdataflow/DcgmNVDataflowPoster.py ]
06/28/24-11:18:46.107405380_UTC>>>> NODE 0: remaining dask processes: [  ]

Environment details

Being run inside the nightly cugraph MNMG testing containers

Other/Misc.

@jnke2016 and @nv-rliu are looking into this.

Code of Conduct

  • I agree to follow cuGraph's Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report
@nv-rliu nv-rliu added bug Something isn't working graph-devops Issues for the graph-devops team benchmarks labels Jun 28, 2024
@nv-rliu nv-rliu added this to the 24.08 milestone Jun 28, 2024
@rapids-bot rapids-bot bot closed this as completed in a41d6b0 Jul 19, 2024