bench_SSSP Failing in bench_algos.py due to Dask Error #4511

Closed
2 tasks done
nv-rliu opened this issue Jun 28, 2024 · 0 comments · Fixed by #4541

nv-rliu commented Jun 28, 2024

Version

24.08

Which installation method(s) does this occur on?

Source

Describe the bug.

The SSSP algorithm run in cugraph/benchmarks/cugraph/pytest-based/bench_algos.py is failing due to what we suspect is a Dask error; a sketch of the suspected failure mode is included below, followed by the full log output.
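For context, the traceback further down shows the benchmark pulling the source vertex straight out of the distributed edgelist and handing it to SSSP, and the worker-side args show a Dask Series reaching _call_plc_sssp, which then fails in sssp.pyx with "TypeError('an integer is required')". The snippet below is a minimal sketch of that call pattern plus a hypothetical workaround that forces the source to a plain host integer first; the helper name and the int(...) conversion are assumptions for illustration, not the confirmed root cause or fix.

```python
# Sketch only: mirrors the call pattern in bench_sssp and adds a hypothetical
# workaround. Not the confirmed root cause or the actual fix.
import cugraph.dask as dask_cugraph


def run_mg_sssp(graph):
    # bench_sssp picks the source vertex by indexing the distributed edgelist.
    # With dask-expr-backed dask-cudf this can hand back a lazy/device object
    # instead of a plain Python int, which pylibcugraph's SSSP rejects.
    start = graph.edgelist.edgelist_df["src"][0]

    # Hypothetical workaround: materialize the value and convert it to a
    # host-side int before calling the dask wrapper.
    if hasattr(start, "compute"):
        start = start.compute()
    if hasattr(start, "iloc"):
        start = start.iloc[0]
    return dask_cugraph.sssp(graph, int(start))
```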

Minimum reproducible example

pytest -v --import-mode=append bench_algos.py
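In case a reproducer outside of pytest is useful, the rough shape of the same MG SSSP code path is sketched below; the CSV path, dtypes, and cluster setup are placeholders and are not taken from the benchmark harness.

```python
# Hypothetical standalone reproducer for the MG SSSP path; the input file,
# column dtypes, and cluster sizing are placeholders.
from dask.distributed import Client
from dask_cuda import LocalCUDACluster
import dask_cudf
import cugraph
import cugraph.dask as dask_cugraph
from cugraph.dask.comms import comms as Comms

if __name__ == "__main__":
    cluster = LocalCUDACluster()
    client = Client(cluster)
    Comms.initialize(p2p=True)

    # Placeholder edge list with src, dst, value columns.
    ddf = dask_cudf.read_csv(
        "edges.csv",
        names=["src", "dst", "value"],
        dtype=["int32", "int32", "float32"],
        header=None,
    )

    G = cugraph.Graph(directed=True)
    G.from_dask_cudf_edgelist(ddf, source="src", destination="dst", edge_attr="value")

    # Same source-vertex selection pattern as bench_sssp.
    start = G.edgelist.edgelist_df["src"][0]
    result = dask_cugraph.sssp(G, start)
    print(result.compute().head())

    Comms.destroy()
    client.close()
    cluster.close()
```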

Relevant log output

06/28/24-11:18:07.934870578_UTC>>>> NODE 0: ******** STARTING BENCHMARK FROM: ./bench_algos.py::bench_sssp, using 8 GPUs
============================= test session starts ==============================
platform linux -- Python 3.10.14, pytest-8.2.2, pluggy-1.5.0 -- /opt/conda/bin/python3.10
cachedir: .pytest_cache
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rapids_pytest_benchmark: 0.0.15
rootdir: /root/cugraph/benchmarks
configfile: pytest.ini
plugins: cov-5.0.0, benchmark-4.0.0, rapids-pytest-benchmark-0.0.15
collecting ... collected 720 items / 719 deselected / 1 selected

bench_algos.py::bench_sssp[ds:rmat_mg_20_16-mm:False-pa:True] /opt/conda/lib/python3.10/contextlib.py:142: UserWarning: Creating scratch directories is taking a surprisingly long time. (1.02s) This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
  next(self.gen)
/opt/conda/lib/python3.10/contextlib.py:142: UserWarning: Creating scratch directories is taking a surprisingly long time. (1.08s) This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
  next(self.gen)
/opt/conda/lib/python3.10/contextlib.py:142: UserWarning: Creating scratch directories is taking a surprisingly long time. (1.02s) This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
  next(self.gen)
/opt/conda/lib/python3.10/contextlib.py:142: UserWarning: Creating scratch directories is taking a surprisingly long time. (1.08s) This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
  next(self.gen)
/opt/conda/lib/python3.10/contextlib.py:142: UserWarning: Creating scratch directories is taking a surprisingly long time. (1.02s) This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
  next(self.gen)
/opt/conda/lib/python3.10/contextlib.py:142: UserWarning: Creating scratch directories is taking a surprisingly long time. (1.03s) This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
  next(self.gen)
/opt/conda/lib/python3.10/contextlib.py:142: UserWarning: Creating scratch directories is taking a surprisingly long time. (1.05s) This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
  next(self.gen)
/opt/conda/lib/python3.10/contextlib.py:142: UserWarning: Creating scratch directories is taking a surprisingly long time. (1.03s) This is often due to running workers on a network file system. Consider specifying a local-directory to point workers to write scratch data to a local disk.
  next(self.gen)
[1719573501.816108] [rno1-m02-b07-dgx1-012:2539539:0]          parser.c:2036 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573501.816108] [rno1-m02-b07-dgx1-012:2539539:0]          parser.c:2036 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719573501.818897] [rno1-m02-b07-dgx1-012:2539535:0]          parser.c:2036 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573501.818897] [rno1-m02-b07-dgx1-012:2539535:0]          parser.c:2036 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719573501.818432] [rno1-m02-b07-dgx1-012:2539544:0]          parser.c:2036 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573501.818432] [rno1-m02-b07-dgx1-012:2539544:0]          parser.c:2036 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719573501.817143] [rno1-m02-b07-dgx1-012:2539548:0]          parser.c:2036 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573501.817143] [rno1-m02-b07-dgx1-012:2539548:0]          parser.c:2036 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719573501.819079] [rno1-m02-b07-dgx1-012:2539560:0]          parser.c:2036 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573501.819079] [rno1-m02-b07-dgx1-012:2539560:0]          parser.c:2036 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719573501.815591] [rno1-m02-b07-dgx1-012:2539552:0]          parser.c:2036 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573501.815591] [rno1-m02-b07-dgx1-012:2539552:0]          parser.c:2036 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719573501.819155] [rno1-m02-b07-dgx1-012:2539556:0]          parser.c:2036 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573501.819155] [rno1-m02-b07-dgx1-012:2539556:0]          parser.c:2036 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
[1719573501.812650] [rno1-m02-b07-dgx1-012:2539563:0]          parser.c:2036 UCX  WARN  unused environment variable: UCX_MEMTYPE_CACHE (maybe: UCX_MEMTYPE_CACHE?)
[1719573501.812650] [rno1-m02-b07-dgx1-012:2539563:0]          parser.c:2036 UCX  WARN  (set UCX_WARN_UNUSED_ENV_VARS=n to suppress this warning)
2024-06-28 04:18:41,346 - distributed.worker - WARNING - Compute Failed
Key:       _call_plc_sssp-34c6b819885cefbf2523014a144811a9
Function:  _call_plc_sssp
args:      (b'K\xe15$\xb2\xc3KF\x9e\xa5\xcc2\xc0B\x1f\xf3', <pylibcugraph.graphs.MGGraph object at 0x147929a789f0>, Dask Series Structure:
npartitions=16
    int32
      ...
    ...  
      ...
      ...
Dask Name: try_loc, 26 expressions
Expr=LocUnknown(frame=(Concat(frames=[ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['src'], name='src'), ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['dst'], name='dst'), ((Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))[Index(['value'], dtype='object')])[['value']]], axis=1))['src'], iindexer=slice(0, 0, None)), inf, True, False)
kwargs:    {}
Exception: "TypeError('an integer is required')"

2024-06-28 04:18:41,359 - distributed.worker - WARNING - Compute Failed
Key:       _call_plc_sssp-22dc6dd0859756e2a49bab5f0d7a888e
Function:  _call_plc_sssp
args:      (b'K\xe15$\xb2\xc3KF\x9e\xa5\xcc2\xc0B\x1f\xf3', <pylibcugraph.graphs.MGGraph object at 0x15283f56ae30>, Dask Series Structure:
npartitions=16
    int32
      ...
    ...  
      ...
      ...
Dask Name: try_loc, 26 expressions
Expr=LocUnknown(frame=(Concat(frames=[ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['src'], name='src'), ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['dst'], name='dst'), ((Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))[Index(['value'], dtype='object')])[['value']]], axis=1))['src'], iindexer=slice(0, 0, None)), inf, True, False)
kwargs:    {}
Exception: "TypeError('an integer is required')"

2024-06-28 04:18:41,366 - distributed.worker - WARNING - Compute Failed
Key:       _call_plc_sssp-80268bf8174714b457bb71d2484b69a7
Function:  _call_plc_sssp
args:      (b'K\xe15$\xb2\xc3KF\x9e\xa5\xcc2\xc0B\x1f\xf3', <pylibcugraph.graphs.MGGraph object at 0x14545cf5f530>, Dask Series Structure:
npartitions=16
    int32
      ...
    ...  
      ...
      ...
Dask Name: try_loc, 26 expressions
Expr=LocUnknown(frame=(Concat(frames=[ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['src'], name='src'), ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['dst'], name='dst'), ((Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))[Index(['value'], dtype='object')])[['value']]], axis=1))['src'], iindexer=slice(0, 0, None)), inf, True, False)
kwargs:    {}
Exception: "TypeError('an integer is required')"

2024-06-28 04:18:41,370 - distributed.worker - WARNING - Compute Failed
Key:       _call_plc_sssp-d624b0c0c32a2102cc419aa0596c6d19
Function:  _call_plc_sssp
args:      (b'K\xe15$\xb2\xc3KF\x9e\xa5\xcc2\xc0B\x1f\xf3', <pylibcugraph.graphs.MGGraph object at 0x15347217dab0>, Dask Series Structure:
npartitions=16
    int32
      ...
    ...  
      ...
      ...
Dask Name: try_loc, 26 expressions
Expr=LocUnknown(frame=(Concat(frames=[ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['src'], name='src'), ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['dst'], name='dst'), ((Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))[Index(['value'], dtype='object')])[['value']]], axis=1))['src'], iindexer=slice(0, 0, None)), inf, True, False)
kwargs:    {}
Exception: "TypeError('an integer is required')"

2024-06-28 04:18:41,380 - distributed.worker - WARNING - Compute Failed
Key:       _call_plc_sssp-9a92f2aaf91ec752f6b47fe090f53526
Function:  _call_plc_sssp
args:      (b'K\xe15$\xb2\xc3KF\x9e\xa5\xcc2\xc0B\x1f\xf3', <pylibcugraph.graphs.MGGraph object at 0x155237245430>, Dask Series Structure:
npartitions=16
    int32
      ...
    ...  
      ...
      ...
Dask Name: try_loc, 26 expressions
Expr=LocUnknown(frame=(Concat(frames=[ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['src'], name='src'), ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['dst'], name='dst'), ((Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))[Index(['value'], dtype='object')])[['value']]], axis=1))['src'], iindexer=slice(0, 0, None)), inf, True, False)
kwargs:    {}
Exception: "TypeError('an integer is required')"

2024-06-28 04:18:41,391 - distributed.worker - WARNING - Compute Failed
Key:       _call_plc_sssp-6f8df8539f1a639b0af6f1a8794da1d6
Function:  _call_plc_sssp
args:      (b'K\xe15$\xb2\xc3KF\x9e\xa5\xcc2\xc0B\x1f\xf3', <pylibcugraph.graphs.MGGraph object at 0x1534122f30b0>, Dask Series Structure:
npartitions=16
    int32
      ...
    ...  
      ...
      ...
Dask Name: try_loc, 26 expressions
Expr=LocUnknown(frame=(Concat(frames=[ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['src'], name='src'), ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['dst'], name='dst'), ((Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))[Index(['value'], dtype='object')])[['value']]], axis=1))['src'], iindexer=slice(0, 0, None)), inf, True, False)
kwargs:    {}
Exception: "TypeError('an integer is required')"

2024-06-28 04:18:41,404 - distributed.worker - WARNING - Compute Failed
Key:       _call_plc_sssp-478197bce70a075fd0bc6ffe2c1384a6
Function:  _call_plc_sssp
args:      (b'K\xe15$\xb2\xc3KF\x9e\xa5\xcc2\xc0B\x1f\xf3', <pylibcugraph.graphs.MGGraph object at 0x155001c9d6f0>, Dask Series Structure:
npartitions=16
    int32
      ...
    ...  
      ...
      ...
Dask Name: try_loc, 26 expressions
Expr=LocUnknown(frame=(Concat(frames=[ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['src'], name='src'), ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['dst'], name='dst'), ((Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))[Index(['value'], dtype='object')])[['value']]], axis=1))['src'], iindexer=slice(0, 0, None)), inf, True, False)
kwargs:    {}
Exception: "TypeError('an integer is required')"

2024-06-28 04:18:41,406 - distributed.worker - WARNING - Compute Failed
Key:       _call_plc_sssp-f603628522884f48855d7ee4076384a5
Function:  _call_plc_sssp
args:      (b'K\xe15$\xb2\xc3KF\x9e\xa5\xcc2\xc0B\x1f\xf3', <pylibcugraph.graphs.MGGraph object at 0x14f3a1e58370>, Dask Series Structure:
npartitions=16
    int32
      ...
    ...  
      ...
      ...
Dask Name: try_loc, 26 expressions
Expr=LocUnknown(frame=(Concat(frames=[ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['src'], name='src'), ToFrame(frame=(Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))['dst'], name='dst'), ((Repartition(frame=ResetIndex(frame=MapPartitions(_add_reverse_edges), drop=True), new_partitions=16))[Index(['value'], dtype='object')])[['value']]], axis=1))['src'], iindexer=slice(0, 0, None)), inf, True, False)
kwargs:    {}
Exception: "TypeError('an integer is required')"


Dask client/cluster created using LocalCUDACluster
FAILED/opt/conda/lib/python3.10/site-packages/pytest_benchmark/logger.py:46: PytestBenchmarkWarning: Not saving anything, no benchmarks have been run!
  warner(PytestBenchmarkWarning(text))

Dask client closed.


=================================== FAILURES ===================================
________________ bench_sssp[ds:rmat_mg_20_16-mm:False-pa:True] _________________

gpubenchmark = <rapids_pytest_benchmark.plugin.GPUBenchmarkFixture object at 0x147f6673b730>
graph = <cugraph.structure.graph_classes.Graph object at 0x147f67277070>

    def bench_sssp(gpubenchmark, graph):
        sssp = dask_cugraph.sssp if is_graph_distributed(graph) else cugraph.sssp
        start = graph.edgelist.edgelist_df["src"][0]
>       gpubenchmark(sssp, graph, start)

bench_algos.py:327: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/opt/conda/lib/python3.10/site-packages/pytest_benchmark/fixture.py:125: in __call__
    return self._raw(function_to_benchmark, *args, **kwargs)
/opt/conda/lib/python3.10/site-packages/rapids_pytest_benchmark/plugin.py:322: in _raw
    function_result = super()._raw(function_to_benchmark, *args, **kwargs)
/opt/conda/lib/python3.10/site-packages/pytest_benchmark/fixture.py:147: in _raw
    duration, iterations, loops_range = self._calibrate_timer(runner)
/opt/conda/lib/python3.10/site-packages/pytest_benchmark/fixture.py:275: in _calibrate_timer
    duration = runner(loops_range)
/opt/conda/lib/python3.10/site-packages/pytest_benchmark/fixture.py:90: in runner
    function_to_benchmark(*args, **kwargs)
/opt/conda/lib/python3.10/site-packages/cugraph/dask/traversal/sssp.py:150: in sssp
    ddf = dask_cudf.from_delayed(result).persist()
/opt/conda/lib/python3.10/site-packages/dask_expr/io/_delayed.py:104: in from_delayed
    meta = delayed(make_meta)(dfs[0]).compute()
/opt/conda/lib/python3.10/site-packages/dask/base.py:375: in compute
    (result,) = compute(self, traverse=False, **kwargs)
/opt/conda/lib/python3.10/site-packages/dask/base.py:661: in compute
    results = schedule(dsk, keys, **kwargs)
/opt/conda/lib/python3.10/site-packages/cugraph/dask/traversal/sssp.py:28: in _call_plc_sssp
    vertices, distances, predecessors = pylibcugraph_sssp(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   TypeError: an integer is required

sssp.pyx:54: TypeError
------------------------------ Captured log setup ------------------------------
INFO     distributed.http.proxy:proxy.py:85 To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
INFO     distributed.scheduler:scheduler.py:1711 State start
INFO     distributed.scheduler:scheduler.py:4072   Scheduler at:     tcp://127.0.0.1:44321
INFO     distributed.scheduler:scheduler.py:4087   dashboard at:  http://127.0.0.1:8787/status
INFO     distributed.scheduler:scheduler.py:7874 Registering Worker plugin shuffle
INFO     distributed.nanny:nanny.py:368         Start Nanny at: 'tcp://127.0.0.1:46459'
INFO     distributed.nanny:nanny.py:368         Start Nanny at: 'tcp://127.0.0.1:44059'
INFO     distributed.nanny:nanny.py:368         Start Nanny at: 'tcp://127.0.0.1:40211'
INFO     distributed.nanny:nanny.py:368         Start Nanny at: 'tcp://127.0.0.1:38331'
INFO     distributed.nanny:nanny.py:368         Start Nanny at: 'tcp://127.0.0.1:45065'
INFO     distributed.nanny:nanny.py:368         Start Nanny at: 'tcp://127.0.0.1:33121'
INFO     distributed.nanny:nanny.py:368         Start Nanny at: 'tcp://127.0.0.1:44841'
INFO     distributed.nanny:nanny.py:368         Start Nanny at: 'tcp://127.0.0.1:40349'
INFO     distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:38153', name: 2, status: init, memory: 0, processing: 0>
INFO     distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:38153
INFO     distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:59152
INFO     distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:46845', name: 5, status: init, memory: 0, processing: 0>
INFO     distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:46845
INFO     distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:59154
INFO     distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:39813', name: 1, status: init, memory: 0, processing: 0>
INFO     distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:39813
INFO     distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:59156
INFO     distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:42007', name: 3, status: init, memory: 0, processing: 0>
INFO     distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:42007
INFO     distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:59148
INFO     distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:45179', name: 6, status: init, memory: 0, processing: 0>
INFO     distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:45179
INFO     distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:59160
INFO     distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:41977', name: 0, status: init, memory: 0, processing: 0>
INFO     distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:41977
INFO     distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:59150
INFO     distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:43437', name: 4, status: init, memory: 0, processing: 0>
INFO     distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:43437
INFO     distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:59158
INFO     distributed.scheduler:scheduler.py:4424 Register worker <WorkerState 'tcp://127.0.0.1:39211', name: 7, status: init, memory: 0, processing: 0>
INFO     distributed.scheduler:scheduler.py:5929 Starting worker compute stream, tcp://127.0.0.1:39211
INFO     distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:59162
INFO     distributed.scheduler:scheduler.py:5686 Receive client connection: Client-209aa2f0-3540-11ef-bff4-a81e84c35e35
INFO     distributed.core:core.py:1019 Starting established connection to tcp://127.0.0.1:59164
INFO     distributed.worker:worker.py:3178 Run out-of-band function '_func_set_scheduler_as_nccl_root'
INFO     distributed.protocol.core:core.py:111 Failed to serialize (can not serialize 'dict_keys' object); falling back to pickle. Be aware that this may degrade performance.
INFO     distributed.protocol.core:core.py:111 Failed to serialize (can not serialize 'dict_keys' object); falling back to pickle. Be aware that this may degrade performance.
---------------------------- Captured log teardown -----------------------------
INFO     distributed.worker:worker.py:3178 Run out-of-band function '_func_destroy_scheduler_session'
INFO     distributed.scheduler:scheduler.py:5730 Remove client Client-209aa2f0-3540-11ef-bff4-a81e84c35e35
INFO     distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:59164; closing.
INFO     distributed.scheduler:scheduler.py:5730 Remove client Client-209aa2f0-3540-11ef-bff4-a81e84c35e35
INFO     distributed.scheduler:scheduler.py:5722 Close client connection: Client-209aa2f0-3540-11ef-bff4-a81e84c35e35
INFO     distributed.scheduler:scheduler.py:7317 Retire worker addresses (0, 1, 2, 3, 4, 5, 6, 7)
INFO     distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:46459'. Reason: nanny-close
INFO     distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO     distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:44059'. Reason: nanny-close
INFO     distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO     distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:40211'. Reason: nanny-close
INFO     distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO     distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:38331'. Reason: nanny-close
INFO     distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO     distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:45065'. Reason: nanny-close
INFO     distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO     distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:33121'. Reason: nanny-close
INFO     distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO     distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:44841'. Reason: nanny-close
INFO     distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO     distributed.nanny:nanny.py:609 Closing Nanny at 'tcp://127.0.0.1:40349'. Reason: nanny-close
INFO     distributed.nanny:nanny.py:854 Nanny asking worker to close. Reason: nanny-close
INFO     distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:59162; closing.
INFO     distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:59150; closing.
INFO     distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:59148; closing.
INFO     distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:59158; closing.
INFO     distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:39211', name: 7, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573522.2782512')
INFO     distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:59156; closing.
INFO     distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:59152; closing.
INFO     distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:59160; closing.
INFO     distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:41977', name: 0, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573522.2792294')
INFO     distributed.core:core.py:1044 Received 'close-stream' from tcp://127.0.0.1:59154; closing.
INFO     distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:42007', name: 3, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573522.2796857')
INFO     distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:43437', name: 4, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573522.27991')
INFO     distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:39813', name: 1, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573522.2805834')
INFO     distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:38153', name: 2, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573522.2808607')
INFO     distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:45179', name: 6, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573522.2810738')
INFO     distributed.scheduler:scheduler.py:5204 Remove worker <WorkerState 'tcp://127.0.0.1:46845', name: 5, status: closing, memory: 0, processing: 0> (stimulus_id='handle-worker-cleanup-1719573522.2812953')
INFO     distributed.scheduler:scheduler.py:5331 Lost all workers
INFO     distributed.batched:batched.py:122 Batched Comm Closed <TCP (closed) Scheduler connection to worker local=tcp://127.0.0.1:44321 remote=tcp://127.0.0.1:59152>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/distributed/batched.py", line 115, in _background_send
    nbytes = yield coro
  File "/opt/conda/lib/python3.10/site-packages/tornado/gen.py", line 766, in run
    value = future.result()
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 262, in write
    raise CommClosedError()
distributed.comm.core.CommClosedError
INFO     distributed.batched:batched.py:122 Batched Comm Closed <TCP (closed) Scheduler connection to worker local=tcp://127.0.0.1:44321 remote=tcp://127.0.0.1:59156>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/distributed/batched.py", line 115, in _background_send
    nbytes = yield coro
  File "/opt/conda/lib/python3.10/site-packages/tornado/gen.py", line 766, in run
    value = future.result()
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 262, in write
    raise CommClosedError()
distributed.comm.core.CommClosedError
INFO     distributed.batched:batched.py:122 Batched Comm Closed <TCP (closed) Scheduler connection to worker local=tcp://127.0.0.1:44321 remote=tcp://127.0.0.1:59160>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/distributed/batched.py", line 115, in _background_send
    nbytes = yield coro
  File "/opt/conda/lib/python3.10/site-packages/tornado/gen.py", line 766, in run
    value = future.result()
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 262, in write
    raise CommClosedError()
distributed.comm.core.CommClosedError
INFO     distributed.batched:batched.py:122 Batched Comm Closed <TCP (closed) Scheduler connection to worker local=tcp://127.0.0.1:44321 remote=tcp://127.0.0.1:59154>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/distributed/batched.py", line 115, in _background_send
    nbytes = yield coro
  File "/opt/conda/lib/python3.10/site-packages/tornado/gen.py", line 766, in run
    value = future.result()
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/tcp.py", line 262, in write
    raise CommClosedError()
distributed.comm.core.CommClosedError
INFO     distributed.scheduler:scheduler.py:4146 Scheduler closing due to unknown reason...
INFO     distributed.scheduler:scheduler.py:4164 Scheduler closing all comms
=============================== warnings summary ===============================
cugraph/pytest-based/bench_algos.py::bench_sssp[ds:rmat_mg_20_16-mm:False-pa:True]
cugraph/pytest-based/bench_algos.py::bench_sssp[ds:rmat_mg_20_16-mm:False-pa:True]
  /opt/conda/lib/python3.10/site-packages/cudf/core/reshape.py:350: FutureWarning: The behavior of array concatenation with empty entries is deprecated. In a future version, this will no longer exclude empty items when determining the result dtype. To retain the old behavior, exclude the empty entries before the concat operation.
    warnings.warn(

cugraph/pytest-based/bench_algos.py::bench_sssp[ds:rmat_mg_20_16-mm:False-pa:True]
  cugraph/pytest-based/bench_algos.py:324: PytestBenchmarkWarning: Benchmark fixture was not used at all in this test!

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
- generated xml file: /gpfs/fs1/projects/sw_rapids/users/rratzel/cugraph-results/latest/benchmarks/8-GPU/bench_sssp/test_results.json -
=========================== short test summary info ============================
FAILED bench_algos.py::bench_sssp[ds:rmat_mg_20_16-mm:False-pa:True] - TypeEr...
================ 1 failed, 719 deselected, 3 warnings in 36.32s ================
06/28/24-11:18:45.978123558_UTC>>>> NODE 0: pytest exited with code: 1, run-py-tests.sh overall exit code is: 1
06/28/24-11:18:46.081545937_UTC>>>> NODE 0: remaining python processes: [ 2535211 /usr/bin/python2 /usr/local/dcgm-nvdataflow/DcgmNVDataflowPoster.py ]
06/28/24-11:18:46.107405380_UTC>>>> NODE 0: remaining dask processes: [  ]

Environment details

Being run inside the nightly cugraph MNMG testing containers

Other/Misc.

@jnke2016 and @nv-rliu are looking into this.

Code of Conduct

  • I agree to follow cuGraph's Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report
@nv-rliu nv-rliu added bug Something isn't working graph-devops Issues for the graph-devops team benchmarks labels Jun 28, 2024
@nv-rliu nv-rliu added this to the 24.08 milestone Jun 28, 2024
@rapids-bot rapids-bot bot closed this as completed in a41d6b0 Jul 19, 2024