
Modin fails on a simple code snippet with ray==2.1.0 in development environment #5208

Closed
dchigarev opened this issue Nov 8, 2022 · 6 comments · Fixed by #5283
Labels: bug 🦗 Something isn't working · dependencies 🔗 Issues related to dependencies · P1 Important tasks that we should complete soon · Ray ⚡ Issues related to the Ray engine

Comments

dchigarev (Collaborator) commented Nov 8, 2022

Modin fails with the recently released ray 2.1.0 even for this simple code snippet:

import modin.pandas as pd
print((pd.DataFrame([[1]]) + 1)._to_pandas())
Traceback
2022-11-08 12:37:32,128 ERROR services.py:1403 -- Failed to start the dashboard: Failed to start the dashboard, return code -11
 The last 10 lines of /tmp/ray/session_2022-11-08_12-37-30_420275_2718136/logs/dashboard.log:
2022-11-08 12:37:32,128 ERROR services.py:1404 -- Failed to start the dashboard, return code -11
 The last 10 lines of /tmp/ray/session_2022-11-08_12-37-30_420275_2718136/logs/dashboard.log:
Traceback (most recent call last):
  File "/localdisk/dchigare/miniconda3/envs/test_test_modin_/lib/python3.8/site-packages/ray/_private/services.py", line 1389, in start_api_server
    raise Exception(err_msg + last_log_str)
Exception: Failed to start the dashboard, return code -11
 The last 10 lines of /tmp/ray/session_2022-11-08_12-37-30_420275_2718136/logs/dashboard.log:
2022-11-08 12:37:32,251 INFO worker.py:1528 -- Started a local Ray instance.
UserWarning: Distributing <class 'list'> object. This may take some time.
Traceback (most recent call last):
  File "t3.py", line 6, in <module>
    print((pd.DataFrame([[1]]) + 1)._to_pandas())
  File "/localdisk/dchigare/repos/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/localdisk/dchigare/repos/modin/modin/pandas/dataframe.py", line 2883, in _to_pandas
    return self._query_compiler.to_pandas()
  File "/localdisk/dchigare/repos/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/localdisk/dchigare/repos/modin/modin/core/storage_formats/pandas/query_compiler.py", line 287, in to_pandas
    return self._modin_frame.to_pandas()
  File "/localdisk/dchigare/repos/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/localdisk/dchigare/repos/modin/modin/core/dataframe/pandas/dataframe/dataframe.py", line 125, in run_f_on_minimally_updated_metadata
    result = f(self, *args, **kwargs)
  File "/localdisk/dchigare/repos/modin/modin/core/dataframe/pandas/dataframe/dataframe.py", line 3089, in to_pandas
    df = self._partition_mgr_cls.to_pandas(self._partitions)
  File "/localdisk/dchigare/repos/modin/modin/logging/logger_decorator.py", line 128, in run_and_log
    return obj(*args, **kwargs)
  File "/localdisk/dchigare/repos/modin/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 644, in to_pandas
    retrieved_objects = [[obj.to_pandas() for obj in part] for part in partitions]
  File "/localdisk/dchigare/repos/modin/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 644, in <listcomp>
    retrieved_objects = [[obj.to_pandas() for obj in part] for part in partitions]
  File "/localdisk/dchigare/repos/modin/modin/core/dataframe/pandas/partitioning/partition_manager.py", line 644, in <listcomp>
    retrieved_objects = [[obj.to_pandas() for obj in part] for part in partitions]
  File "/localdisk/dchigare/repos/modin/modin/core/dataframe/pandas/partitioning/partition.py", line 145, in to_pandas
    dataframe = self.get()
  File "/localdisk/dchigare/repos/modin/modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition.py", line 81, in get
    result = RayWrapper.materialize(self._data)
  File "/localdisk/dchigare/repos/modin/modin/core/execution/ray/common/engine_wrapper.py", line 92, in materialize
    return ray.get(obj_id)
  File "/localdisk/dchigare/miniconda3/envs/test_test_modin_/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/localdisk/dchigare/miniconda3/envs/test_test_modin_/lib/python3.8/site-packages/ray/_private/worker.py", line 2291, in get
    raise value
ray.exceptions.LocalRayletDiedError: The task's local raylet died. Check raylet.out for more information.
(raylet) [2022-11-08 12:37:32,802 E 2718405 2718451] (raylet) agent_manager.cc:134: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. See `dashboard_agent.log` for the root cause.

The log files that the error mentions don't seem to contain any useful info:

dashboard_agent.log

2022-11-08 12:37:32,673 INFO agent.py:102 -- Parent pid is 2718405
2022-11-08 12:37:32,674 INFO agent.py:128 -- Dashboard agent grpc address: 0.0.0.0:63816
python-core-driver0.log
Global stats: 12 total (8 active)
Queueing time: mean = 20.740 us, max = 112.582 us, min = 7.329 us, total = 248.875 us
Execution time:  mean = 9.748 us, total = 116.980 us
Event stats:
        PeriodicalRunner.RunFnPeriodically - 6 total (4 active, 1 running), CPU time: mean = 1.763 us, total = 10.579 us
        InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), CPU time: mean = 88.407 us, total = 88.407 us
        WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 17.994 us, total = 17.994 us
        InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
        CoreWorker.deadline_timer.flush_profiling_events - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
        NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
        UNKNOWN - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s


[2022-11-08 12:37:32,260 I 2718136 2718471] accessor.cc:608: Received notification for node id = b8eb05a6aaee9b4f7f7d8b5c2188bfc617ee57f6583b546a554f6939, IsAlive = 1
[2022-11-08 12:37:32,807 W 2718136 2718471] direct_task_transport.cc:488: The worker failed to receive a response from the local raylet because the raylet is unavailable (crashed). Error: GrpcUnavailable: RPC Error message: Socket closed; RPC Error details:
[2022-11-08 12:37:32,807 I 2718136 2718471] task_manager.cc:507: Task failed: GrpcUnavailable: RPC Error message: Socket closed; RPC Error details: : Type=NORMAL_TASK, Language=PYTHON, Resources: {CPU: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=modin.core.execution.ray.implementations.pandas_on_ray.partitioning.partition, class_name=, function_name=_apply_func, function_hash=f11822ef8d2c45b093f2b1e8e55ee73f}, task_id=c8ef45ccd0112571ffffffffffffffffffffffff01000000, task_name=_apply_func, job_id=01000000, num_args=4, num_returns=4, depth=1, attempt_number=0, max_retries=3, serialized_runtime_env={"env_vars": {"__MODIN_AUTOIMPORT_PANDAS__": "1"}}, eager_install=1, setup_timeout_seconds=600
[2022-11-08 12:37:33,260 I 2718136 2718471] raylet_client.cc:364: Error reporting task backlog information: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
[2022-11-08 12:37:34,261 I 2718136 2718471] raylet_client.cc:364: Error reporting task backlog information: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details:
[2022-11-08 12:37:34,268 I 2718136 2718136] core_worker.cc:637: Disconnecting to the raylet.
[2022-11-08 12:37:34,268 I 2718136 2718136] raylet_client.cc:163: RayletClient::Disconnect, exit_type=INTENDED_USER_EXIT, exit_detail=Shutdown by ray.shutdown()., has creation_task_exception_pb_bytes=0
[2022-11-08 12:37:34,268 W 2718136 2718136] raylet_client.cc:188: IOError: Broken pipe [RayletClient] Failed to disconnect from raylet. This means the raylet the worker is connected is probably already dead.
[2022-11-08 12:37:34,268 I 2718136 2718136] core_worker.cc:580: Shutting down a core worker.
[2022-11-08 12:37:34,268 I 2718136 2718136] core_worker.cc:604: Disconnecting a GCS client.
[2022-11-08 12:37:34,268 I 2718136 2718136] core_worker.cc:608: Waiting for joining a core worker io thread. If it hangs here, there might be deadlock or a high load in the core worker io service.
[2022-11-08 12:37:34,268 I 2718136 2718471] core_worker.cc:736: Core worker main io service stopped.
[2022-11-08 12:37:34,268 W 2718136 2718474] metric_exporter.cc:209: [1] Export metrics to agent failed: GrpcUnavailable: RPC Error message: failed to connect to all addresses; RPC Error details: . This won't affect Ray, but you can lose metrics from the cluster.
[2022-11-08 12:37:34,268 I 2718136 2718136] core_worker.cc:617: Core worker ready to be deallocated.
[2022-11-08 12:37:34,268 I 2718136 2718136] core_worker.cc:571: Core worker is destructed
[2022-11-08 12:37:34,518 I 2718136 2718136] core_worker_process.cc:147: Destructing CoreWorkerProcessImpl. pid: 2718136
[2022-11-08 12:37:34,518 I 2718136 2718136] io_service_pool.cc:47: IOServicePool is stopped.

The error occurs only on Linux and only in our development environment (environment-dev.yml). A freshly installed Modin with ray 2.1.0 works fine:

$ conda create -n clean_env python=3.8
$ pip install modin/
$ pip install ray==2.1.0
$ python reproducer.py # works fine
----
$ conda env create -f modin/environment-dev.yml
$ python reproducer.py # fails

The error also occurs in our CI, where all of the non-Windows jobs with Ray fail. Example: https://github.com/modin-project/modin/actions/runs/3420080409/jobs/5697742798


P.S. The issue doesn't seem to be related to the redis version requirement in our env recipe:

- redis>=3.5.0,<4.0.0

Both the working and non-working environments use the same version of redis (3.5.1).
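
This can be confirmed with a quick check along these lines (illustrative commands; they assume the dev environment is named "modin", as set in environment-dev.yml):

$ conda run -n clean_env python -c "import redis; print(redis.__version__)"  # 3.5.1
$ conda run -n modin python -c "import redis; print(redis.__version__)"      # 3.5.1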

@dchigarev dchigarev added bug 🦗 Something isn't working dependencies 🔗 Issues related to dependencies P0 Highest priority tasks requiring immediate fix labels Nov 8, 2022
dchigarev (Collaborator, Author) commented:

It's not yet clear what (or which package) is causing the problem. I would appreciate it if someone else could look into this too. @modin-project/modin-core
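
One way to start narrowing it down (a sketch, assuming the clean and dev environments from the reproduction steps above are still available):

$ conda list -n clean_env --export > clean.txt   # working environment
$ conda list -n modin --export > dev.txt         # failing environment ("modin" is the name from environment-dev.yml)
$ diff clean.txt dev.txt                         # packages whose versions differ are the suspects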

pyrito (Collaborator) commented Nov 8, 2022

@dchigarev should we pin the version of Ray for now while we have someone look into the issue?

dchigarev (Collaborator, Author) commented:

> @dchigarev should we pin the version of Ray for now while we have someone look into the issue?

Sure, created a PR for this (#5209).
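
For reference, the workaround boils down to a one-line change in the dependency specs, along these lines (illustrative bounds; see #5209 for the actual pin):

- ray>=1.13.0,!=2.1.0  # exclude the broken release until the root cause is found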

dchigarev added a commit to dchigarev/modin that referenced this issue Nov 8, 2022
@mvashishtha mvashishtha added P1 Important tasks that we should complete soon Ray ⚡ Issues related to the Ray engine and removed P0 Highest priority tasks requiring immediate fix labels Nov 9, 2022
@vnlitvinov vnlitvinov self-assigned this Nov 9, 2022
h-vetinari commented Nov 14, 2022

> The error occurs only on Linux and only in our development environment (environment-dev.yml).

There are a lot of things wrong with that environment:

  • dask[complete]>=2.22.0 - conda does not understand this syntax and it does not do what you expect; it only happens to work by accident. Unfortunately (and in contrast to other packages like ray and modin) there's no direct replacement in conda-forge yet.
  • matplotlib<=3.2.2 - why oh why are you forcing the use of a version that's over 2 years old? This will force the solver into very weird contortions.
  • coverage<5.0 - likewise, over 3 years old
  • pygithub==1.53 - over 2 years old
  • rpyc==4.1.5 - over 2 years old

Generally:

  • pandas shouldn't be pinned to patch version, see Do not pin pandas down to patch level #3371
  • move asv, black, connectorx, flake8, numpydoc, tqdm, xgboost from pip to conda-forge dependencies (ideally also ray)
  • please alphabetize your dependencies (see the sketch after this list)
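
A minimal sketch (not the actual file) of what a solver-friendlier environment-dev.yml fragment could look like with the advice above applied; all names and bounds are illustrative:

channels:
  - conda-forge
dependencies:
  # alphabetized, no patch-level or years-old pins
  - black                # moved from pip to conda-forge
  - coverage             # instead of coverage<5.0
  - dask>=2.22.0         # conda has no [complete] extra; list the needed components explicitly
  - distributed>=2.22.0
  - matplotlib           # instead of matplotlib<=3.2.2
  - pandas>=1.5          # lower bound only, per #3371
  - ray-default>=1.13.0  # ray from conda-forge instead of pip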

Of course, things shouldn't break where avoidable, sorry for that (and we're looking into what's going on with grpcio 1.49.1 in this context). Still, I feel obliged to point out that your requirement constraints are somewhere between "actively hostile to the solver" and "shooting yourself in the foot". I've kept banging that drum in #3371 and on the feedstock, but I cannot seem to get that point across.

It's one of the prime reasons I (personally) haven't spent much time with modin - it doesn't "play nice with others" from the POV of dependency constraints, which makes it very hard to use together with other projects (and their dependencies)¹. Note that I'm not criticising temporary pins to unbreak CI, but with multiple several-year-old constraints, you're externalising a lot of costs onto the ecosystem (incl. other libraries, packagers and your users).

Footnotes

  1. It also makes it really hard to debug situations like the one in this bug.

pyrito (Collaborator) commented Nov 14, 2022

@h-vetinari thank you for raising this point! I think a lot of what you said makes sense and can be addressed in a few PRs. I agree with you that pinning versions can cause dependency headaches.

You raise a fair point w.r.t. pinning pandas to the patch level. We will take steps towards coming to a solution there!

anmyachev (Collaborator) commented:

@h-vetinari I think it will be interesting for you to take a look at #5270 (comment). In short, the installed grpcio package sometimes has no version restriction on libprotobuf, resulting in a segfault. Maybe you know an easy way to exclude such grpcio packages?
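
One blunt option (a sketch, not a confirmed fix; the exact bad versions would need verifying) is to constrain both packages explicitly in the environment file, so the solver cannot pair a grpcio build with an incompatible libprotobuf:

- grpcio!=1.49.1     # the version mentioned above in connection with the segfault
- libprotobuf>=3.21  # illustrative lower bound matching newer grpcio builds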

YarShev pushed a commit that referenced this issue Dec 2, 2022