[rllib] Memory leak in environment worker in multi-agent setup #9964
Comments
@sergeivolodin How high does it go? Also, can you try different TensorFlow versions?
@richardliaw Here it ate all the remaining memory on my laptop and crashed WSL. Which versions of TensorFlow do you want me to try?
We also observed the same issue running QMIX with PyTorch on SMAC. In my case, the machine has 16 GB of RAM, and training would eventually consume all of it and crash (at around iteration 600+ on the 2s3z map). Ray version: 0.8.7.
@jyericlin as a workaround, we create a separate process for every training step, which opens a checkpoint, does the iteration, and then saves a checkpoint (pseudocode below, fully functional example here). The extra time spent on creating the process and checkpointing does not seem too bad (for our case!):

while True:
    config_filename = pickle_config(config)
    # start a new Python process from bash;
    # important: can't just fork, because of
    # TensorFlow + fork import issues
    start_process_and_wait(target=train, args=(config_filename,))
    results = unpickle_results(config)
    delete_temporary_files()
    if results['iteration'] > N:
        break
    config['checkpoint'] = results['checkpoint']

def train(config_filename):
    config = unpickle_config(config_filename)
    trainer = PPO(config)
    trainer.restore(config['checkpoint'])
    results = trainer.train()
    checkpoint = trainer.save()
    results['checkpoint'] = checkpoint
    pickle_results(results)
    # important -- otherwise memory will go up!
    trainer.stop()
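For readers who want something copy-pasteable, here is a minimal runnable sketch of the same restart-per-iteration pattern, using multiprocessing with the spawn start method so TensorFlow and Ray are imported fresh in each child. The env name, file path, and iteration count are placeholders, not taken from the comment above, and the RLlib calls assume the 0.8.x-era PPOTrainer API.

# Sketch only: restart a fresh process for each training iteration so any
# leaked memory is reclaimed when the child exits. Env, paths, and constants
# are illustrative placeholders.
import multiprocessing as mp
import pickle


def train_one_iteration(config, checkpoint, result_path):
    # Import inside the child so TensorFlow/Ray state never touches the parent.
    import ray
    from ray.rllib.agents.ppo import PPOTrainer

    ray.init(ignore_reinit_error=True)
    trainer = PPOTrainer(config=config)
    if checkpoint:
        trainer.restore(checkpoint)
    result = trainer.train()          # metrics dict; only the iteration count is kept here
    checkpoint = trainer.save()
    trainer.stop()                    # important -- without this, memory keeps growing
    ray.shutdown()
    with open(result_path, "wb") as f:
        pickle.dump({"checkpoint": checkpoint,
                     "iteration": result["training_iteration"]}, f)


if __name__ == "__main__":
    mp.set_start_method("spawn")      # plain fork + TensorFlow is not safe
    config = {"env": "CartPole-v0", "num_workers": 2}
    checkpoint = None
    result_path = "/tmp/last_result.pkl"
    for _ in range(100):              # number of iterations is arbitrary here
        p = mp.Process(target=train_one_iteration,
                       args=(config, checkpoint, result_path))
        p.start()
        p.join()
        with open(result_path, "rb") as f:
            checkpoint = pickle.load(f)["checkpoint"]

Spawning rather than forking matters here: a forked child inherits the parent's TensorFlow state, which is exactly the kind of hidden sharing the comment above warns about.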
@richardliaw any updates? Do you want me to try different TF versions? Thank you.
Hmm, could you try the latest Ray wheels (latest snapshot of master) to see if this was fixed?
@richardliaw same thing.
@richardliaw is there a way to attach a Python debugger to a Ray worker, preferably from an IDE like PyCharm? I could take a look at where and why the memory is being used.
There's now a Ray PDB tool: https://docs.ray.io/en/master/ray-debugging.html?highlight=debugger
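For anyone following that link, the workflow it describes is roughly: drop a breakpoint into the remote code, run the job, and attach from a shell. A sketch, assuming the ray.util.pdb API and the ray debug CLI from that page are available in your Ray version (it is not a PyCharm attach):

# Sketch of the Ray debugger workflow from the docs page above.
import ray

ray.init()

@ray.remote
def leaky_task(x):
    # Execution pauses here; attach from another terminal with `ray debug`.
    ray.util.pdb.set_trace()
    return 2 * x

print(ray.get(leaky_task.remote(21)))
# In a second terminal on the same machine:
#   ray debug    # lists active breakpoints and opens a PDB session on one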
Is there any update on this issue? I have been dealing with the same problem.
Hi @sergeivolodin, how are you plotting these graphs, please? I suspect I'm having a similar issue.
@azzeddineCH using this custom script: https://github.com/HumanCompatibleAI/better-adversarial-defenses/tree/master/other/memory_profile
Good luck!
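The linked script is specific to that repository; a generic way to get similar per-worker memory curves is to sample the RSS of the Ray worker processes with psutil and plot the resulting CSV afterwards. The process-name matching ("ray::", "raylet"), the sampling interval, and the output path below are assumptions, not details from the linked script.

# Sketch: periodically record RSS of Ray worker processes to a CSV for plotting.
import csv
import time

import psutil

with open("/tmp/ray_worker_memory.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["time_s", "pid", "cmdline", "rss_mb"])
    start = time.time()
    while True:
        for proc in psutil.process_iter(["pid", "cmdline", "memory_info"]):
            cmd = " ".join(proc.info["cmdline"] or [])
            mem = proc.info["memory_info"]
            if mem is None or not ("ray::" in cmd or "raylet" in cmd):
                continue
            writer.writerow([round(time.time() - start, 1),
                             proc.info["pid"],
                             cmd[:60],
                             round(mem.rss / 1e6, 1)])
        f.flush()
        time.sleep(5)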
I've done some more digging using tracemalloc, and I don't think the bug is actually in the Python code, as there are no large, consistent allocations of Python objects. That leaves a potential issue with PyTorch or some C++ library, or something to do with how Ray handles worker memory. The bug also does not happen consistently for me across machines: on servers where cgroups are enabled I get the memory leak, but on my local machine I do not. @sven1977 any ideas?
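For anyone repeating that tracemalloc check, one way to do it is to diff snapshots between training iterations; if the Python-level allocations stay flat while process RSS keeps growing, the leak is most likely in native code, as suggested above. A minimal sketch (where you call it, e.g. inside a rollout worker or around trainer.train(), is up to you):

# Sketch: compare tracemalloc snapshots between iterations and print the
# allocation sites that grew the most.
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames per allocation site
_previous = tracemalloc.take_snapshot()


def report_python_allocations(top_n=10):
    """Print the allocation sites that grew the most since the last call."""
    global _previous
    current = tracemalloc.take_snapshot()
    for stat in current.compare_to(_previous, "lineno")[:top_n]:
        print(stat)
    _previous = current

# e.g. call report_python_allocations() after every trainer.train() step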
What is the problem?
When training in a multi-agent environment using multiple environment workers, the memory of the workers increases constantly and is not released after the policy updates.
If no memory limit is set, the processes run out of system memory and are killed. If the memory_per_worker limit is set, they exceed the limit and are killed.

Ray version and other system information (Python version, TensorFlow version, OS):
ray==0.8.6
Python 3.8.5 (default, Aug 5 2020, 08:36:46)
tensorflow==2.3.0
Ubuntu 18.04.4 LTS (GNU/Linux 4.19.121-microsoft-standard x86_64)
(same behavior outside WSL as well)

Reproduction
Run this and measure memory consumption. If you remove the memory_per_worker limits, it will take longer, as workers will try to consume all system memory.
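The reproduction script itself is not included in this thread excerpt. Below is a rough approximation of the setup described (a multi-agent env, several rollout workers, per-worker memory limits), written against the Ray 0.8.x config keys named above; the example env import path and all numbers are assumptions, and all agents share the single default policy in this sketch.

# Sketch of the kind of setup described in the report (NOT the original script).
import ray
from ray import tune
from ray.rllib.examples.env.multi_agent import MultiAgentCartPole

ray.init()
tune.run(
    "PPO",
    stop={"training_iteration": 1000},
    config={
        "env": MultiAgentCartPole,
        "env_config": {"num_agents": 4},
        "num_workers": 4,
        # Per-worker limits (Ray 0.8.x config keys): the leaking workers hit
        # these caps and get killed instead of exhausting the whole machine.
        "memory_per_worker": int(1e9),
        "object_store_memory_per_worker": int(2e8),
    },
)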