[rllib] Memory leak in environment worker in multi-agent setup #9964
Comments
@sergeivolodin How high does it go? Also, can you try different TensorFlow versions?
@richardliaw Here it ate all the remaining memory on my laptop and crashed WSL. Which versions of TensorFlow do you want me to try?
We also observed the same issue running QMIX with PyTorch on SMAC. In my case, the machine has 16 GB of RAM, and training would eventually consume all of it and crash (at around iteration 600+ on the 2s3z map). Ray version: 0.8.7.
@jyericlin as a workaround, we create a separate process for every training step, which opens a checkpoint, does the iteration, and then saves a checkpoint (pseudocode below, fully functional example here). The extra time spent on creating the process and checkpointing does not seem too bad (for our case!):

while True:
    config_filename = pickle_config(config)
    # start a new Python process from bash;
    # important: can't just fork, because of
    # TensorFlow + fork import issues
    start_process_and_wait(target=train, args=(config_filename,))
    results = unpickle_results(config)
    delete_temporary_files()
    if results['iteration'] > N:
        break
    config['checkpoint'] = results['checkpoint']

def train(config_filename):
    config = unpickle_config(config_filename)
    trainer = PPO(config)
    trainer.restore(config['checkpoint'])
    results = trainer.train()
    checkpoint = trainer.save()
    results['checkpoint'] = checkpoint
    pickle_results(results)
    # important -- otherwise memory will go up!
    trainer.stop()
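For readers who want something copy-pasteable, here is a minimal runnable sketch of the same restart-per-iteration pattern, using multiprocessing with the spawn start method so TensorFlow and Ray are imported fresh in each child. The env name, file path, and iteration count are placeholders, not taken from the comment above, and the RLlib calls assume the 0.8.x-era PPOTrainer API.

# Sketch only: restart a fresh process for each training iteration so any
# leaked memory is reclaimed when the child exits. Env, paths, and constants
# are illustrative placeholders.
import multiprocessing as mp
import pickle


def train_one_iteration(config, checkpoint, result_path):
    # Import inside the child so TensorFlow/Ray state never touches the parent.
    import ray
    from ray.rllib.agents.ppo import PPOTrainer

    ray.init(ignore_reinit_error=True)
    trainer = PPOTrainer(config=config)
    if checkpoint:
        trainer.restore(checkpoint)
    result = trainer.train()          # metrics dict; only the iteration count is kept here
    checkpoint = trainer.save()
    trainer.stop()                    # important -- without this, memory keeps growing
    ray.shutdown()
    with open(result_path, "wb") as f:
        pickle.dump({"checkpoint": checkpoint,
                     "iteration": result["training_iteration"]}, f)


if __name__ == "__main__":
    mp.set_start_method("spawn")      # plain fork + TensorFlow is not safe
    config = {"env": "CartPole-v0", "num_workers": 2}
    checkpoint = None
    result_path = "/tmp/last_result.pkl"
    for _ in range(100):              # number of iterations is arbitrary here
        p = mp.Process(target=train_one_iteration,
                       args=(config, checkpoint, result_path))
        p.start()
        p.join()
        with open(result_path, "rb") as f:
            checkpoint = pickle.load(f)["checkpoint"]

Spawning rather than forking matters here: a forked child inherits the parent's TensorFlow state, which is exactly the kind of hidden sharing the comment above warns about.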
@richardliaw any updates? Do you want me to try different TF versions? Thank you.
Hmm, could you try the latest Ray wheels (latest snapshot of master) to see if this was fixed?
@richardliaw same thing.
@richardliaw is there a way to attach a Python debugger to a Ray worker, preferably from an IDE like PyCharm? I could take a look at where and why the memory is being used.
There's now a Ray PDB tool: https://docs.ray.io/en/master/ray-debugging.html?highlight=debugger
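For anyone following that link, the workflow it describes is roughly: drop a breakpoint into the remote code, run the job, and attach from a shell. A sketch, assuming the ray.util.pdb API and the ray debug CLI from that page are available in your Ray version (it is not a PyCharm attach):

# Sketch of the Ray debugger workflow from the docs page above.
import ray

ray.init()

@ray.remote
def leaky_task(x):
    # Execution pauses here; attach from another terminal with `ray debug`.
    ray.util.pdb.set_trace()
    return 2 * x

print(ray.get(leaky_task.remote(21)))
# In a second terminal on the same machine:
#   ray debug    # lists active breakpoints and opens a PDB session on one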
Is there any update on this issue? I have been dealing with the same problem.
Hi @sergeivolodin, how are you plotting these graphs, please? I suspect I'm having a similar issue.
@azzeddineCH using this custom script: https://github.com/HumanCompatibleAI/better-adversarial-defenses/tree/master/other/memory_profile
Good luck!
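The linked script is specific to that repository; a generic way to get similar per-worker memory curves is to sample the RSS of the Ray worker processes with psutil and plot the resulting CSV afterwards. The process-name matching ("ray::", "raylet"), the sampling interval, and the output path below are assumptions, not details from the linked script.

# Sketch: periodically record RSS of Ray worker processes to a CSV for plotting.
import csv
import time

import psutil

with open("/tmp/ray_worker_memory.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["time_s", "pid", "cmdline", "rss_mb"])
    start = time.time()
    while True:
        for proc in psutil.process_iter(["pid", "cmdline", "memory_info"]):
            cmd = " ".join(proc.info["cmdline"] or [])
            mem = proc.info["memory_info"]
            if mem is None or not ("ray::" in cmd or "raylet" in cmd):
                continue
            writer.writerow([round(time.time() - start, 1),
                             proc.info["pid"],
                             cmd[:60],
                             round(mem.rss / 1e6, 1)])
        f.flush()
        time.sleep(5)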
I've done some more digging using tracemalloc, and I don't think the bug is actually in the Python code, as there are no large, consistent allocations of Python objects. That leaves a potential issue with PyTorch or some C++ library, or something to do with how Ray handles worker memory. The bug also does not happen consistently for me across machines: on servers where cgroups are enabled I get the memory leak, but on my local machine I do not. @sven1977 any ideas?
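For anyone repeating that tracemalloc check, one way to do it is to diff snapshots between training iterations; if the Python-level allocations stay flat while process RSS keeps growing, the leak is most likely in native code, as suggested above. A minimal sketch (where you call it, e.g. inside a rollout worker or around trainer.train(), is up to you):

# Sketch: compare tracemalloc snapshots between iterations and print the
# allocation sites that grew the most.
import tracemalloc

tracemalloc.start(25)  # keep up to 25 frames per allocation site
_previous = tracemalloc.take_snapshot()


def report_python_allocations(top_n=10):
    """Print the allocation sites that grew the most since the last call."""
    global _previous
    current = tracemalloc.take_snapshot()
    for stat in current.compare_to(_previous, "lineno")[:top_n]:
        print(stat)
    _previous = current

# e.g. call report_python_allocations() after every trainer.train() step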
What is the problem?
When training in a multi-agent environment using multiple environment workers, the memory of the workers increases constantly and is not released after the policy updates.
If no memory limit is set, the processes run out of system memory and are killed. If the memory_per_worker limit is set, they exceed the limit and are killed.

Ray version and other system information (Python version, TensorFlow version, OS):
ray==0.8.6
Python 3.8.5 (default, Aug 5 2020, 08:36:46)
tensorflow==2.3.0
Ubuntu 18.04.4 LTS (GNU/Linux 4.19.121-microsoft-standard x86_64)
(same behavior outside WSL as well)

Reproduction
Run this and measure memory consumption. If you remove the memory_per_worker limits, it will take longer, as workers will try to consume all system memory.
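The reproduction script itself is not included in this thread excerpt. Below is a rough approximation of the setup described (a multi-agent env, several rollout workers, per-worker memory limits), written against the Ray 0.8.x config keys named above; the example env import path and all numbers are assumptions, and all agents share the single default policy in this sketch.

# Sketch of the kind of setup described in the report (NOT the original script).
import ray
from ray import tune
from ray.rllib.examples.env.multi_agent import MultiAgentCartPole

ray.init()
tune.run(
    "PPO",
    stop={"training_iteration": 1000},
    config={
        "env": MultiAgentCartPole,
        "env_config": {"num_agents": 4},
        "num_workers": 4,
        # Per-worker limits (Ray 0.8.x config keys): the leaking workers hit
        # these caps and get killed instead of exhausting the whole machine.
        "memory_per_worker": int(1e9),
        "object_store_memory_per_worker": int(2e8),
    },
)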