Out of Memory Bug #93

I trained the model on 28,000 CIF files, but every time I ran it I got this error:

"slurmstepd: error: Detected 1 oom-kill event(s) in StepId=59934605.batch. Some of your processes may have been killed by the cgroup out-of-memory handler."

I already allocated 500 GB of CPU memory, so why is it still running out of memory?

Comments
Hi @jiruijin, this looks like an OOM crash during the precomputation of the graphs and line graphs. We've been storing those in memory, but if you have a large dataset or very large supercells the memory cost gets too high. If you have large supercells, it's likely to be a problem for GPU memory utilization as well. I think the long-term fix is to switch to this vectorized graph construction and stop storing the fully-featurized graphs in memory. That's been on the back burner for a bit; I have to make some time to bring it over the line.
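For context, the memory-light alternative looks roughly like the sketch below: keep only file paths and targets in memory and build each graph/line-graph pair inside the Dataset's `__getitem__`. This is a minimal illustration, not the actual alignn implementation; `build_graph` is a placeholder for whatever graph-construction routine is used.

```python
from torch.utils.data import Dataset


class LazyGraphDataset(Dataset):
    """Construct graphs on demand instead of caching the fully-featurized list.

    `build_graph` is a placeholder callable (path -> (graph, line_graph));
    only file paths and targets are kept in memory.
    """

    def __init__(self, cif_paths, targets, build_graph):
        self.cif_paths = cif_paths      # lightweight: just strings
        self.targets = targets          # lightweight: scalars/arrays
        self.build_graph = build_graph  # graph construction routine

    def __len__(self):
        return len(self.cif_paths)

    def __getitem__(self, idx):
        # The graph and line graph are built here, per sample, so peak host
        # memory stays around one minibatch rather than the whole dataset.
        g, lg = self.build_graph(self.cif_paths[idx])
        return g, lg, self.targets[idx]
```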
Thanks a lot for answering so fast @bdecost, but I think when you trained the model on the Materials Project dataset, it should have been larger than mine. I tried with a smaller dataset of about 4,000 CIF files and it worked. It seems that for now I can only decrease the size of the dataset; I hope you can fix it soon.
How many atoms are in a typical supercell for your dataset? The other week a collaborator was having a similar issue on a dataset with ~800-atom supercells, so I do need to prioritize this soon.
The dataset I am using is from OQMD. Half of it (about 12,000 structures) has only five atoms per cell; in the rest, the atom count ranges from 10 to 40.
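(For reference, per-structure atom counts like these can be checked with something along these lines, assuming pymatgen is installed; the directory path is a placeholder.)

```python
from pathlib import Path

from pymatgen.core import Structure

# Placeholder directory containing the CIF files.
cif_dir = Path("cifs")

# len(Structure) returns the number of sites in the cell.
counts = [len(Structure.from_file(str(p))) for p in sorted(cif_dir.glob("*.cif"))]
print(min(counts), max(counts), sum(counts) / len(counts))
```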
You can lower the batch size to something like 2 or 5, which should resolve the issue.
Thank you @knc6, I will try it now!
I got the exact same error when I lowered the batch size to 5 or 8.
Can you clarify exactly when the resource manager is killing your job? From your screen cap it looks like it happens during dataloader setup, before any data is sent to the GPU.
I'm a little surprised this dataset is hitting your memory limit, but I will investigate. Do you mind sharing the Slurm configuration you're using and maybe the limits of the partition you're running on?
To clarify a bit: if the GPU memory limit were the issue, you'd get a CUDA out-of-memory error, and Slurm would kill your job because the training script crashed. I think you are running into a memory-versus-compute tradeoff we made because our graph construction code was slow. I'm working on fixing that, but there are a couple of other fixes I'm trying to land at the same time...
Right, ok. How many dataloader workers are you using? Each worker process apparently makes a full copy of the dataset, so if you're using several you can try reducing that for now as a band-aid. That's not a real solution, but I'll try to get a fix out this week or next. There are a few things to do, most importantly computing the graphs during minibatch construction instead of caching them all in memory.
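As a generic illustration of the two knobs mentioned above (not the project's actual training script): in a plain PyTorch `DataLoader`, `num_workers=0` keeps loading in the main process so no worker holds its own copy of the dataset, and a small `batch_size` limits how much is materialized per step. The toy dataset below is just a stand-in.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy stand-in for the real dataset; only the loader settings matter here.
dataset = TensorDataset(torch.randn(100, 8), torch.randn(100, 1))

loader = DataLoader(
    dataset,
    batch_size=2,   # small batches keep per-step memory low
    num_workers=0,  # 0 = load in the main process; no per-worker dataset copies
    shuffle=True,
)

for features, target in loader:
    pass  # training step would go here
```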
I am using the default setting, num_workers: int = 4. I will decrease it and retry. Thank you!
I still have the same problem. I will wait for the fix.
I used the get_primitive_structure function in pymatgen on all my CIF files and set num_workers = 10 in the config file, and the problem is solved!
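(For anyone hitting the same issue, here is a minimal sketch of that preprocessing step, assuming pymatgen is installed; the input/output directory names are placeholders.)

```python
from pathlib import Path

from pymatgen.core import Structure

# Placeholder input/output directories.
in_dir = Path("cifs")
out_dir = Path("cifs_primitive")
out_dir.mkdir(exist_ok=True)

for cif in in_dir.glob("*.cif"):
    structure = Structure.from_file(str(cif))
    # Reduce the (possibly large) supercell to its primitive cell,
    # which shrinks the graphs built from each structure.
    primitive = structure.get_primitive_structure()
    primitive.to(filename=str(out_dir / cif.name))
```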