Bus errors/"too many open files" errors in jobs with large numbers of Estimators using temporary files #631

Closed
tsalo opened this issue Feb 7, 2022 · 6 comments
Labels: bug, cbma, impact: low, priority: medium

Comments

Member

tsalo commented Feb 7, 2022

Summary

When running a large number of meta-analyses on the FIU HPC, I end up with either a bus error or a "too many open files" error.

The bus error occurs after ~7 hours on the FIU HPC when the temporary directory (i.e., where temporary files are written) is set to /tmp. The "too many open files" error occurs after at least 24 hours when the temporary directory is set to /scratch.

While I made progress on tracking and closing memmapped files in #597, it looks like there are still open files slipping through.

Hopefully the impact of this bug is low, in that it should only arise when running large numbers of meta-analyses with memory limits in place.

Additional details

  • NiMARE version: 0.0.12rc1

I originally noticed this problem when I started working on the NiMARE Jupyter book, which eventually led to me writing temporary files to the NiMARE data directory instead of tmpdir in #460. However, that inevitably slowed down operations on those temporary files, so I switched back in #599. My hope was that the problem would be resolved, since I never figured out the cause of the issue in the first place.

tsalo added the bug and cbma labels Feb 7, 2022
Member Author

tsalo commented Feb 8, 2022

The bus errors are inconsistent in terms of which datasets trigger them, but they tend to occur about 4.5 hours into my jobs on the FIU HPC, and they happen specifically when writing MA values to the temporary file (line 806 below):

NiMARE/nimare/utils.py

Lines 801 to 807 in 3675d96

for map_chunk in map_chunks:
    end_idx = idx + len(map_chunk)
    LGR.debug(f"Masking {idx}:{end_idx}/{masked_data.shape[0]}")
    map_chunk_data = masker.transform(map_chunk)
    LGR.debug(f"Saving {idx}:{end_idx}/{masked_data.shape[0]}")
    masked_data[idx:end_idx, :] = map_chunk_data
    idx = end_idx

Member Author

tsalo commented Feb 10, 2022

As proposed by @JulioAPeraza, setting a different TMPDIR environment variable fixes the problem, so it's specifically an issue with /tmp. I don't know if it's just an idiosyncrasy of the FIU HPC or if this will occur with other servers using the same OS, but I think I'm going to close this issue regardless.
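For readers who want to apply the same workaround from inside Python rather than in the job script, here is a minimal sketch. It assumes the code writing the temporary files goes through the standard `tempfile` module; whether every NiMARE code path does so is not confirmed in this thread. The `use_temp_root` helper is hypothetical, and the scratch directory is created with `mkdtemp` only to keep the example self-contained.

```python
import os
import tempfile


def use_temp_root(path):
    """Route this process's tempfile output to `path` (hypothetical helper)."""
    os.makedirs(path, exist_ok=True)
    os.environ["TMPDIR"] = path
    # tempfile caches the directory it picked on first use, so clear the
    # cache to make it re-read TMPDIR.
    tempfile.tempdir = None
    return tempfile.gettempdir()


# Stand-in for a scratch location such as a per-job directory under /scratch.
scratch_root = tempfile.mkdtemp(prefix="scratch_")
active_root = use_temp_root(scratch_root)

# New temporary files are now created under the chosen directory.
with tempfile.NamedTemporaryFile(suffix=".dat") as tmp:
    tmp_parent = os.path.dirname(tmp.name)

print(active_root, tmp_parent)
```

Exporting TMPDIR in the job script before Python starts has the same effect and avoids the cache reset.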

Member Author

tsalo commented Feb 10, 2022

After ~21 hours, my profiling jobs failed with the following error:

Traceback (most recent call last):
  File "profile_kerneltransformers.py", line 151, in <module>
    mem = memory_usage((meta.fit, (red_dset_ma,)))
  File "/home/tsalo006/nimare/joblib/conda_env/lib/python3.8/site-packages/memory_profiler.py", line 377, in memory_usage
    returned = f(*args, **kw)
  File "/home/tsalo006/nimare/joblib/conda_env/lib/python3.8/site-packages/nimare/base.py", line 314, in fit
    maps = self._fit(dataset)
  File "/home/tsalo006/nimare/joblib/conda_env/lib/python3.8/site-packages/nimare/utils.py", line 695, in memmap_context
    return function(self, *args, **kwargs)
  File "/home/tsalo006/nimare/joblib/conda_env/lib/python3.8/site-packages/nimare/meta/cbma/base.py", line 78, in _fit
    ma_values = self._collect_ma_maps(
  File "/home/tsalo006/nimare/joblib/conda_env/lib/python3.8/site-packages/nimare/meta/cbma/base.py", line 165, in _collect_ma_maps
    ma_maps = _safe_transform(
  File "/home/tsalo006/nimare/joblib/conda_env/lib/python3.8/site-packages/nimare/utils.py", line 784, in _safe_transform
    masked_data = np.memmap(
  File "/home/tsalo006/nimare/joblib/conda_env/lib/python3.8/site-packages/numpy/core/memmap.py", line 267, in __new__
    mm = mmap.mmap(fid.fileno(), bytes, access=acc, offset=start)
OSError: [Errno 24] Too many open files

I think that the issue is that NiMARE's memmaps are never closed, even though the associated files are deleted. I don't know if this is what was causing the bus error, but it seems like it could be related.
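The traceback shows `np.memmap` failing inside `mmap.mmap`, so the mechanism can be illustrated with the standard library alone. The sketch below is not NiMARE code: `make_mapping` and `count_open_fds` are hypothetical helpers, and it assumes a Unix system where `/proc/self/fd` or `/dev/fd` is available. It shows that a mapping keeps a file descriptor open after the file object is closed and the file is deleted, and that the descriptor is only released when the mapping itself is closed.

```python
import mmap
import os
import tempfile


def count_open_fds():
    """Count this process's open file descriptors (Unix-only)."""
    fd_dir = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"
    return len(os.listdir(fd_dir))


def make_mapping(size=4096):
    """Map a temporary file, then close and delete the file itself."""
    fd, path = tempfile.mkstemp(suffix=".dat")
    with os.fdopen(fd, "r+b") as handle:
        handle.truncate(size)
        mapping = mmap.mmap(handle.fileno(), size)
    os.remove(path)  # the file is gone, but the mapping is still alive
    return mapping


make_mapping().close()  # warm-up so one-time setup does not skew the counts

baseline = count_open_fds()
mappings = [make_mapping() for _ in range(5)]
while_mapped = count_open_fds()

for mapping in mappings:
    mapping.close()  # releases the descriptor held by each mapping
after_close = count_open_fds()

print(baseline, while_mapped, after_close)
```

With `np.memmap` there is no public close method, so the equivalent step is dropping every reference to the array (and to any views of it) so that the underlying mapping is deallocated.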

Member Author

tsalo commented Feb 16, 2022

I still get a "Too many open files" error after merging #597, though it takes longer to happen.
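Because the failure takes many hours to appear, it can help to log how close the process is to its descriptor limit as the job runs. The following is a sketch under the assumption of a Unix system; `fd_headroom` is a hypothetical helper, not part of NiMARE.

```python
import os
import resource


def fd_headroom():
    """Return (open_fds, soft_limit, remaining) for the current process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    fd_dir = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"
    open_fds = len(os.listdir(fd_dir))
    # An unlimited soft limit has no meaningful remainder.
    remaining = None if soft == resource.RLIM_INFINITY else soft - open_fds
    return open_fds, soft, remaining


open_fds, soft_limit, remaining = fd_headroom()
print(f"{open_fds} descriptors open, soft limit {soft_limit}, {remaining} left")
```

Calling this once per fitted Estimator and watching whether `open_fds` grows steadily would indicate whether descriptors are still leaking.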

Member Author

tsalo commented Feb 17, 2022

I also get the bus error when using /tmp, but it happens after 7 hours instead of 4.5. I'm guessing that the problem does come down to unclosed memmap files, and that #597 helped with that issue, but it didn't completely fix it. I'm stumped on how to identify the rest of the open files.
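One possible way to identify the remaining open files on Linux is to read the symlinks under `/proc/self/fd`, where descriptors pointing at unlinked files are reported with a " (deleted)" suffix. This is a Linux-only sketch with hypothetical helper names; it has not been run against the jobs described here, and network filesystems may report deleted files differently.

```python
import os
import tempfile


def list_open_files(pid="self"):
    """Map each open descriptor of a process to the target it points at."""
    fd_dir = f"/proc/{pid}/fd"
    targets = {}
    for name in os.listdir(fd_dir):
        try:
            targets[int(name)] = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed between listdir and readlink
    return targets


def deleted_but_open(targets):
    """Keep only descriptors whose file has already been unlinked."""
    return {fd: t for fd, t in targets.items() if t.endswith(" (deleted)")}


# Reproduce the suspected pattern: delete a file while a descriptor is open.
held_fd, held_path = tempfile.mkstemp(suffix=".dat")
os.remove(held_path)

stale = deleted_but_open(list_open_files())
for fd, target in sorted(stale.items()):
    print(fd, target)

os.close(held_fd)
```

Running `deleted_but_open(list_open_files())` periodically inside a long job would show which temporary files are still held open after deletion.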

tsalo reopened this Feb 21, 2022
tsalo changed the title from 'Bus errors in large meta-analyses with memory_limit and pre-generated MA maps' to 'Bus errors/"too many open files" errors in jobs with large numbers of Estimators using temporary files' Feb 21, 2022
tsalo moved this from Done to Todo in NiMARE/PyMARE memory management Feb 21, 2022
tsalo added the impact: low and priority: medium labels Feb 21, 2022
Member Author

tsalo commented May 26, 2022

Now that we're not using memmaps anymore, I think I can close this.

tsalo closed this as completed May 26, 2022
Repository owner moved this from Todo to Done in NiMARE/PyMARE memory management May 26, 2022