Too many open files error #11201

Closed · whucdf opened this issue Sep 3, 2018 · 17 comments

@whucdf commented Sep 3, 2018

Issue description

While using the DataLoader from PyTorch 0.4.1 with num_workers > 0, the workers store the tensors in shared memory but do not release the shared-memory file handles after returning the tensors to the main process, even though the handles are no longer needed. The process then runs out of file handles if the tensors are stored in a list.

Code example


import torch
from torch.utils.data import Dataset

class testSet(Dataset):
    def __init__(self):
        super(testSet, self).__init__()

    def __len__(self):
        return 1000000

    def __getitem__(self, index):
        return {"index": index}

test_data = testSet()
test_data_loader = torch.utils.data.DataLoader(dataset=test_data, batch_size=1, num_workers=1)
index = []
for sample in test_data_loader:
    index.append(sample['index'])

The error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-5-cf6ed576bc1c> in <module>()
----> 1 for sample in test_data_loader:
      2     #print(sample['index'])
      3     index.append(sample['index'])

~/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py in __next__(self)
    328         while True:
    329             assert (not self.shutdown and self.batches_outstanding > 0)
--> 330             idx, batch = self._get_batch()
    331             self.batches_outstanding -= 1
    332             if idx != self.rcvd_idx:

~/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/utils/data/dataloader.py in _get_batch(self)
    307                 raise RuntimeError('DataLoader timed out after {} seconds'.format(self.timeout))
    308         else:
--> 309             return self.data_queue.get()
    310 
    311     def __next__(self):

~/anaconda3/envs/pytorch/lib/python3.6/multiprocessing/queues.py in get(self)
    335             res = self._reader.recv_bytes()
    336         # unserialize the data after having released the lock
--> 337         return _ForkingPickler.loads(res)
    338 
    339     def put(self, obj):

~/anaconda3/envs/pytorch/lib/python3.6/site-packages/torch/multiprocessing/reductions.py in rebuild_storage_fd(cls, df, size)
    149         fd = multiprocessing.reduction.rebuild_handle(df)
    150     else:
--> 151         fd = df.detach()
    152     try:
    153         storage = storage_from_cache(cls, fd_id(fd))

~/anaconda3/envs/pytorch/lib/python3.6/multiprocessing/resource_sharer.py in detach(self)
     56             '''Get the fd.  This should only be called once.'''
     57             with _resource_sharer.get_connection(self._id) as conn:
---> 58                 return reduction.recv_handle(conn)
     59 
     60 

~/anaconda3/envs/pytorch/lib/python3.6/multiprocessing/reduction.py in recv_handle(conn)
    180         '''Receive a handle over a local connection.'''
    181         with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
--> 182             return recvfds(s, 1)[0]
    183 
    184     def DupFd(fd):

~/anaconda3/envs/pytorch/lib/python3.6/multiprocessing/reduction.py in recvfds(sock, size)
    159             if len(ancdata) != 1:
    160                 raise RuntimeError('received %d items of ancdata' %
--> 161                                    len(ancdata))
    162             cmsg_level, cmsg_type, cmsg_data = ancdata[0]
    163             if (cmsg_level == socket.SOL_SOCKET and

RuntimeError: received 0 items of ancdata

System Info

  • OS: Ubuntu 16.04
  • PyTorch version: 0.4.1
@weiyangfb (Contributor)

@whucdf Thanks for reporting this issue. It is expected because the default file_descriptor share strategy uses file descriptors as shared memory handles, and this will hit the limit when there are too many batches at DataLoader. To get around this, you can switch to file_system strategy by adding this to your script.

import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')

Let me know if there is still any issue.
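
For reference, the strategies available on the current platform and the one currently in effect can be inspected before and after switching; a minimal sketch using the documented torch.multiprocessing helpers:

import torch.multiprocessing

# 'file_descriptor' is the default on Linux; macOS typically only offers 'file_system'.
print(torch.multiprocessing.get_all_sharing_strategies())
print(torch.multiprocessing.get_sharing_strategy())

torch.multiprocessing.set_sharing_strategy('file_system')
assert torch.multiprocessing.get_sharing_strategy() == 'file_system'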

@weiyangfb (Contributor)

Closing this now; please feel free to reopen it if needed.

@zimenglan-sysu-512

Hi @weiyangfb, thanks for your help, it does solve the problem.
By the way, will it slow down the training speed?

@cyzanfar commented Mar 9, 2019

Hey!
I am still getting the same "too many open files" error.
Running on CPU on macOS.

traceback:

ERROR:root:Internal Python error in the inspect module.
Below is the traceback from this internal error.

Traceback (most recent call last):
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2963, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-6-cc88ea5f8bd3>", line 2, in <module>
    num_epochs=25)
  File "<ipython-input-3-c38b0d739ba0>", line 23, in train_model
    for inputs, labels in dataloaders[phase]:
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 819, in __iter__
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 545, in __init__
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/multiprocessing/context.py", line 102, in Queue
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/multiprocessing/queues.py", line 42, in __init__
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/multiprocessing/context.py", line 67, in Lock
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/multiprocessing/synchronize.py", line 163, in __init__
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/multiprocessing/synchronize.py", line 60, in __init__
OSError: [Errno 24] Too many open files

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 1863, in showtraceback
    stb = value._render_traceback_()
AttributeError: 'OSError' object has no attribute '_render_traceback_'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/site-packages/IPython/core/ultratb.py", line 1095, in get_records
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/site-packages/IPython/core/ultratb.py", line 311, in wrapped
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/site-packages/IPython/core/ultratb.py", line 345, in _fixed_getinnerframes
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/inspect.py", line 1483, in getinnerframes
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/inspect.py", line 1441, in getframeinfo
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/inspect.py", line 696, in getsourcefile
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/inspect.py", line 725, in getmodule
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/inspect.py", line 709, in getabsfile
  File "/Users/cyrusghazanfar/anaconda3/lib/python3.6/posixpath.py", line 376, in abspath
OSError: [Errno 24] Too many open files

I did include the proper configurations:

import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')

thanks
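
If the error persists with the file_system strategy, the per-process open-file limit itself may be the bottleneck (macOS defaults to a fairly low soft limit). A minimal sketch for raising it from Python with the standard resource module, assuming the hard limit allows it:

import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# Raise the soft limit; the new value must not exceed the hard limit.
target = 4096 if hard == resource.RLIM_INFINITY else min(4096, hard)
resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))

The same effect can usually be achieved from the shell with ulimit -n before launching the script.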

@Beastmaster

Please use a deep copy when appending DataLoader output to a list. Take @whucdf's code as an example:

test_data = testSet()
test_data_loader = torch.utils.data.DataLoader(dataset=test_data, batch_size=1, num_workers=1)
index = []
for sample in test_data_loader:
    index.append(sample['index'])

The index list keeps references to the DataLoader output, so the connections among the multiprocessing processes cannot be closed. A deep copy is useful in this scenario:

import copy

test_data = testSet()
test_data_loader = torch.utils.data.DataLoader(dataset=test_data, batch_size=1, num_workers=1)
index = []
for sample in test_data_loader:
    sample_cp = copy.deepcopy(sample)
    del sample
    index.append(sample_cp['index'])
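
If only the values are needed, as with the integer indices in this example, converting the batch tensors to plain Python objects has the same effect as the deep copy, since no reference to the worker's shared-memory tensor is kept. A minimal sketch:

index = []
for sample in test_data_loader:
    # .tolist() copies the values out of the shared-memory tensor into
    # ordinary Python ints, so the tensor and its file handle can be released.
    index.extend(sample['index'].tolist())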

@soulslicer

@whucdf Thanks for reporting this issue. It is expected because the default file_descriptor share strategy uses file descriptors as shared memory handles, and this will hit the limit when there are too many batches at DataLoader. To get around this, you can switch to file_system strategy by adding this to your script.

import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')

Let me know if there is still any issue.

I get the error

torch_shm_manager: error while loading shared libraries: libcudart.so.10.0: cannot open shared object file: No such file or directory

kakusikun added a commit to kakusikun/deep-learning-works that referenced this issue May 7, 2020
p-patil referenced this issue in p-patil/continual-learning Oct 27, 2020
Set PyTorch's shared memory strategy to "file_system", which uses file
names to identify shared memory regions, rather than the default
"file_descriptors", which uses file descriptors as shared memory
handles. This fixes the problem of exceeding the system-wide limit on
the number of open files a process can have. See
https://github.com/pytorch/pytorch/issues/11201#issuecomment-421146936
and
https://pytorch.org/docs/master/multiprocessing.html?highlight=sharing%20strategy#sharing-strategies.
p-patil referenced this issue in p-patil/continual-learning Oct 29, 2020
@brando90 commented Feb 28, 2021

@whucdf Thanks for reporting this issue. It is expected because the default file_descriptor share strategy uses file descriptors as shared memory handles, and this will hit the limit when there are too many batches at DataLoader. To get around this, you can switch to file_system strategy by adding this to your script.

import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')

Let me know if there is still any issue.

Is this supposed to be run by the main process (the one doing mp.spawn), or should EVERY process run it inside its run function?

Thanks!

ref: https://pytorch.org/docs/stable/multiprocessing.html#file-descriptor-file-descriptor

https://discuss.pytorch.org/t/how-does-one-setp-up-the-set-sharing-strategy-strategy-for-multiprocessing/113302
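
As discussed further down the thread, the sharing strategy appears to be per-process state, so a conservative approach is to set it both in the parent and at the top of the function each spawned process runs. A minimal sketch (run and its arguments are placeholders, not code from this thread):

import torch.multiprocessing as mp

def run(rank, world_size):
    # Per-process setting: do it first thing in every spawned process.
    mp.set_sharing_strategy('file_system')
    ...  # set up process group, DataLoader, training loop

if __name__ == '__main__':
    mp.set_sharing_strategy('file_system')
    world_size = 4
    mp.spawn(run, args=(world_size,), nprocs=world_size)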

@FarisHijazi

I applied

import torch.multiprocessing
torch.multiprocessing.set_sharing_strategy('file_system')

yet I am still getting the same error.

@schuhschuh commented Aug 9, 2021

For anyone else seeing this error even after setting torch.multiprocessing.set_sharing_strategy('file_system') in their main process: the DataLoader worker processes apparently do not inherit this setting. I had to use a worker_init_fn such as:

import torch.multiprocessing
from torch.utils.data import DataLoader

sharing_strategy = "file_system"
torch.multiprocessing.set_sharing_strategy(sharing_strategy)

def set_worker_sharing_strategy(worker_id: int) -> None:
    torch.multiprocessing.set_sharing_strategy(sharing_strategy)

loader = DataLoader(dataset, num_workers=4, worker_init_fn=set_worker_sharing_strategy)

This finally fixed it for me.

@brando90 This relates to your earlier question. I could confirm that the strategy is not set to the same strategy as in the main process by printing the value of torch.multiprocessing.get_sharing_strategy() in worker_init_fn.
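
The check described above can be reproduced with a worker_init_fn that only prints the strategy each worker sees; a minimal sketch (debug_worker_sharing_strategy and dataset are illustrative names):

import torch.multiprocessing
from torch.utils.data import DataLoader

def debug_worker_sharing_strategy(worker_id: int) -> None:
    # If this prints 'file_descriptor' while the parent already switched to
    # 'file_system', the workers did not inherit the setting.
    print(worker_id, torch.multiprocessing.get_sharing_strategy())

loader = DataLoader(dataset, num_workers=4, worker_init_fn=debug_worker_sharing_strategy)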

@mdabbah commented Aug 24, 2021

@schuhschuh did your solution require you to change the setup() function? (I'm assuming you are doing distributed training/inference.)

my current setup function looks like this

def setup(rank, world_size, port):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = f'{port}'

    # initialize the process group
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

Does using the file_system sharing strategy mean that I must change
dist.init_process_group("nccl", rank=rank, world_size=world_size) to something like
dist.init_process_group("nccl", init_method="file::/~/somefile", rank=rank, world_size=world_size)?

Thanks!

@basilevh commented Jul 19, 2022

My issue is that even with torch.multiprocessing.set_sharing_strategy('file_system'), after some time (typically in the second half of training), my job crashes with RuntimeError: unable to open shared memory object </torch_2283204_110829360> in read-write mode. This is much more likely to happen whenever I'm training more than one model in parallel on different GPUs. I verified that there is more than enough RAM and disk space available. Is there any other fix? Thank you.

@Xonxt commented Jul 23, 2022

On a slightly related note, in my training script, if I don't use the set_sharing_strategy('file_system'), I also get the "too many open files" error.

But if I add it, then it all runs fine, but at the very end of my script, all the processes just hang and never terminate. Even if I add a torch.distributed.barrier() or a torch.distributed.destroy_process_group().

@LemurPwned

I experience the same issue with the latest macOS nightly build. I am able to chew through a couple of epochs, but at some point the number of open file descriptors becomes too large -- they are simply not being closed properly. set_sharing_strategy is not helping at all.

My dataset returns a dictionary with 3 keys: two float tensors and one string.

class PhysicsDataset(Dataset):
    def __init__(self, data_dir, transform=None):
        super().__init__()
        self.data_dir = data_dir
        self.transform = transform
        self.gt_spectra = list(self.data_dir.glob("*.npz"))
        self.gt_parameters = json.load(
            open(self.data_dir / "all_params.json", 'r'))

    def __len__(self):
        return len(self.gt_spectra)

    def __getitem__(self, index):
        with np.load(self.gt_spectra[index]) as data:
            pdata = data['spectrum']
        pdata = (pdata - pdata.min()) / (pdata.max() - pdata.min())
        pdata = torch.from_numpy(pdata).float()
        parameters = self.gt_parameters[self.gt_spectra[index].name.replace(
            ".npz", "")]
        if self.transform:
            pdata = self.transform(pdata)

        # create output tensor with normalised weights
        gt_tensor = torch.from_numpy(
            np.asarray([(parameters[k] - KEYS[k]['min']) /
                        (KEYS[k]['max'] - KEYS[k]['min'])
                        for k in KEYS])).float()
        return {
            "spectrum": pdata,
            "gt_tensor": gt_tensor,
            "filename": self.gt_spectra[index].name
        }

Any idea why the fds are not closed after each epoch terminates? I suspect this may be due to the np.load in __getitem__, but I have no idea how to fix that.
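
One way to narrow this down is to count the descriptors the process holds after each epoch and see whether the number keeps growing; a minimal sketch (num_epochs and loader are placeholders), assuming /dev/fd is available, which it is on both Linux and macOS:

import os

def open_fd_count() -> int:
    # Each entry in /dev/fd is one descriptor currently open in this process.
    return len(os.listdir('/dev/fd'))

for epoch in range(num_epochs):
    for batch in loader:
        pass  # training step
    print(f"epoch {epoch}: {open_fd_count()} open file descriptors")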

@xiyanghu

On a slightly related note, in my training script, if I don't use the set_sharing_strategy('file_system'), I also get the "too many open files" error.

But if I add it, then it all runs fine, but at the very end of my script, all the processes just hang and never terminate. Even if I add a torch.distributed.barrier() or a torch.distributed.destroy_process_group().

same here. Have you figured out how to solve it? Thank you!

@Xonxt commented Jul 29, 2022

same here. Have you figured out how to solve it? Thank you!

Not sure how relevant this will be for you. In my case, I have my training dataset in a JSON format (one that we've developed internally at our institute), similar to the COCO format. The dataset is opened through a wrapper class that provides an API for reading it, again similar to COCO.

In my earlier attempts at distributed training, each process ended up opening the same JSON file on its own and trying to read annotations from it with a bunch of workers (num_workers=16).

Something like this, basically:

dataset = JSONDataset("/datasets/coco/annotations/train.json")
train_data = torch.utils.data.Dataset(dataset, ...)
train_loader = torch.utils.data.dataloader.DataLoader(train_data, num_workers=16, ...)

Instead, I made sure to first parse the entire dataset, read the full list of image files and the corresponding labels, and only then pass the list of files and labels to the torch.utils.data.Dataset object, so the workers would only read the image files and not try to share the same JSON file.

And then I don't touch the set_sharing_strategy function at all, just leaving it at the default value, and just put a destroy_process_group() at the end of the application.
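
A minimal sketch of that pattern, with the parsing done once in the main process and only plain lists handed to the Dataset (ImageListDataset, load_image and parse_annotations are illustrative names, not code from this thread):

import json
from torch.utils.data import Dataset

class ImageListDataset(Dataset):
    def __init__(self, image_paths, labels, transform=None):
        # Only plain Python lists are stored, so the workers never share a
        # handle to the parsed annotation file.
        self.image_paths = image_paths
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image = load_image(self.image_paths[idx])  # e.g. PIL.Image.open(...).convert("RGB")
        if self.transform:
            image = self.transform(image)
        return image, self.labels[idx]

with open("/datasets/coco/annotations/train.json") as f:
    annotations = json.load(f)
image_paths, labels = parse_annotations(annotations)  # assumed helper
train_data = ImageListDataset(image_paths, labels)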

geometrikal added a commit to microfossil/particle-object-detection that referenced this issue Nov 30, 2022
verystrongjoe added a commit to verystrongjoe/wafer_aug_rl that referenced this issue Dec 8, 2022
julienroyd added a commit to recursionpharma/gflownet that referenced this issue Mar 24, 2023
chanshing added a commit to OxWearables/stepcount that referenced this issue Aug 17, 2023
chanshing added a commit to OxWearables/stepcount that referenced this issue Aug 29, 2023
thorstenwagner added a commit to MPI-Dortmund/tomotwin-cryoet that referenced this issue Oct 17, 2023
sammlapp added a commit to kitzeslab/opensoundscape that referenced this issue May 20, 2024
automatically sets `torch.multiprocessing.set_sharing_strategy("file_system")` during opensoundscape import. We may want to revisit this decision, but it seems that this is the recommended setting for avoiding issues seen when using parallelized DataLoader

see discussion and recommended solution here pytorch/pytorch#11201 (comment)
luis-real pushed a commit to intel/ai-reference-models that referenced this issue Aug 2, 2024
This commit adds native PyTorch xpu support for yolov5 sample,
i.e. IPEX is not needed in this mode. XPU backend in PyTorch is
under active development and is not finished yet. Focus is on functional
side of key things and performance is expected to be low. Future
improvements should bring it up.

As of now this mode of operation is experimental in the sample
and is not default, use `--ipex no` to enable.

* Can be run as: ./run_model.sh --ipex no
* Tried on:
  * pytorch: 91d565da0c5 ("[dynamo] Add support for tensor's is_complex method")
  * vision: 89d2b38cbc ("Updated compatibility table")
* Status:
  * --jit script|none: fail on autocast
  * --jit trace works x16 times slower vs. IPEX (13 img/s vs. 208 img/s),
    likely some operations are done on CPU since blitter is loaded + seeing
    this warning:

torch/utils/data/_utils/pin_memory.py:58: UserWarning: Aten Op fallback from XPU to CPU happends. This may have performance implications. If need debug the fallback
ops please set environment variable `PYTORCH_DEBUG_XPU_FALLBACK=1`  (Triggered internally at /home/dvrogozh/git/pytorch/third_party/torch-xpu-ops/src/aten/XPUFallback.cpp:11.)

/home/dvrogozh/git/pytorch/torch/nn/functional.py:2103: UserWarning: The operator 'aten::silu.outon the XPU backend and will fall back to run on the CPU. (Triggered internally at /home/dvrogozh/git/pytorch/third_party/torch-xpu-ops/src/aten/XPUFallback.cpp:16.)
  return torch._C._nn.silu_(input)
/home/dvrogozh/git/pytorch/torch/nn/functional.py:4045: UserWarning: The operator 'aten::upsample_nearest2d.outon the XPU backend and will fall back to run on the CPU. (Triggered internally at /home/dvrogozh/git/pytorch/third_party/torch-xpu-ops/src/aten/XPUFallback.cpp:16.)
  return torch._C._nn.upsample_nearest2d(input, output_size, scale_factors)
/home/dvrogozh/git/frameworks.ai.models.intel-models/models_v2/pytorch/yolov5/inference/gpu/yolov5/models/common.py:303: UserWarning: The operator 'aten::cat.outon the XPU backend and will fall back to run on the CPU. (Triggered internally at /home/dvrogozh/git/pytorch/third_party/torch-xpu-ops/src/aten/XPUFallback.cpp:16.)
  return torch.cat(x, self.d)
/home/dvrogozh/git/frameworks.ai.models.intel-models/models_v2/pytorch/yolov5/inference/gpu/yolov5/models/common.py:158: UserWarning: The operator 'aten::cat.outon the XPU backend and will fall back to run on the CPU. (Triggered internally at /home/dvrogozh/git/pytorch/third_party/torch-xpu-ops/src/aten/XPUFallback.cpp:16.)
  return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), 1))
/home/dvrogozh/git/frameworks.ai.models.intel-models/models_v2/pytorch/yolov5/inference/gpu/yolov5/models/yolo.py:66: UserWarning: The operator 'aten::sigmoid.outon the XPU backend and will fall back to run on the CPU. (Triggered internally at /home/dvrogozh/git/pytorch/third_party/torch-xpu-ops/src/aten/XPUFallback.cpp:16.)
  y = x[i].sigmoid()
/home/dvrogozh/git/pytorch/torch/_tensor.py:40: UserWarning: The operator 'aten::pow.Tensor_Scalar_outon the XPU backend and will fall back to run on the CPU. (Triggered internally at /home/dvrogozh/git/pytorch/third_party/torch-xpu-ops/src/aten/XPUFallback.cpp:16.)
  return f(*args, **kwargs)
/home/dvrogozh/git/frameworks.ai.models.intel-models/models_v2/pytorch/yolov5/inference/gpu/yolov5/models/yolo.py:77: UserWarning: The operator 'aten::cat.outon the XPU backend and will fall back to run on the CPU. (Triggered internally at /home/dvrogozh/git/pytorch/third_party/torch-xpu-ops/src/aten/XPUFallback.cpp:16.)
  return x if self.training else (torch.cat(z, 1),) if self.export else (torch.cat(z, 1), x)
/home/dvrogozh/git/frameworks.ai.models.intel-models/models_v2/pytorch/yolov5/inference/gpu/yolov5/utils/general.py:834: UserWarning: The operator 'aten::gt.Scalar_outon the XPU backend and will fall back to run on the CPU. (Triggered internally at /home/dvrogozh/git/pytorch/third_party/torch-xpu-ops/src/aten/XPUFallback.cpp:16.)
  xc = prediction[..., 4] > conf_thres  # candidates
/home/dvrogozh/git/frameworks.ai.models.intel-models/models_v2/pytorch/yolov5/inference/gpu/yolov5/utils/general.py:854: UserWarning: The operator 'aten::nonzeroon the XPU backend and will fall back to run on the CPU. (Triggered internally at /home/dvrogozh/git/pytorch/third_party/torch-xpu-ops/src/aten/XPUFallback.cpp:16.)
  x = x[xc[xi]]  # confidence
/home/dvrogozh/git/frameworks.ai.models.intel-models/models_v2/pytorch/yolov5/inference/gpu/yolov5/utils/general.py:854: UserWarning: The operator 'aten::index.Tensor_outon the XPU backend and will fall back to run on the CPU. (Triggered internally at /home/dvrogozh/git/pytorch/third_party/torch-xpu-ops/src/aten/XPUFallback.cpp:16.)
  x = x[xc[xi]]  # confidence

See: pytorch/pytorch#11201
See: pytorch/pytorch#114723

Signed-off-by: Dmitry Rogozhkin <[email protected]>
luis-real pushed a commit to intel/ai-reference-models that referenced this issue Aug 2, 2024
luis-real pushed a commit to intel/ai-reference-models that referenced this issue Aug 2, 2024
This commit adds native PyTorch xpu support for efficientnet sample,
i.e. IPEX is not needed in this mode. XPU backend in PyTorch is
under active development and is not finished yet. Focus is on functional
side of key things and performance is expected to be low. Future
improvements should bring it up.

As of now this mode of operation is experimental in the sample
and is not default, use `--ipex yes` to enable.

Commit also switches enet sample to torch variant of multiprocessing
module and uses set_sharing_strategy('file_system') to avoid too many open
files error on dataloader.

* Can be run as: ./run_model.sh --ipex no
* Tried on:
  * pytorch: 4e66aaa0109 ("update kineto submodel commit id...")
  * vision: 96640af090 ("add float support to...")
* Status:
  * --jit script|none: fail on autocast
  * --jit trace works x30 times slower vs. IPEX (5 img/s vs. 150 img/s),
    likely some operations are done on CPU since blitter is loaded + seeing
    this warning:

torch/utils/data/_utils/pin_memory.py:58: UserWarning: Aten Op fallback from XPU to CPU happends. This may have performance implications. If need debug the fallbac
k ops please set environment variable `PYTORCH_DEBUG_XPU_FALLBACK=1`  (Triggered internally at /home/dvrogozh/git/pytorch/third_party/torch-xpu-ops/src/aten/XPUFallback.cpp:11.)

/home/dvrogozh/git/pytorch/torch/nn/functional.py:2511: UserWarning: The operator 'aten::native_batch_normon the XPU backend and will fall back to run on the CPU. (Triggered internally at /home/dvrogozh/git/pytorch/third_party/torch-xpu-ops/src/aten/XPUFallback.cpp:16.)
  return torch.batch_norm(
/home/dvrogozh/git/pytorch/torch/nn/functional.py:2103: UserWarning: The operator 'aten::silu.outon the XPU backend and will fall back to run on the CPU. (Triggered internally at /home/dvrogozh/git/pytorch/third_party/torch-xpu-ops/src/aten/XPUFallback.cpp:16.)
  return torch._C._nn.silu_(input)
/home/dvrogozh/git/pytorch/torch/nn/functional.py:1260: UserWarning: The operator 'aten::_adaptive_avg_pool2don the XPU backend and will fall back to run on the CPU. (Triggered internally at /home/dvrogozh/git/pytorch/third_party/torch-xpu-ops/src/aten/XPUFallback.cpp:16.)
  return torch._C._nn.adaptive_avg_pool2d(input, _output_size)
/home/dvrogozh/git/pytorch/torch/nn/modules/activation.py:292: UserWarning: The operator 'aten::sigmoid.outon the XPU backend and will fall back to run on the CPU. (Triggered internally at /home/dvrogozh/git/pytorch/third_party/torch-xpu-ops/src/aten/XPUFallback.cpp:16.)
  return torch.sigmoid(input)

See: pytorch/pytorch#11201
See: pytorch/pytorch#114723
Signed-off-by: Dmitry Rogozhkin <[email protected]>
luis-real pushed a commit to intel/ai-reference-models that referenced this issue Aug 2, 2024
luis-real pushed a commit to intel/ai-reference-models that referenced this issue Aug 2, 2024
[Large squashed merge commit ("Refactor DLRMv1 to models_v2 format (#2170)" and many unrelated changes); the part relevant to this issue switches the enet/fbnet samples to the torch variant of the multiprocessing module and uses set_sharing_strategy('file_system') to avoid the too many open files error in the DataLoader. See: pytorch/pytorch#11201, pytorch/pytorch#114723.]