SIGSEGV message when training UNet #16475

NJeanray · 2023-01-23T09:03:35Z

Bug description

Hello,

When I try to train my UNet, I get the following message:

Training started at 2023-01-19 16:09:51.050569

Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/24
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/24
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/24
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/24
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/24
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/24
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/24
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/24
Initializing distributed: GLOBAL_RANK: 8, MEMBER: 9/24
Initializing distributed: GLOBAL_RANK: 9, MEMBER: 10/24
Initializing distributed: GLOBAL_RANK: 10, MEMBER: 11/24
Initializing distributed: GLOBAL_RANK: 11, MEMBER: 12/24
Initializing distributed: GLOBAL_RANK: 12, MEMBER: 13/24
Initializing distributed: GLOBAL_RANK: 13, MEMBER: 14/24
Initializing distributed: GLOBAL_RANK: 14, MEMBER: 15/24
Initializing distributed: GLOBAL_RANK: 15, MEMBER: 16/24
Initializing distributed: GLOBAL_RANK: 16, MEMBER: 17/24
Initializing distributed: GLOBAL_RANK: 17, MEMBER: 18/24
Initializing distributed: GLOBAL_RANK: 18, MEMBER: 19/24
Initializing distributed: GLOBAL_RANK: 19, MEMBER: 20/24
Initializing distributed: GLOBAL_RANK: 20, MEMBER: 21/24
Initializing distributed: GLOBAL_RANK: 21, MEMBER: 22/24
Initializing distributed: GLOBAL_RANK: 22, MEMBER: 23/24
Initializing distributed: GLOBAL_RANK: 23, MEMBER: 24/24
----------------------------------------------------------------------------------------------------
distributed_backend=gloo
All distributed processes registered. Starting with 24 processes
----------------------------------------------------------------------------------------------------

---------------------------------------------------------------------------
ProcessExitedException                    Traceback (most recent call last)
Input In [58], in <cell line: 5>()
      3 print('Training started at', start)
      4 # model : (LightningModule) – Model to fit
----> 5 trainer.fit(model, train_loader, val_loader )
      6 print('Training duration:', datetime.now() - start)

File ~/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:608, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
    606     raise TypeError(f"`Trainer.fit()` requires a `LightningModule`, got: {model.__class__.__qualname__}")
    607 self.strategy._lightning_module = model
--> 608 call._call_and_handle_interrupt(
    609     self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
    610 )

File ~/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py:36, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
     34 try:
     35     if trainer.strategy.launcher is not None:
---> 36         return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
     37     else:
     38         return trainer_fn(*args, **kwargs)

File ~/anaconda3/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py:113, in _MultiProcessingLauncher.launch(self, function, trainer, *args, **kwargs)
    110 else:
    111     process_args = [trainer, function, args, kwargs, return_queue]
--> 113 mp.start_processes(
    114     self._wrapping_function,
    115     args=process_args,
    116     nprocs=self._strategy.num_processes,
    117     start_method=self._start_method,
    118 )
    119 worker_output = return_queue.get()
    120 if trainer is None:

File ~/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py:198, in start_processes(fn, args, nprocs, join, daemon, start_method)
    195     return context
    197 # Loop on join until it returns True or raises an exception.
--> 198 while not context.join():
    199     pass

File ~/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py:140, in ProcessContext.join(self, timeout)
    138 if exitcode < 0:
    139     name = signal.Signals(-exitcode).name
--> 140     raise ProcessExitedException(
    141         "process %d terminated with signal %s" %
    142         (error_index, name),
    143         error_index=error_index,
    144         error_pid=failed_process.pid,
    145         exit_code=exitcode,
    146         signal_name=name
    147     )
    148 else:
    149     raise ProcessExitedException(
    150         "process %d terminated with exit code %d" %
    151         (error_index, exitcode),
   (...)
    154         exit_code=exitcode
    155     )

ProcessExitedException: process 11 terminated with signal SIGSEGV

The crash seems random: the process that fails changes from run to run (here it is process 11, but sometimes it is process 3 or another one).
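For context, the last frame of the traceback derives the signal name from a negative exit code. The sketch below reproduces that mapping with the standard library only. It is an illustration, not Lightning or PyTorch code: `describe_exit`, `run_crashing_child` and `CHILD_CODE` are hypothetical names, and the child process deliberately sends SIGSEGV to itself to stand in for a crashing worker (POSIX only).

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a crashing worker: the child sends SIGSEGV to itself.
CHILD_CODE = "import os, signal; os.kill(os.getpid(), signal.SIGSEGV)"


def describe_exit(returncode: int) -> str:
    """Turn a process return code into a readable description."""
    if returncode < 0:
        # On POSIX, a return code of -N means the process was killed by signal N.
        return f"terminated with signal {signal.Signals(-returncode).name}"
    return f"terminated with exit code {returncode}"


def run_crashing_child() -> int:
    """Run the self-crashing child and return its return code (POSIX only)."""
    return subprocess.run([sys.executable, "-c", CHILD_CODE]).returncode


if __name__ == "__main__":
    print(describe_exit(run_crashing_child()))
```

The message only says that a worker died from a segmentation fault. It does not say which Python or native call triggered it, so the rank number alone is not enough to locate the cause.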

How to reproduce the bug

import matplotlib.pyplot as plt
import numpy as np
import torch
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger


# We use a standard Lightning model, where the logic for the training and validation steps is defined.
# Full Segmentation Model
class TumorSegmentation(pl.LightningModule):
    def __init__(self):
        super().__init__()
        
        self.model = UNet()
        
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=1e-4)
        self.loss_fn = torch.nn.BCEWithLogitsLoss()
    
    def forward(self, data):
        pred = self.model(data)
        return pred
    
    def training_step(self, batch, batch_idx):
        ct, mask = batch
        mask = mask.float()
        
        pred = self(ct)
        loss = self.loss_fn(pred, mask)
        
        # Logs
        self.log("Train Dice", loss)
        if batch_idx % 50 == 0:
            self.log_images(ct.cpu(), pred.cpu(), mask.cpu(), "Train")
        return loss
    
        
    def validation_step(self, batch, batch_idx):
        ct, mask = batch
        mask = mask.float()

        pred = self(ct)
        loss = self.loss_fn(pred, mask)
        
        # Logs
        self.log("Val Dice", loss)
        if batch_idx % 50 == 0:
            self.log_images(ct.cpu(), pred.cpu(), mask.cpu(), "Val")
        
        return loss

    
    def log_images(self, ct, pred, mask, name):
        
        results = []
        
        # The loss is BCEWithLogitsLoss, so pred holds raw logits: apply the sigmoid, then threshold at 0.5
        pred = torch.sigmoid(pred) > 0.5
        
        
        fig, axis = plt.subplots(1, 2)
        axis[0].imshow(ct[0][0], cmap="bone")
        mask_ = np.ma.masked_where(mask[0][0]==0, mask[0][0])
        axis[0].imshow(mask_, alpha=0.6)
        axis[0].set_title("Ground Truth")
        
        axis[1].imshow(ct[0][0], cmap="bone")
        mask_ = np.ma.masked_where(pred[0][0]==0, pred[0][0])
        axis[1].imshow(mask_, alpha=0.6, cmap="autumn")
        axis[1].set_title("Pred")

        self.logger.experiment.add_figure(f"{name} Prediction vs Label", fig, self.global_step)

            
    
    def configure_optimizers(self):
        # Return the optimizer created in __init__ (Lightning accepts a single optimizer or a list)
        return [self.optimizer]



# Model instantiation
model = TumorSegmentation()

# Create the trainer
# Note: UNet, checkpoint_callback, train_loader and val_loader are not defined in this snippet.
trainer = pl.Trainer(
    accelerator="cpu",
    devices=24,
    logger=TensorBoardLogger(save_dir="/home/xxx/yyy/zzz/UNet/logs"),
    log_every_n_steps=1,
    callbacks=checkpoint_callback,
    max_epochs=30,
)

# model : (LightningModule) – Model to fit
trainer.fit(model, train_loader, val_loader)
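One possible way to narrow down where a worker crashes is Python's standard `faulthandler` module, which dumps the Python traceback of a process when it receives a fatal signal such as SIGSEGV. The sketch below is a diagnostic aid, not a fix. `enable_crash_tracebacks` is a hypothetical helper name, and it is an assumption that the dump would point at the faulty call in this particular case.

```python
import faulthandler
import os
import sys


def enable_crash_tracebacks() -> None:
    """Ask Python to dump a traceback when a fatal signal such as SIGSEGV arrives."""
    # Environment variables are inherited by child processes, so workers that
    # start a fresh interpreter (start method "spawn") enable faulthandler too.
    os.environ["PYTHONFAULTHANDLER"] = "1"
    # Enable it in the current process as well; forked workers inherit this state.
    # sys.__stderr__ is preferred because a notebook may replace sys.stderr with
    # an object that has no real file descriptor.
    stream = sys.__stderr__ if sys.__stderr__ is not None else sys.stderr
    faulthandler.enable(file=stream, all_threads=True)


enable_crash_tracebacks()
# Then start training as before, e.g. trainer.fit(model, train_loader, val_loader)
```

When run from a notebook, the dump is written to the original standard error stream, which typically ends up in the terminal that started Jupyter rather than in the cell output.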




### Environment

I'm running this script locally on Ubuntu (see the details below).

<details>
<summary>Current environment</summary>

CUDA:
- GPU: None
- available: False
- version: 11.7
Lightning:
- lightning-utilities: 0.5.0
- pytorch-lightning: 1.9.0
- torch: 1.13.1
- torchmetrics: 0.11.0
Packages:
- absl-py: 1.4.0
- aiohttp: 3.8.3
- aiosignal: 1.3.1
- alabaster: 0.7.12
- alembic: 1.9.2
- anaconda-client: 1.7.2
- anaconda-navigator: 1.10.0
- anaconda-project: 0.8.3
- appdirs: 1.4.4
- argh: 0.26.2
- argon2-cffi: 20.1.0
- asn1crypto: 1.4.0
- astroid: 2.4.2
- astropy: 4.0.2
- asttokens: 2.0.5
- astunparse: 1.6.3
- async-generator: 1.10
- async-timeout: 4.0.2
- atomicwrites: 1.4.0
- attrs: 20.3.0
- autopep8: 1.5.4
- babel: 2.8.1
- backcall: 0.2.0
- backports.functools-lru-cache: 1.6.4
- backports.shutil-get-terminal-size: 1.0.0
- backports.tempfile: 1.0
- backports.weakref: 1.0.post1
- banal: 1.0.6
- beautifulsoup4: 4.9.3
- bio: 1.3.5
- biopython: 1.79
- bitarray: 1.6.1
- bkcharts: 0.2
- bleach: 3.2.1
- bokeh: 2.2.3
- boto: 2.49.0
- bottleneck: 1.3.2
- brotlipy: 0.7.0
- cachetools: 5.2.1
- celluloid: 0.2.0
- certifi: 2020.6.20
- cffi: 1.14.3
- chardet: 3.0.4
- charset-normalizer: 2.1.1
- click: 7.1.2
- clldutils: 3.11.1
- cloudpickle: 1.6.0
- clyent: 1.2.2
- colorama: 0.4.4
- coloredlogs: 15.0.1
- colorlog: 6.6.0
- colormath: 3.0.0
- commonmark: 0.9.1
- conda: 22.9.0
- conda-build: 3.20.5
- conda-package-handling: 1.7.3
- conda-verify: 3.4.2
- contextlib2: 0.6.0.post1
- contourpy: 1.0.7
- cryptography: 3.1.1
- csvw: 2.0.0
- cycler: 0.10.0
- cython: 0.29.21
- cytoolz: 0.11.0
- dask: 2.30.0
- debugpy: 1.6.0
- decorator: 4.4.2
- deeptools: 3.5.1
- deeptoolsintervals: 0.1.9
- defusedxml: 0.6.0
- diff-match-patch: 20200713
- distributed: 2.30.1
- docutils: 0.16
- efficientnet: 1.0.0
- entrypoints: 0.3
- et-xmlfile: 1.0.1
- executing: 0.8.3
- fastcache: 1.1.0
- ffmpeg-python: 0.2.0
- filelock: 3.0.12
- flake8: 3.8.4
- flask: 1.1.2
- flatbuffers: 23.1.4
- fonttools: 4.38.0
- frozenlist: 1.3.3
- fsspec: 2022.11.0
- future: 0.18.2
- gast: 0.4.0
- gevent: 20.9.0
- glob2: 0.7
- gmpy2: 2.0.8
- google-auth: 2.16.0
- google-auth-oauthlib: 0.4.6
- google-pasta: 0.2.0
- graphviz: 0.8.4
- greenlet: 0.4.17
- grpcio: 1.51.1
- gtf2csv: 0.2
- h5py: 2.10.0
- heapdict: 1.0.1
- html5lib: 1.1
- humanfriendly: 10.0
- idna: 2.10
- igv-reports: 1.0.4
- image-classifiers: 1.0.0
- imageio: 2.9.0
- imagesize: 1.2.0
- imgaug: 0.4.0
- importlib-metadata: 4.8.2
- importlib-resources: 5.10.2
- iniconfig: 1.1.1
- intervaltree: 3.1.0
- ipykernel: 6.13.0
- ipympl: 0.9.2
- ipython: 8.3.0
- ipython-genutils: 0.2.0
- ipywidgets: 8.0.4
- isort: 5.6.4
- itsdangerous: 1.1.0
- jdcal: 1.4.1
- jedi: 0.17.1
- jeepney: 0.5.0
- jinja2: 2.11.2
- joblib: 0.17.0
- json5: 0.9.5
- jsonschema: 3.2.0
- jupyter: 1.0.0
- jupyter-client: 7.3.0
- jupyter-console: 6.2.0
- jupyter-core: 4.10.0
- jupyterlab: 2.2.6
- jupyterlab-pygments: 0.1.2
- jupyterlab-server: 1.2.0
- jupyterlab-widgets: 3.0.5
- kaleido: 0.0.3
- keras: 2.11.0
- keras-applications: 1.0.8
- keras-preprocessing: 1.1.2
- keyring: 21.4.0
- kiwisolver: 1.3.0
- latexcodec: 2.0.1
- lazy-object-proxy: 1.4.3
- libarchive-c: 2.9
- libclang: 15.0.6.1
- lightning-utilities: 0.5.0
- lingpy: 2.6.9
- llvmlite: 0.34.0
- locket: 0.2.0
- lxml: 4.6.1
- lzstring: 1.0.4
- mako: 1.2.4
- mamba: 0.7.3
- markdown: 3.3.6
- markupsafe: 1.1.1
- matplotlib: 3.6.3
- matplotlib-inline: 0.1.3
- mccabe: 0.6.1
- mistune: 0.8.4
- mkl-fft: 1.2.0
- mkl-random: 1.1.1
- mkl-service: 2.3.0
- mock: 4.0.2
- more-itertools: 8.6.0
- mpmath: 1.1.0
- msgpack: 1.0.0
- multidict: 6.0.4
- multipledispatch: 0.6.0
- multiqc: 1.11
- mxnet-mkl: 1.6.0
- navigator-updater: 0.2.1
- nbclient: 0.5.1
- nbconvert: 6.0.7
- nbformat: 5.0.8
- ncbi-genome-download: 0.3.1
- neo4j: 4.4.1
- nest-asyncio: 1.4.2
- networkx: 2.6.3
- nibabel: 5.0.0
- nltk: 3.5
- nodejs: 0.1.1
- nose: 1.3.7
- notebook: 6.1.4
- numba: 0.51.2
- numexpr: 2.7.1
- numpy: 1.23.1
- numpydoc: 1.1.0
- nvidia-cublas-cu11: 11.10.3.66
- nvidia-cuda-nvrtc-cu11: 11.7.99
- nvidia-cuda-runtime-cu11: 11.7.99
- nvidia-cudnn-cu11: 8.5.0.96
- oauthlib: 3.2.2
- olefile: 0.46
- opencv-python: 4.7.0.68
- openpyxl: 3.0.5
- opt-einsum: 3.3.0
- optional-django: 0.1.0
- packaging: 20.4
- pandas: 1.1.3
- pandocfilters: 1.4.3
- parso: 0.7.0
- partd: 1.1.0
- path: 15.0.0
- pathlib2: 2.3.5
- pathtools: 0.1.2
- patsy: 0.5.1
- pep8: 1.7.1
- pexpect: 4.8.0
- pickleshare: 0.7.5
- pillow: 8.0.1
- pip: 20.2.4
- pkginfo: 1.6.1
- plotly: 5.9.0
- pluggy: 0.13.1
- ply: 3.11
- prometheus-client: 0.8.0
- prompt-toolkit: 3.0.8
- protobuf: 3.19.6
- psutil: 5.7.2
- ptyprocess: 0.6.0
- pure-eval: 0.2.2
- py: 1.9.0
- py2bit: 0.3.0
- pyasn1: 0.4.8
- pyasn1-modules: 0.2.8
- pybigwig: 0.3.17
- pybtex: 0.24.0
- pycldf: 1.25.1
- pycodestyle: 2.6.0
- pycosat: 0.6.3
- pycparser: 2.20
- pycurl: 7.43.0.6
- pydicom: 2.3.0
- pydocstyle: 5.1.1
- pyflakes: 2.2.0
- pygments: 2.7.2
- pylint: 2.6.0
- pyodbc: 4.0.0-unsupported
- pyopenssl: 19.1.0
- pyparsing: 2.4.7
- pyrsistent: 0.17.3
- pysam: 0.16.0.1
- pysocks: 1.7.1
- pytest: 0.0.0
- python-dateutil: 2.8.2
- python-jsonrpc-server: 0.4.0
- python-language-server: 0.35.1
- pytorch-lightning: 1.9.0
- pytz: 2020.1
- pywavelets: 1.1.1
- pyxdg: 0.27
- pyyaml: 6.0
- pyzmq: 22.3.0
- qdarkstyle: 2.8.1
- qtawesome: 1.0.1
- qtconsole: 4.7.7
- qtpy: 1.9.0
- regex: 2020.10.15
- requests: 2.24.0
- requests-oauthlib: 1.3.1
- retrying: 1.3.3
- rfc3986: 1.5.0
- rich: 10.16.0
- rope: 0.18.0
- rsa: 4.9
- rtree: 0.9.4
- ruamel-yaml: 0.15.87
- scikit-image: 0.17.2
- scikit-learn: 0.23.2
- scipy: 1.5.2
- seaborn: 0.11.0
- secretstorage: 3.1.2
- segmentation-models: 1.0.1
- send2trash: 1.5.0
- setuptools: 50.3.1.post20201107
- shapely: 2.0.0
- simplegeneric: 0.8.1
- simplejson: 3.17.6
- singledispatch: 3.4.0.3
- sip: 4.19.13
- six: 1.16.0
- snowballstemmer: 2.0.0
- sortedcollections: 1.2.1
- sortedcontainers: 2.2.2
- soupsieve: 2.0.1
- spectra: 0.0.11
- sphinx: 3.2.1
- sphinxcontrib-applehelp: 1.0.2
- sphinxcontrib-devhelp: 1.0.2
- sphinxcontrib-htmlhelp: 1.0.3
- sphinxcontrib-jsmath: 1.0.1
- sphinxcontrib-qthelp: 1.0.3
- sphinxcontrib-serializinghtml: 1.1.4
- sphinxcontrib-websupport: 1.2.4
- spyder: 4.1.5
- spyder-kernels: 1.9.4
- sqlalchemy: 1.3.20
- stack-data: 0.2.0
- statsmodels: 0.12.0
- sympy: 1.6.2
- tables: 3.6.1
- tabulate: 0.8.9
- tblib: 1.7.0
- tenacity: 8.0.1
- tensorboard: 2.11.2
- tensorboard-data-server: 0.6.1
- tensorboard-plugin-wit: 1.8.1
- tensorflow: 2.11.0
- tensorflow-estimator: 2.11.0
- tensorflow-io-gcs-filesystem: 0.29.0
- termcolor: 2.2.0
- terminado: 0.9.1
- testpath: 0.4.4
- threadpoolctl: 2.1.0
- tifffile: 2020.10.1
- toml: 0.10.1
- toolz: 0.11.1
- torch: 1.13.1
- torchmetrics: 0.11.0
- tornado: 5.1.1
- tqdm: 4.63.1
- traitlets: 5.1.1
- typing-extensions: 4.4.0
- ujson: 4.0.1
- unicodecsv: 0.14.1
- uritemplate: 4.1.1
- urllib3: 1.25.11
- watchdog: 0.10.3
- wcwidth: 0.2.5
- webencodings: 0.5.1
- werkzeug: 1.0.1
- wheel: 0.35.1
- widgetsnbextension: 4.0.5
- wrapt: 1.11.2
- wurlitzer: 2.0.1
- xlrd: 1.2.0
- xlsxwriter: 1.3.7
- xlwt: 1.3.0
- xmltodict: 0.12.0
- yapf: 0.30.0
- yarl: 1.8.2
- zict: 2.0.0
- zipp: 3.4.0
- zope.event: 4.5.0
- zope.interface: 5.1.2
System:
- OS: Linux
- architecture: 64bit, ELF
- processor: x86_64
- python: 3.8.5
- version: #45~20.04.1-Ubuntu SMP Mon Apr 4 09:38:31 UTC 2022


</details>


### More info

_No response_


stale · 2023-03-19T23:52:49Z

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!

NJeanray added the "needs triage" label Jan 23, 2023

stale bot added the "won't fix" label Mar 19, 2023

stale bot closed this as completed Apr 16, 2023
