How to reproduce the bug
import matplotlib.pyplot as plt
import numpy as np
import torch
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger

# We use a standard Lightning model, where the logic for the training and validation steps is defined.
# Note: UNet, checkpoint_callback, train_loader and val_loader are defined elsewhere and are not shown here.

# Full Segmentation Model
class TumorSegmentation(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = UNet()
        self.optimizer = torch.optim.Adam(self.model.parameters(), lr=1e-4)
        self.loss_fn = torch.nn.BCEWithLogitsLoss()

    def forward(self, data):
        pred = self.model(data)
        return pred

    def training_step(self, batch, batch_idx):
        ct, mask = batch
        mask = mask.float()
        pred = self(ct)
        loss = self.loss_fn(pred, mask)
        # Logs
        self.log("Train Dice", loss)
        if batch_idx % 50 == 0:
            self.log_images(ct.cpu(), pred.cpu(), mask.cpu(), "Train")
        return loss

    def validation_step(self, batch, batch_idx):
        ct, mask = batch
        mask = mask.float()
        pred = self(ct)
        loss = self.loss_fn(pred, mask)
        # Logs
        self.log("Val Dice", loss)
        if batch_idx % 50 == 0:
            self.log_images(ct.cpu(), pred.cpu(), mask.cpu(), "Val")
        return loss

    def log_images(self, ct, pred, mask, name):
        results = []
        pred = pred > 0.5  # As we use the sigmoid activation function, we threshold at 0.5
        fig, axis = plt.subplots(1, 2)
        axis[0].imshow(ct[0][0], cmap="bone")
        mask_ = np.ma.masked_where(mask[0][0] == 0, mask[0][0])
        axis[0].imshow(mask_, alpha=0.6)
        axis[0].set_title("Ground Truth")
        axis[1].imshow(ct[0][0], cmap="bone")
        mask_ = np.ma.masked_where(pred[0][0] == 0, pred[0][0])
        axis[1].imshow(mask_, alpha=0.6, cmap="autumn")
        axis[1].set_title("Pred")
        self.logger.experiment.add_figure(f"{name} Prediction vs Label", fig, self.global_step)

    def configure_optimizers(self):
        # We always need to return a list here (just pack our optimizer into one :))
        return [self.optimizer]

# Model instantiation
model = TumorSegmentation()

# Create the trainer
trainer = pl.Trainer(
    accelerator='cpu',
    devices=24,
    logger=TensorBoardLogger(save_dir="/home/xxx/yyy/zzz/UNet/logs"),
    log_every_n_steps=1,
    callbacks=checkpoint_callback,
    max_epochs=30,
)

# model : (LightningModule) – Model to fit
trainer.fit(model, train_loader, val_loader)
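A diagnostic that might help narrow this down (it is not part of the original report): Python's standard `faulthandler` module can dump the Python-level traceback of a process that receives SIGSEGV. The sketch below assumes the worker processes inherit the parent's environment, and the log file name is hypothetical. It would go at the top of the script, before the trainer is created.

```python
import faulthandler
import os

# Assumption: worker processes inherit this environment variable, so a freshly
# started interpreter enables faulthandler on startup.
os.environ["PYTHONFAULTHANDLER"] = "1"

# Also enable it in the current process, writing to a log file.
# The file name is hypothetical; the file object must stay open for
# as long as faulthandler is enabled.
fault_log = open("segfault_traceback.log", "w")
faulthandler.enable(file=fault_log, all_threads=True)
```

If a worker then crashes, the dumped traceback should show which Python call was active at the time, which is more specific than the launcher-level message.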
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!
Bug description
Hello,
When I try to train my UNet, one of the worker processes terminates with SIGSEGV. It seems to happen quite randomly: sometimes the error concerns process 3, sometimes another one. Here is the message I get:
Training started at 2023-01-19 16:09:51.050569
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/24
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/24
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/24
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/24
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/24
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/24
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/24
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/24
Initializing distributed: GLOBAL_RANK: 8, MEMBER: 9/24
Initializing distributed: GLOBAL_RANK: 9, MEMBER: 10/24
Initializing distributed: GLOBAL_RANK: 10, MEMBER: 11/24
Initializing distributed: GLOBAL_RANK: 11, MEMBER: 12/24
Initializing distributed: GLOBAL_RANK: 12, MEMBER: 13/24
Initializing distributed: GLOBAL_RANK: 13, MEMBER: 14/24
Initializing distributed: GLOBAL_RANK: 14, MEMBER: 15/24
Initializing distributed: GLOBAL_RANK: 15, MEMBER: 16/24
Initializing distributed: GLOBAL_RANK: 16, MEMBER: 17/24
Initializing distributed: GLOBAL_RANK: 17, MEMBER: 18/24
Initializing distributed: GLOBAL_RANK: 18, MEMBER: 19/24
Initializing distributed: GLOBAL_RANK: 19, MEMBER: 20/24
Initializing distributed: GLOBAL_RANK: 20, MEMBER: 21/24
Initializing distributed: GLOBAL_RANK: 21, MEMBER: 22/24
Initializing distributed: GLOBAL_RANK: 22, MEMBER: 23/24
Initializing distributed: GLOBAL_RANK: 23, MEMBER: 24/24
distributed_backend=gloo
All distributed processes registered. Starting with 24 processes
ProcessExitedException Traceback (most recent call last)
Input In [58], in <cell line: 5>()
3 print('Training started at', start)
4 # model : (LightningModule) – Model to fit
----> 5 trainer.fit(model, train_loader, val_loader )
6 print('Training duration:', datetime.now() - start)
File ~/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py:608, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
606 raise TypeError(f"`Trainer.fit()` requires a `LightningModule`, got: {model.__class__.__qualname__}")
607 self.strategy._lightning_module = model
--> 608 call._call_and_handle_interrupt(
609 self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
610 )
File ~/anaconda3/lib/python3.8/site-packages/pytorch_lightning/trainer/call.py:36, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
34 try:
35 if trainer.strategy.launcher is not None:
---> 36 return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
37 else:
38 return trainer_fn(*args, **kwargs)
File ~/anaconda3/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/multiprocessing.py:113, in _MultiProcessingLauncher.launch(self, function, trainer, *args, **kwargs)
110 else:
111 process_args = [trainer, function, args, kwargs, return_queue]
--> 113 mp.start_processes(
114 self._wrapping_function,
115 args=process_args,
116 nprocs=self._strategy.num_processes,
117 start_method=self._start_method,
118 )
119 worker_output = return_queue.get()
120 if trainer is None:
File ~/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py:198, in start_processes(fn, args, nprocs, join, daemon, start_method)
195 return context
197 # Loop on join until it returns True or raises an exception.
--> 198 while not context.join():
199 pass
File ~/anaconda3/lib/python3.8/site-packages/torch/multiprocessing/spawn.py:140, in ProcessContext.join(self, timeout)
138 if exitcode < 0:
139 name = signal.Signals(-exitcode).name
--> 140 raise ProcessExitedException(
141 "process %d terminated with signal %s" %
142 (error_index, name),
143 error_index=error_index,
144 error_pid=failed_process.pid,
145 exit_code=exitcode,
146 signal_name=name
147 )
148 else:
149 raise ProcessExitedException(
150 "process %d terminated with exit code %d" %
151 (error_index, exitcode),
(...)
154 exit_code=exitcode
155 )
ProcessExitedException: process 11 terminated with signal SIGSEGV
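For readers unfamiliar with the last frames of the traceback: the launcher reports a signal name because the worker's exit code was negative. Below is a minimal sketch of that mapping using only the standard library. The helper name `describe_exit` is hypothetical and is not part of torch.

```python
import signal


def describe_exit(error_index: int, exitcode: int) -> str:
    """Build a message in the same format as the traceback (hypothetical helper)."""
    if exitcode < 0:
        # A negative exit code means the process was killed by signal number -exitcode.
        name = signal.Signals(-exitcode).name
        return "process %d terminated with signal %s" % (error_index, name)
    return "process %d terminated with exit code %d" % (error_index, exitcode)


print(describe_exit(11, -signal.SIGSEGV))
```

The process index in the message (11 here, 3 in other runs) is the rank of the worker that died, not the signal number.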
- CUDA:
- GPU: None
- available: False
- version: 11.7
- lightning-utilities: 0.5.0
- pytorch-lightning: 1.9.0
- torch: 1.13.1
- torchmetrics: 0.11.0
- absl-py: 1.4.0
- aiohttp: 3.8.3
- aiosignal: 1.3.1
- alabaster: 0.7.12
- alembic: 1.9.2
- anaconda-client: 1.7.2
- anaconda-navigator: 1.10.0
- anaconda-project: 0.8.3
- appdirs: 1.4.4
- argh: 0.26.2
- argon2-cffi: 20.1.0
- asn1crypto: 1.4.0
- astroid: 2.4.2
- astropy: 4.0.2
- asttokens: 2.0.5
- astunparse: 1.6.3
- async-generator: 1.10
- async-timeout: 4.0.2
- atomicwrites: 1.4.0
- attrs: 20.3.0
- autopep8: 1.5.4
- babel: 2.8.1
- backcall: 0.2.0
- backports.functools-lru-cache: 1.6.4
- backports.shutil-get-terminal-size: 1.0.0
- backports.tempfile: 1.0
- backports.weakref: 1.0.post1
- banal: 1.0.6
- beautifulsoup4: 4.9.3
- bio: 1.3.5
- biopython: 1.79
- bitarray: 1.6.1
- bkcharts: 0.2
- bleach: 3.2.1
- bokeh: 2.2.3
- boto: 2.49.0
- bottleneck: 1.3.2
- brotlipy: 0.7.0
- cachetools: 5.2.1
- celluloid: 0.2.0
- certifi: 2020.6.20
- cffi: 1.14.3
- chardet: 3.0.4
- charset-normalizer: 2.1.1
- click: 7.1.2
- clldutils: 3.11.1
- cloudpickle: 1.6.0
- clyent: 1.2.2
- colorama: 0.4.4
- coloredlogs: 15.0.1
- colorlog: 6.6.0
- colormath: 3.0.0
- commonmark: 0.9.1
- conda: 22.9.0
- conda-build: 3.20.5
- conda-package-handling: 1.7.3
- conda-verify: 3.4.2
- contextlib2: 0.6.0.post1
- contourpy: 1.0.7
- cryptography: 3.1.1
- csvw: 2.0.0
- cycler: 0.10.0
- cython: 0.29.21
- cytoolz: 0.11.0
- dask: 2.30.0
- debugpy: 1.6.0
- decorator: 4.4.2
- deeptools: 3.5.1
- deeptoolsintervals: 0.1.9
- defusedxml: 0.6.0
- diff-match-patch: 20200713
- distributed: 2.30.1
- docutils: 0.16
- efficientnet: 1.0.0
- entrypoints: 0.3
- et-xmlfile: 1.0.1
- executing: 0.8.3
- fastcache: 1.1.0
- ffmpeg-python: 0.2.0
- filelock: 3.0.12
- flake8: 3.8.4
- flask: 1.1.2
- flatbuffers: 23.1.4
- fonttools: 4.38.0
- frozenlist: 1.3.3
- fsspec: 2022.11.0
- future: 0.18.2
- gast: 0.4.0
- gevent: 20.9.0
- glob2: 0.7
- gmpy2: 2.0.8
- google-auth: 2.16.0
- google-auth-oauthlib: 0.4.6
- google-pasta: 0.2.0
- graphviz: 0.8.4
- greenlet: 0.4.17
- grpcio: 1.51.1
- gtf2csv: 0.2
- h5py: 2.10.0
- heapdict: 1.0.1
- html5lib: 1.1
- humanfriendly: 10.0
- idna: 2.10
- igv-reports: 1.0.4
- image-classifiers: 1.0.0
- imageio: 2.9.0
- imagesize: 1.2.0
- imgaug: 0.4.0
- importlib-metadata: 4.8.2
- importlib-resources: 5.10.2
- iniconfig: 1.1.1
- intervaltree: 3.1.0
- ipykernel: 6.13.0
- ipympl: 0.9.2
- ipython: 8.3.0
- ipython-genutils: 0.2.0
- ipywidgets: 8.0.4
- isort: 5.6.4
- itsdangerous: 1.1.0
- jdcal: 1.4.1
- jedi: 0.17.1
- jeepney: 0.5.0
- jinja2: 2.11.2
- joblib: 0.17.0
- json5: 0.9.5
- jsonschema: 3.2.0
- jupyter: 1.0.0
- jupyter-client: 7.3.0
- jupyter-console: 6.2.0
- jupyter-core: 4.10.0
- jupyterlab: 2.2.6
- jupyterlab-pygments: 0.1.2
- jupyterlab-server: 1.2.0
- jupyterlab-widgets: 3.0.5
- kaleido: 0.0.3
- keras: 2.11.0
- keras-applications: 1.0.8
- keras-preprocessing: 1.1.2
- keyring: 21.4.0
- kiwisolver: 1.3.0
- latexcodec: 2.0.1
- lazy-object-proxy: 1.4.3
- libarchive-c: 2.9
- libclang: 15.0.6.1
- lightning-utilities: 0.5.0
- lingpy: 2.6.9
- llvmlite: 0.34.0
- locket: 0.2.0
- lxml: 4.6.1
- lzstring: 1.0.4
- mako: 1.2.4
- mamba: 0.7.3
- markdown: 3.3.6
- markupsafe: 1.1.1
- matplotlib: 3.6.3
- matplotlib-inline: 0.1.3
- mccabe: 0.6.1
- mistune: 0.8.4
- mkl-fft: 1.2.0
- mkl-random: 1.1.1
- mkl-service: 2.3.0
- mock: 4.0.2
- more-itertools: 8.6.0
- mpmath: 1.1.0
- msgpack: 1.0.0
- multidict: 6.0.4
- multipledispatch: 0.6.0
- multiqc: 1.11
- mxnet-mkl: 1.6.0
- navigator-updater: 0.2.1
- nbclient: 0.5.1
- nbconvert: 6.0.7
- nbformat: 5.0.8
- ncbi-genome-download: 0.3.1
- neo4j: 4.4.1
- nest-asyncio: 1.4.2
- networkx: 2.6.3
- nibabel: 5.0.0
- nltk: 3.5
- nodejs: 0.1.1
- nose: 1.3.7
- notebook: 6.1.4
- numba: 0.51.2
- numexpr: 2.7.1
- numpy: 1.23.1
- numpydoc: 1.1.0
- nvidia-cublas-cu11: 11.10.3.66
- nvidia-cuda-nvrtc-cu11: 11.7.99
- nvidia-cuda-runtime-cu11: 11.7.99
- nvidia-cudnn-cu11: 8.5.0.96
- oauthlib: 3.2.2
- olefile: 0.46
- opencv-python: 4.7.0.68
- openpyxl: 3.0.5
- opt-einsum: 3.3.0
- optional-django: 0.1.0
- packaging: 20.4
- pandas: 1.1.3
- pandocfilters: 1.4.3
- parso: 0.7.0
- partd: 1.1.0
- path: 15.0.0
- pathlib2: 2.3.5
- pathtools: 0.1.2
- patsy: 0.5.1
- pep8: 1.7.1
- pexpect: 4.8.0
- pickleshare: 0.7.5
- pillow: 8.0.1
- pip: 20.2.4
- pkginfo: 1.6.1
- plotly: 5.9.0
- pluggy: 0.13.1
- ply: 3.11
- prometheus-client: 0.8.0
- prompt-toolkit: 3.0.8
- protobuf: 3.19.6
- psutil: 5.7.2
- ptyprocess: 0.6.0
- pure-eval: 0.2.2
- py: 1.9.0
- py2bit: 0.3.0
- pyasn1: 0.4.8
- pyasn1-modules: 0.2.8
- pybigwig: 0.3.17
- pybtex: 0.24.0
- pycldf: 1.25.1
- pycodestyle: 2.6.0
- pycosat: 0.6.3
- pycparser: 2.20
- pycurl: 7.43.0.6
- pydicom: 2.3.0
- pydocstyle: 5.1.1
- pyflakes: 2.2.0
- pygments: 2.7.2
- pylint: 2.6.0
- pyodbc: 4.0.0-unsupported
- pyopenssl: 19.1.0
- pyparsing: 2.4.7
- pyrsistent: 0.17.3
- pysam: 0.16.0.1
- pysocks: 1.7.1
- pytest: 0.0.0
- python-dateutil: 2.8.2
- python-jsonrpc-server: 0.4.0
- python-language-server: 0.35.1
- pytorch-lightning: 1.9.0
- pytz: 2020.1
- pywavelets: 1.1.1
- pyxdg: 0.27
- pyyaml: 6.0
- pyzmq: 22.3.0
- qdarkstyle: 2.8.1
- qtawesome: 1.0.1
- qtconsole: 4.7.7
- qtpy: 1.9.0
- regex: 2020.10.15
- requests: 2.24.0
- requests-oauthlib: 1.3.1
- retrying: 1.3.3
- rfc3986: 1.5.0
- rich: 10.16.0
- rope: 0.18.0
- rsa: 4.9
- rtree: 0.9.4
- ruamel-yaml: 0.15.87
- scikit-image: 0.17.2
- scikit-learn: 0.23.2
- scipy: 1.5.2
- seaborn: 0.11.0
- secretstorage: 3.1.2
- segmentation-models: 1.0.1
- send2trash: 1.5.0
- setuptools: 50.3.1.post20201107
- shapely: 2.0.0
- simplegeneric: 0.8.1
- simplejson: 3.17.6
- singledispatch: 3.4.0.3
- sip: 4.19.13
- six: 1.16.0
- snowballstemmer: 2.0.0
- sortedcollections: 1.2.1
- sortedcontainers: 2.2.2
- soupsieve: 2.0.1
- spectra: 0.0.11
- sphinx: 3.2.1
- sphinxcontrib-applehelp: 1.0.2
- sphinxcontrib-devhelp: 1.0.2
- sphinxcontrib-htmlhelp: 1.0.3
- sphinxcontrib-jsmath: 1.0.1
- sphinxcontrib-qthelp: 1.0.3
- sphinxcontrib-serializinghtml: 1.1.4
- sphinxcontrib-websupport: 1.2.4
- spyder: 4.1.5
- spyder-kernels: 1.9.4
- sqlalchemy: 1.3.20
- stack-data: 0.2.0
- statsmodels: 0.12.0
- sympy: 1.6.2
- tables: 3.6.1
- tabulate: 0.8.9
- tblib: 1.7.0
- tenacity: 8.0.1
- tensorboard: 2.11.2
- tensorboard-data-server: 0.6.1
- tensorboard-plugin-wit: 1.8.1
- tensorflow: 2.11.0
- tensorflow-estimator: 2.11.0
- tensorflow-io-gcs-filesystem: 0.29.0
- termcolor: 2.2.0
- terminado: 0.9.1
- testpath: 0.4.4
- threadpoolctl: 2.1.0
- tifffile: 2020.10.1
- toml: 0.10.1
- toolz: 0.11.1
- torch: 1.13.1
- torchmetrics: 0.11.0
- tornado: 5.1.1
- tqdm: 4.63.1
- traitlets: 5.1.1
- typing-extensions: 4.4.0
- ujson: 4.0.1
- unicodecsv: 0.14.1
- uritemplate: 4.1.1
- urllib3: 1.25.11
- watchdog: 0.10.3
- wcwidth: 0.2.5
- webencodings: 0.5.1
- werkzeug: 1.0.1
- wheel: 0.35.1
- widgetsnbextension: 4.0.5
- wrapt: 1.11.2
- wurlitzer: 2.0.1
- xlrd: 1.2.0
- xlsxwriter: 1.3.7
- xlwt: 1.3.0
- xmltodict: 0.12.0
- yapf: 0.30.0
- yarl: 1.8.2
- zict: 2.0.0
- zipp: 3.4.0
- zope.event: 4.5.0
- zope.interface: 5.1.2
- OS: Linux
- architecture:
- 64bit
- ELF
- processor: x86_64
- python: 3.8.5
- version: #45~20.04.1-Ubuntu SMP Mon Apr 4 09:38:31 UTC 2022