BatchSizeFinder limits number of validation batches for the whole training process #18834
Labels: bug, duplicate, tuner, ver: 1.8.x, ver: 2.0.x
Bug description
Using BatchSizeFinder seems to limit the number of validation batches to BatchSizeFinder._steps_per_trial. As a result, the validation set shrinks to a few dozen samples and the reported metrics are unreliable.
It seems this can be fixed by calling _reset_dataloaders one additional time.
What version are you seeing the problem on?
v1.8, v2.0
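To illustrate why one extra reset might help, here is a toy model of the suspected mechanism. It is only a sketch under an assumption: that the validation loop caches its batch count when data is set up, so restoring the limit after the search is not enough unless the data is set up again. `ToyValLoop`, `setup_data` and `toy_batch_size_search` are hypothetical names for this illustration, not Lightning's actual classes or functions.

```python
import math
from typing import Optional


class ToyValLoop:
    """Hypothetical stand-in for a validation loop; not Lightning's real class."""

    def __init__(self, num_samples: int, batch_size: int) -> None:
        self.num_samples = num_samples
        self.batch_size = batch_size
        self.limit_val_batches: Optional[int] = None  # None means "no limit"
        self.max_batches = 0  # cached by setup_data()

    def setup_data(self) -> None:
        # Cache how many batches to run, honouring the limit set right now.
        total = math.ceil(self.num_samples / self.batch_size)
        limit = self.limit_val_batches
        self.max_batches = total if limit is None else min(total, limit)


def toy_batch_size_search(loop: ToyValLoop, steps_per_trial: int, reset_after: bool) -> None:
    """Mimic a tuner that temporarily caps the number of validation batches."""
    saved = loop.limit_val_batches
    loop.limit_val_batches = steps_per_trial
    loop.setup_data()  # the trial caches the capped batch count
    loop.limit_val_batches = saved  # the setting itself is restored ...
    if reset_after:
        loop.setup_data()  # ... and the cache is refreshed (the extra reset)


loop = ToyValLoop(num_samples=123, batch_size=2)

toy_batch_size_search(loop, steps_per_trial=3, reset_after=False)
stale = loop.max_batches  # cap from the search is still cached

toy_batch_size_search(loop, steps_per_trial=3, reset_after=True)
fresh = loop.max_batches  # all validation batches are back

print(stale, fresh)  # prints "3 62"
```

In this toy model the restored setting alone does not undo the cap; only the second `setup_data()` call does, which matches the behaviour described above.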
How to reproduce the bug
Error messages and logs
Here is a log that shows the number of validated samples for each epoch.
Val ds size: 123, num epochs: 2, batch size: 2 (see code above)
As you can see, the first and last runs validated all 123 samples in both epochs, while the second run (with the default BatchSizeFinder) validated only 6 samples in each epoch. Here 6 = steps_per_trial * BATCH_SIZE = 3 * 2.
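The arithmetic behind those numbers can be checked with a small helper. This is a minimal sketch, assuming the validation loader keeps the last partial batch (no `drop_last`); `validated_samples` is a hypothetical function written for this illustration, not part of Lightning.

```python
from typing import Optional


def validated_samples(dataset_size: int, batch_size: int, max_batches: Optional[int] = None) -> int:
    """Count samples seen in one validation epoch, with an optional cap on batches."""
    full_batches, remainder = divmod(dataset_size, batch_size)
    total_batches = full_batches + (1 if remainder else 0)
    if max_batches is None or max_batches >= total_batches:
        return dataset_size  # every batch runs, including the last partial one
    return max_batches * batch_size  # only the first max_batches full batches run


print(validated_samples(123, 2))                  # 123: no cap, as in the first and last runs
print(validated_samples(123, 2, max_batches=3))   # 6: capped at steps_per_trial = 3
```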
Environment
Current environment
- GPU:
  - NVIDIA GeForce RTX 3050 Laptop GPU
- available: True
- version: 12.1
- lightning: 2.1.0
- lightning-utilities: 0.9.0
- pytorch-lightning: 2.1.0
- torch: 2.1.0
- torchmetrics: 1.2.0
- aiohttp: 3.8.6
- aiosignal: 1.3.1
- async-timeout: 4.0.3
- attrs: 23.1.0
- certifi: 2023.7.22
- charset-normalizer: 3.3.0
- filelock: 3.12.4
- frozenlist: 1.4.0
- fsspec: 2023.9.2
- idna: 3.4
- jinja2: 3.1.2
- lightning: 2.1.0
- lightning-utilities: 0.9.0
- markupsafe: 2.1.3
- mpmath: 1.3.0
- multidict: 6.0.4
- networkx: 3.1
- numpy: 1.24.4
- nvidia-cublas-cu12: 12.1.3.1
- nvidia-cuda-cupti-cu12: 12.1.105
- nvidia-cuda-nvrtc-cu12: 12.1.105
- nvidia-cuda-runtime-cu12: 12.1.105
- nvidia-cudnn-cu12: 8.9.2.26
- nvidia-cufft-cu12: 11.0.2.54
- nvidia-curand-cu12: 10.3.2.106
- nvidia-cusolver-cu12: 11.4.5.107
- nvidia-cusparse-cu12: 12.1.0.106
- nvidia-nccl-cu12: 2.18.1
- nvidia-nvjitlink-cu12: 12.2.140
- nvidia-nvtx-cu12: 12.1.105
- packaging: 23.2
- pip: 23.2.1
- pytorch-lightning: 2.1.0
- pyyaml: 6.0.1
- requests: 2.31.0
- setuptools: 68.1.2
- sympy: 1.12
- torch: 2.1.0
- torchmetrics: 1.2.0
- tqdm: 4.66.1
- triton: 2.1.0
- typing-extensions: 4.8.0
- urllib3: 2.0.7
- wheel: 0.41.2
- yarl: 1.9.2
- OS: Linux
- architecture:
  - 64bit
  - ELF
- processor: x86_64
- python: 3.8.18
- release: 5.15.0-83-generic
- version: #92-Ubuntu SMP Mon Aug 14 09:30:42 UTC 2023
More info
@tanaymeh, could you maybe add a related fix to #18826? It seems to touch the same parts of the code and would only require a few additional lines.