Stuck in trainer.fit() #127

resurgo97 · 2021-11-29T19:12:02Z

python ./openspeech_cli/hydra_train.py dataset=librispeech dataset.dataset_download=False dataset.dataset_path=/home/ubuntu/TEST/libri/ dataset.manifest_file_path=/home/ubuntu/TEST/libri/LibriSpeech/libri_subword_manifest.txt tokenizer=libri_subword model=conformer audio=fbank lr_scheduler=warmup_reduce_lr_on_plateau trainer=gpu criterion=ctc tokenizer.vocab_path=/home/ubuntu/TEST/libri/LibriSpeech trainer.logger=None
/home/ubuntu/anaconda3/envs/openspeech/envs/openspeech2/lib/python3.7/site-packages/hydra/core/default_element.py:126: UserWarning: In 'train': Usage of deprecated keyword in package header '# @Package group'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
See {url} for more information"""
audio:
  name: fbank
  sample_rate: 16000
  frame_length: 20.0
  frame_shift: 10.0
  del_silence: false
  num_mels: 80
  apply_spec_augment: true
  apply_noise_augment: false
  apply_time_stretch_augment: false
  apply_joining_augment: false
augment:
  apply_spec_augment: false
  apply_noise_augment: false
  apply_joining_augment: false
  apply_time_stretch_augment: false
  freq_mask_para: 27
  freq_mask_num: 2
  time_mask_num: 4
  noise_dataset_dir: None
  noise_level: 0.7
  time_stretch_min_rate: 0.7
  time_stretch_max_rate: 1.4
dataset:
  dataset: librispeech
  dataset_path: /home/ubuntu/TEST/libri/
  dataset_download: false
  manifest_file_path: /home/ubuntu/TEST/libri/LibriSpeech/libri_subword_manifest.txt
criterion:
  criterion_name: ctc
  reduction: mean
  zero_infinity: true
lr_scheduler:
  lr: 0.0001
  scheduler_name: warmup_reduce_lr_on_plateau
  lr_patience: 1
  lr_factor: 0.3
  peak_lr: 0.0001
  init_lr: 1.0e-10
  warmup_steps: 4000
model:
  model_name: conformer
  encoder_dim: 512
  num_encoder_layers: 17
  num_attention_heads: 8
  feed_forward_expansion_factor: 4
  conv_expansion_factor: 2
  input_dropout_p: 0.1
  feed_forward_dropout_p: 0.1
  attention_dropout_p: 0.1
  conv_dropout_p: 0.1
  conv_kernel_size: 31
  half_step_residual: true
  optimizer: adam
trainer:
  seed: 1
  accelerator: dp
  accumulate_grad_batches: 1
  num_workers: 4
  batch_size: 32
  check_val_every_n_epoch: 1
  gradient_clip_val: 5.0
  logger: None
  max_epochs: 20
  save_checkpoint_n_steps: 10000
  auto_scale_batch_size: binsearch
  sampler: smart
  name: gpu
  device: gpu
  use_cuda: true
  auto_select_gpus: true
tokenizer:
  sos_token:
  eos_token:
  pad_token:
  blank_token:
  encoding: utf-8
  unit: libri_subword
  vocab_size: 5000
  vocab_path: /home/ubuntu/TEST/libri/LibriSpeech

Global seed set to 1
[2021-11-30 04:00:34,822][openspeech.utils][INFO] - audio:
  name: fbank
  sample_rate: 16000
  frame_length: 20.0
  frame_shift: 10.0
  del_silence: false
  num_mels: 80
  apply_spec_augment: true
  apply_noise_augment: false
  apply_time_stretch_augment: false
  apply_joining_augment: false
augment:
  apply_spec_augment: false
  apply_noise_augment: false
  apply_joining_augment: false
  apply_time_stretch_augment: false
  freq_mask_para: 27
  freq_mask_num: 2
  time_mask_num: 4
  noise_dataset_dir: None
  noise_level: 0.7
  time_stretch_min_rate: 0.7
  time_stretch_max_rate: 1.4
dataset:
  dataset: librispeech
  dataset_path: /home/ubuntu/TEST/libri/
  dataset_download: false
  manifest_file_path: /home/ubuntu/TEST/libri/LibriSpeech/libri_subword_manifest.txt
criterion:
  criterion_name: ctc
  reduction: mean
  zero_infinity: true
lr_scheduler:
  lr: 0.0001
  scheduler_name: warmup_reduce_lr_on_plateau
  lr_patience: 1
  lr_factor: 0.3
  peak_lr: 0.0001
  init_lr: 1.0e-10
  warmup_steps: 4000
model:
  model_name: conformer
  encoder_dim: 512
  num_encoder_layers: 17
  num_attention_heads: 8
  feed_forward_expansion_factor: 4
  conv_expansion_factor: 2
  input_dropout_p: 0.1
  feed_forward_dropout_p: 0.1
  attention_dropout_p: 0.1
  conv_dropout_p: 0.1
  conv_kernel_size: 31
  half_step_residual: true
  optimizer: adam
trainer:
  seed: 1
  accelerator: dp
  accumulate_grad_batches: 1
  num_workers: 4
  batch_size: 32
  check_val_every_n_epoch: 1
  gradient_clip_val: 5.0
  logger: None
  max_epochs: 20
  save_checkpoint_n_steps: 10000
  auto_scale_batch_size: binsearch
  sampler: smart
  name: gpu
  device: gpu
  use_cuda: true
  auto_select_gpus: true
tokenizer:
  sos_token:
  eos_token:
  pad_token:
  blank_token:
  encoding: utf-8
  unit: libri_subword
  vocab_size: 5000
  vocab_path: /home/ubuntu/TEST/libri/LibriSpeech

[2021-11-30 04:00:34,841][openspeech.utils][INFO] - Operating System : Linux 4.4.0-176-generic
[2021-11-30 04:00:34,841][openspeech.utils][INFO] - Processor : x86_64
[2021-11-30 04:00:34,844][openspeech.utils][INFO] - device : Tesla V100-SXM2-32GB
[2021-11-30 04:00:34,844][openspeech.utils][INFO] - device : Tesla V100-SXM2-32GB
[2021-11-30 04:00:34,844][openspeech.utils][INFO] - device : Tesla V100-SXM2-32GB
[2021-11-30 04:00:34,844][openspeech.utils][INFO] - CUDA is available : True
[2021-11-30 04:00:34,844][openspeech.utils][INFO] - CUDA version : 10.1
[2021-11-30 04:00:34,844][openspeech.utils][INFO] - PyTorch version : 1.7.1
/home/ubuntu/anaconda3/envs/openspeech/envs/openspeech2/lib/python3.7/site-packages/torchaudio/backend/utils.py:54: UserWarning: "sox" backend is being deprecated. The default backend will be changed to "sox_io" backend in 0.8.0 and "sox" backend will be removed in 0.9.0. Please migrate to "sox_io" backend. Please refer to pytorch/audio#903 for the detail.
'"sox" backend is being deprecated. '
/home/ubuntu/anaconda3/envs/openspeech/envs/openspeech2/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:287: LightningDeprecationWarning: Passing Trainer(accelerator='dp') has been deprecated in v1.5 and will be removed in v1.7. Use Trainer(strategy='dp') instead.
f"Passing Trainer(accelerator={self.distributed_backend!r}) has been deprecated"
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
/home/ubuntu/anaconda3/envs/openspeech/envs/openspeech2/lib/python3.7/site-packages/pytorch_lightning/core/datamodule.py:470: LightningDeprecationWarning: DataModule.setup has already been called, so it will not be called again. In v1.6 this behavior will change to always call DataModule.setup.
f"DataModule.{name} has already been called, so it will not be called again. "
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]

  | Name      | Type             | Params
0 | criterion | CTCLoss          | 0
1 | fc        | Linear           | 2.6 M
2 | encoder   | ConformerEncoder | 114 M

117 M Trainable params
0 Non-trainable params
117 M Total params
469.463 Total estimated model params size (MB)
Validation sanity check: 0it [00:00, ?it/s]<openspeech.data.audio.dataset.SpeechToTextDataset object at 0x7fd0948608d0>
Validation sanity check: 0%| | 0/2 [00:00<?, ?it/s]/home/ubuntu/anaconda3/envs/openspeech/envs/openspeech2/lib/python3.7/site-packages/torchaudio/compliance/kaldi.py:574: UserWarning: The function torch.rfft is deprecated and will be removed in a future PyTorch release. Use the new torch.fft module functions, instead, by importing torch.fft and calling torch.fft.fft or torch.fft.rfft. (Triggered internally at /opt/conda/conda-bld/pytorch_1607370141920/work/aten/src/ATen/native/SpectralOps.cpp:590.)
fft = torch.rfft(strided_input, 1, normalized=False, onesided=True)
/home/ubuntu/anaconda3/envs/openspeech/envs/openspeech2/lib/python3.7/site-packages/torchaudio/compliance/kaldi.py:574: UserWarning: The function torch.rfft is deprecated and will be removed in a future PyTorch release. Use the new torch.fft module functions, instead, by importing torch.fft and calling torch.fft.fft or torch.fft.rfft. (Triggered internally at /opt/conda/conda-bld/pytorch_1607370141920/work/aten/src/ATen/native/SpectralOps.cpp:590.)
fft = torch.rfft(strided_input, 1, normalized=False, onesided=True)
/home/ubuntu/anaconda3/envs/openspeech/envs/openspeech2/lib/python3.7/site-packages/torchaudio/compliance/kaldi.py:574: UserWarning: The function torch.rfft is deprecated and will be removed in a future PyTorch release. Use the new torch.fft module functions, instead, by importing torch.fft and calling torch.fft.fft or torch.fft.rfft. (Triggered internally at /opt/conda/conda-bld/pytorch_1607370141920/work/aten/src/ATen/native/SpectralOps.cpp:590.)
fft = torch.rfft(strided_input, 1, normalized=False, onesided=True)
/home/ubuntu/anaconda3/envs/openspeech/envs/openspeech2/lib/python3.7/site-packages/torchaudio/compliance/kaldi.py:574: UserWarning: The function torch.rfft is deprecated and will be removed in a future PyTorch release. Use the new torch.fft module functions, instead, by importing torch.fft and calling torch.fft.fft or torch.fft.rfft. (Triggered internally at /opt/conda/conda-bld/pytorch_1607370141920/work/aten/src/ATen/native/SpectralOps.cpp:590.)
fft = torch.rfft(strided_input, 1, normalized=False, onesided=True)
Global seed set to 1

The run never moves on to the training step; it stays stuck at this point indefinitely.
From what I can tell, the problem occurs somewhere inside trainer.fit(), but I can't find the cause or how to solve it.
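
When a run hangs like this without any error, one general way to see where it is stuck is Python's standard faulthandler module, which can print the stack of every thread after a timeout. Below is a minimal sketch, assuming a couple of lines can be added to the training script before trainer.fit() is called. The helper names watch_for_hang and stop_watching and the 60-second default are illustrative and not part of openspeech. Note that this only shows threads of the main process, not DataLoader worker processes.

```python
import faulthandler
import sys


def watch_for_hang(timeout_s=60.0, file=sys.stderr):
    """Dump the tracebacks of all threads every `timeout_s` seconds.

    Hypothetical helper: call it right before the code that may hang.
    `file` must be a real file object (it needs a file descriptor).
    """
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)


def stop_watching():
    """Cancel the periodic dump once the suspicious section has finished."""
    faulthandler.cancel_dump_traceback_later()
```

Calling watch_for_hang() just before trainer.fit(...) and stop_watching() after it should make the process print, to stderr, the line it is currently executing each time the timeout expires, which narrows down where the hang is.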


resurgo97 · 2021-11-29T19:58:11Z

It looks like this may have something to do with the data_module (in hydra_train.py).

When I run
for data in data_module.train_dataloader():
    print(data)

it hangs and never prints anything.
However, I still don't understand why it fails to load the data.
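
The check above can be turned into a small test that reports a hang instead of blocking forever. This is a minimal sketch using only the standard library; first_batch_with_timeout is a hypothetical helper name, and it accepts any iterable, so the assumption is that data_module.train_dataloader() would be passed in as the argument.

```python
import threading


def first_batch_with_timeout(iterable, timeout_s=30.0):
    """Try to fetch the first item of `iterable` within `timeout_s` seconds.

    Returns (True, item) if an item arrived in time (item is None when the
    iterable is empty), or (False, None) if the fetch is still blocked.
    """
    result = {}

    def _fetch():
        try:
            result["item"] = next(iter(iterable), None)
        except Exception as exc:  # hand loader errors back to the caller
            result["error"] = exc

    # Daemon thread: if the fetch never returns, it will not block interpreter exit.
    worker = threading.Thread(target=_fetch, daemon=True)
    worker.start()
    worker.join(timeout_s)

    if worker.is_alive():
        return False, None  # still waiting: the loader is hanging or very slow
    if "error" in result:
        raise result["error"]
    return True, result["item"]
```

With a real loader this would be called as first_batch_with_timeout(data_module.train_dataloader(), timeout_s=60). If it times out, building the same PyTorch DataLoader with num_workers=0 moves loading into the main process, which often makes the underlying cause easier to see.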

resurgo97 · 2021-11-29T20:12:10Z

I finally found the problem.

In the config, trainer.sampler was set to "smart", which for some reason does not work here.
Switching it to "random" fixes the hang, and training now starts.

resurgo97 closed this as completed Nov 29, 2021

resurgo97 mentioned this issue Dec 4, 2021

Expected period of time hanging(or pre-processing) in the beginning of the training #102

Open
