Stuck in trainer.fit() #127

Closed
resurgo97 opened this issue Nov 29, 2021 · 2 comments

@resurgo97

python ./openspeech_cli/hydra_train.py dataset=librispeech dataset.dataset_download=False dataset.dataset_path=/home/ubuntu/TEST/libri/ dataset.manifest_file_path=/home/ubuntu/TEST/libri/LibriSpeech/libri_subword_manifest.txt tokenizer=libri_subword model=conformer audio=fbank lr_scheduler=warmup_reduce_lr_on_plateau trainer=gpu criterion=ctc tokenizer.vocab_path=/home/ubuntu/TEST/libri/LibriSpeech trainer.logger=None
/home/ubuntu/anaconda3/envs/openspeech/envs/openspeech2/lib/python3.7/site-packages/hydra/core/default_element.py:126: UserWarning: In 'train': Usage of deprecated keyword in package header '# @Package group'.
See https://hydra.cc/docs/next/upgrades/1.0_to_1.1/changes_to_package_header for more information
See {url} for more information"""
audio:
  name: fbank
  sample_rate: 16000
  frame_length: 20.0
  frame_shift: 10.0
  del_silence: false
  num_mels: 80
  apply_spec_augment: true
  apply_noise_augment: false
  apply_time_stretch_augment: false
  apply_joining_augment: false
augment:
  apply_spec_augment: false
  apply_noise_augment: false
  apply_joining_augment: false
  apply_time_stretch_augment: false
  freq_mask_para: 27
  freq_mask_num: 2
  time_mask_num: 4
  noise_dataset_dir: None
  noise_level: 0.7
  time_stretch_min_rate: 0.7
  time_stretch_max_rate: 1.4
dataset:
  dataset: librispeech
  dataset_path: /home/ubuntu/TEST/libri/
  dataset_download: false
  manifest_file_path: /home/ubuntu/TEST/libri/LibriSpeech/libri_subword_manifest.txt
criterion:
  criterion_name: ctc
  reduction: mean
  zero_infinity: true
lr_scheduler:
  lr: 0.0001
  scheduler_name: warmup_reduce_lr_on_plateau
  lr_patience: 1
  lr_factor: 0.3
  peak_lr: 0.0001
  init_lr: 1.0e-10
  warmup_steps: 4000
model:
  model_name: conformer
  encoder_dim: 512
  num_encoder_layers: 17
  num_attention_heads: 8
  feed_forward_expansion_factor: 4
  conv_expansion_factor: 2
  input_dropout_p: 0.1
  feed_forward_dropout_p: 0.1
  attention_dropout_p: 0.1
  conv_dropout_p: 0.1
  conv_kernel_size: 31
  half_step_residual: true
  optimizer: adam
trainer:
  seed: 1
  accelerator: dp
  accumulate_grad_batches: 1
  num_workers: 4
  batch_size: 32
  check_val_every_n_epoch: 1
  gradient_clip_val: 5.0
  logger: None
  max_epochs: 20
  save_checkpoint_n_steps: 10000
  auto_scale_batch_size: binsearch
  sampler: smart
  name: gpu
  device: gpu
  use_cuda: true
  auto_select_gpus: true
tokenizer:
  sos_token:
  eos_token:
  pad_token:
  blank_token:
  encoding: utf-8
  unit: libri_subword
  vocab_size: 5000
  vocab_path: /home/ubuntu/TEST/libri/LibriSpeech

Global seed set to 1
[2021-11-30 04:00:34,822][openspeech.utils][INFO] - audio:
  name: fbank
  sample_rate: 16000
  frame_length: 20.0
  frame_shift: 10.0
  del_silence: false
  num_mels: 80
  apply_spec_augment: true
  apply_noise_augment: false
  apply_time_stretch_augment: false
  apply_joining_augment: false
augment:
  apply_spec_augment: false
  apply_noise_augment: false
  apply_joining_augment: false
  apply_time_stretch_augment: false
  freq_mask_para: 27
  freq_mask_num: 2
  time_mask_num: 4
  noise_dataset_dir: None
  noise_level: 0.7
  time_stretch_min_rate: 0.7
  time_stretch_max_rate: 1.4
dataset:
  dataset: librispeech
  dataset_path: /home/ubuntu/TEST/libri/
  dataset_download: false
  manifest_file_path: /home/ubuntu/TEST/libri/LibriSpeech/libri_subword_manifest.txt
criterion:
  criterion_name: ctc
  reduction: mean
  zero_infinity: true
lr_scheduler:
  lr: 0.0001
  scheduler_name: warmup_reduce_lr_on_plateau
  lr_patience: 1
  lr_factor: 0.3
  peak_lr: 0.0001
  init_lr: 1.0e-10
  warmup_steps: 4000
model:
  model_name: conformer
  encoder_dim: 512
  num_encoder_layers: 17
  num_attention_heads: 8
  feed_forward_expansion_factor: 4
  conv_expansion_factor: 2
  input_dropout_p: 0.1
  feed_forward_dropout_p: 0.1
  attention_dropout_p: 0.1
  conv_dropout_p: 0.1
  conv_kernel_size: 31
  half_step_residual: true
  optimizer: adam
trainer:
  seed: 1
  accelerator: dp
  accumulate_grad_batches: 1
  num_workers: 4
  batch_size: 32
  check_val_every_n_epoch: 1
  gradient_clip_val: 5.0
  logger: None
  max_epochs: 20
  save_checkpoint_n_steps: 10000
  auto_scale_batch_size: binsearch
  sampler: smart
  name: gpu
  device: gpu
  use_cuda: true
  auto_select_gpus: true
tokenizer:
  sos_token:
  eos_token:
  pad_token:
  blank_token:
  encoding: utf-8
  unit: libri_subword
  vocab_size: 5000
  vocab_path: /home/ubuntu/TEST/libri/LibriSpeech

[2021-11-30 04:00:34,841][openspeech.utils][INFO] - Operating System : Linux 4.4.0-176-generic
[2021-11-30 04:00:34,841][openspeech.utils][INFO] - Processor : x86_64
[2021-11-30 04:00:34,844][openspeech.utils][INFO] - device : Tesla V100-SXM2-32GB
[2021-11-30 04:00:34,844][openspeech.utils][INFO] - device : Tesla V100-SXM2-32GB
[2021-11-30 04:00:34,844][openspeech.utils][INFO] - device : Tesla V100-SXM2-32GB
[2021-11-30 04:00:34,844][openspeech.utils][INFO] - CUDA is available : True
[2021-11-30 04:00:34,844][openspeech.utils][INFO] - CUDA version : 10.1
[2021-11-30 04:00:34,844][openspeech.utils][INFO] - PyTorch version : 1.7.1
/home/ubuntu/anaconda3/envs/openspeech/envs/openspeech2/lib/python3.7/site-packages/torchaudio/backend/utils.py:54: UserWarning: "sox" backend is being deprecated. The default backend will be changed to "sox_io" backend in 0.8.0 and "sox" backend will be removed in 0.9.0. Please migrate to "sox_io" backend. Please refer to pytorch/audio#903 for the detail.
'"sox" backend is being deprecated. '
/home/ubuntu/anaconda3/envs/openspeech/envs/openspeech2/lib/python3.7/site-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:287: LightningDeprecationWarning: Passing Trainer(accelerator='dp') has been deprecated in v1.5 and will be removed in v1.7. Use Trainer(strategy='dp') instead.
f"Passing Trainer(accelerator={self.distributed_backend!r}) has been deprecated"
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
/home/ubuntu/anaconda3/envs/openspeech/envs/openspeech2/lib/python3.7/site-packages/pytorch_lightning/core/datamodule.py:470: LightningDeprecationWarning: DataModule.setup has already been called, so it will not be called again. In v1.6 this behavior will change to always call DataModule.setup.
f"DataModule.{name} has already been called, so it will not be called again. "
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2]

  | Name      | Type             | Params
----------------------------------------------
0 | criterion | CTCLoss          | 0
1 | fc        | Linear           | 2.6 M
2 | encoder   | ConformerEncoder | 114 M
----------------------------------------------
117 M Trainable params
0 Non-trainable params
117 M Total params
469.463 Total estimated model params size (MB)
Validation sanity check: 0it [00:00, ?it/s]<openspeech.data.audio.dataset.SpeechToTextDataset object at 0x7fd0948608d0>
Validation sanity check: 0%| | 0/2 [00:00<?, ?it/s]/home/ubuntu/anaconda3/envs/openspeech/envs/openspeech2/lib/python3.7/site-packages/torchaudio/compliance/kaldi.py:574: UserWarning: The function torch.rfft is deprecated and will be removed in a future PyTorch release. Use the new torch.fft module functions, instead, by importing torch.fft and calling torch.fft.fft or torch.fft.rfft. (Triggered internally at /opt/conda/conda-bld/pytorch_1607370141920/work/aten/src/ATen/native/SpectralOps.cpp:590.)
fft = torch.rfft(strided_input, 1, normalized=False, onesided=True)
/home/ubuntu/anaconda3/envs/openspeech/envs/openspeech2/lib/python3.7/site-packages/torchaudio/compliance/kaldi.py:574: UserWarning: The function torch.rfft is deprecated and will be removed in a future PyTorch release. Use the new torch.fft module functions, instead, by importing torch.fft and calling torch.fft.fft or torch.fft.rfft. (Triggered internally at /opt/conda/conda-bld/pytorch_1607370141920/work/aten/src/ATen/native/SpectralOps.cpp:590.)
fft = torch.rfft(strided_input, 1, normalized=False, onesided=True)
/home/ubuntu/anaconda3/envs/openspeech/envs/openspeech2/lib/python3.7/site-packages/torchaudio/compliance/kaldi.py:574: UserWarning: The function torch.rfft is deprecated and will be removed in a future PyTorch release. Use the new torch.fft module functions, instead, by importing torch.fft and calling torch.fft.fft or torch.fft.rfft. (Triggered internally at /opt/conda/conda-bld/pytorch_1607370141920/work/aten/src/ATen/native/SpectralOps.cpp:590.)
fft = torch.rfft(strided_input, 1, normalized=False, onesided=True)
/home/ubuntu/anaconda3/envs/openspeech/envs/openspeech2/lib/python3.7/site-packages/torchaudio/compliance/kaldi.py:574: UserWarning: The function torch.rfft is deprecated and will be removed in a future PyTorch release. Use the new torch.fft module functions, instead, by importing torch.fft and calling torch.fft.fft or torch.fft.rfft. (Triggered internally at /opt/conda/conda-bld/pytorch_1607370141920/work/aten/src/ATen/native/SpectralOps.cpp:590.)
fft = torch.rfft(strided_input, 1, normalized=False, onesided=True)
Global seed set to 1


At this point the run never moves on to the training step; it just hangs forever.
Looking into it, the problem seems to occur inside trainer.fit(), but I don't know the cause or how to solve it.

@resurgo97
Author

It looks like the problem may have something to do with the data_module (in hydra_train.py).

When I run

    for data in data_module.train_dataloader():
        print(data)

it gets stuck and does not print anything.
However, I still do not understand why it fails to load data.
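
For anyone hitting the same hang, here is a minimal sketch of one way to narrow it down: time the call that builds the loader separately from the fetch of the first batch, each with a timeout. The helper names `run_with_timeout` and `probe_loader` are hypothetical (they are not part of openspeech or PyTorch Lightning), and the sketch only assumes that the loader is an ordinary Python iterable.

```python
import threading


def run_with_timeout(fn, timeout_s):
    """Run fn() in a daemon thread; return (finished, value)."""
    box = {}

    def target():
        try:
            box["value"] = fn()
        except Exception as exc:  # hand the error back to the caller's thread
            box["error"] = exc

    thread = threading.Thread(target=target, daemon=True)
    thread.start()
    thread.join(timeout_s)
    if thread.is_alive():
        # Still running after the timeout: report it as not finished.
        return False, None
    if "error" in box:
        raise box["error"]
    return True, box["value"]


def probe_loader(make_loader, timeout_s=60.0):
    """Report which stage hangs: building the loader or reading one batch.

    make_loader is any zero-argument callable returning an iterable
    (hypothetical helper, for illustration only).
    """
    finished, loader = run_with_timeout(make_loader, timeout_s)
    if not finished:
        return "stuck while building the loader"
    finished, _ = run_with_timeout(lambda: next(iter(loader)), timeout_s)
    if not finished:
        return "stuck while fetching the first batch"
    return "ok"
```

With the objects from this issue, the call would presumably be `probe_loader(data_module.train_dataloader)`. If the first stage times out, the time is going into setting up the loader (for example its sampler) rather than into reading audio; a very slow setup and a true deadlock look the same here, so the timeout needs to be generous.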

@resurgo97
Author

I finally found the problem.

In the config, trainer.sampler set to "smart" is somehow not working.
Switching it to "random" works, and training now starts.
