Cuda OOM error when "saving batch" #110

Open
RuntimeRacer opened this issue May 1, 2023 · 20 comments

Comments

@RuntimeRacer
Contributor

RuntimeRacer commented May 1, 2023

Just ran into this in the middle of training. I assume the epoch ended and it tried to save something to disk, although that file is only a few MB in size.

2023-05-01 13:03:54,309 INFO [trainer.py:757] Epoch 1, batch 164100, train_loss[loss=2.413, ArTop10Accuracy=0.7735, over 4777.00 frames. ], tot_loss[loss=2.65, ArTop10Accuracy=0.7551, over 5235.90 frames. ], batch size: 17, lr: 8.73e-03
2023-05-01 13:04:19,027 INFO [trainer.py:757] Epoch 1, batch 164200, train_loss[loss=2.595, ArTop10Accuracy=0.7517, over 4929.00 frames. ], tot_loss[loss=2.66, ArTop10Accuracy=0.7534, over 5227.10 frames. ], batch size: 16, lr: 8.72e-03
2023-05-01 13:04:43,635 INFO [trainer.py:757] Epoch 1, batch 164300, train_loss[loss=2.718, ArTop10Accuracy=0.752, over 5319.00 frames. ], tot_loss[loss=2.662, ArTop10Accuracy=0.7525, over 5215.61 frames. ], batch size: 13, lr: 8.72e-03
2023-05-01 13:05:08,249 INFO [trainer.py:757] Epoch 1, batch 164400, train_loss[loss=2.874, ArTop10Accuracy=0.7315, over 5625.00 frames. ], tot_loss[loss=2.665, ArTop10Accuracy=0.7518, over 5210.18 frames. ], batch size: 12, lr: 8.72e-03
2023-05-01 13:05:32,968 INFO [trainer.py:757] Epoch 1, batch 164500, train_loss[loss=2.744, ArTop10Accuracy=0.7321, over 5302.00 frames. ], tot_loss[loss=2.675, ArTop10Accuracy=0.7504, over 5204.26 frames. ], batch size: 13, lr: 8.72e-03
2023-05-01 13:05:58,669 INFO [trainer.py:757] Epoch 1, batch 164600, train_loss[loss=2.714, ArTop10Accuracy=0.7356, over 5771.00 frames. ], tot_loss[loss=2.679, ArTop10Accuracy=0.7497, over 5181.83 frames. ], batch size: 14, lr: 8.71e-03
2023-05-01 13:06:11,613 INFO [trainer.py:1081] Saving batch to exp/valle/batch-bdd640fb-0667-1ad1-1c80-317fa3b1799d.pt
Traceback (most recent call last):
  File "/workspace/kajispeech-v2/vall-e/egs/commonvoice/bin/trainer.py", line 1150, in <module>
    main()
  File "/workspace/kajispeech-v2/vall-e/egs/commonvoice/bin/trainer.py", line 1143, in main
    run(rank=0, world_size=1, args=args)
  File "/workspace/kajispeech-v2/vall-e/egs/commonvoice/bin/trainer.py", line 1032, in run
    train_one_epoch(
  File "/workspace/kajispeech-v2/vall-e/egs/commonvoice/bin/trainer.py", line 669, in train_one_epoch
    scaler.scale(loss).backward()
  File "/home/runtimeracer/anaconda3/envs/kajispeech2/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/runtimeracer/anaconda3/envs/kajispeech2/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 640.00 MiB (GPU 0; 23.69 GiB total capacity; 20.74 GiB already allocated; 517.81 MiB free; 22.05 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

I'll continue with lowering max-duration from 80 to 60 now.
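
The error message hints at fragmentation ("If reserved memory is >> allocated memory"). A minimal sketch for reading the numbers out of such a message and comparing reserved against allocated memory follows. The helper name `summarize_oom` and the regular expression are illustrative assumptions that match the message format quoted above; they are not part of PyTorch or of this repository.

```python
import re

# Assumed pattern for the OOM message format quoted above (not a PyTorch API).
_OOM_PATTERN = re.compile(
    r"Tried to allocate (?P<request>[\d.]+) (?P<request_unit>[MG]iB).*?"
    r"(?P<allocated>[\d.]+) (?P<allocated_unit>[MG]iB) already allocated.*?"
    r"(?P<reserved>[\d.]+) (?P<reserved_unit>[MG]iB) reserved",
    re.DOTALL,
)

_UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def summarize_oom(message: str) -> dict:
    """Return requested, allocated and reserved sizes in MiB, plus reserved minus allocated."""
    match = _OOM_PATTERN.search(message)
    if match is None:
        raise ValueError("not a recognized CUDA OOM message")

    def to_mib(name: str) -> float:
        return float(match[name]) * _UNIT_TO_MIB[match[name + "_unit"]]

    requested = to_mib("request")
    allocated = to_mib("allocated")
    reserved = to_mib("reserved")
    return {
        "requested_mib": requested,
        "allocated_mib": allocated,
        "reserved_mib": reserved,
        "cached_gap_mib": reserved - allocated,
    }


message = (
    "CUDA out of memory. Tried to allocate 640.00 MiB (GPU 0; 23.69 GiB total capacity; "
    "20.74 GiB already allocated; 517.81 MiB free; 22.05 GiB reserved in total by PyTorch)"
)
summary = summarize_oom(message)
print(summary)
```

For the message above, the gap between reserved and allocated memory is about 1.3 GiB, which is small next to the 20.74 GiB already allocated. If the gap were large, setting the environment variable `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:<value>` before starting the trainer would be the option the message refers to; a suitable value is not given in this thread.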

@RuntimeRacer
Contributor Author

Can confirm this also happens with max-duration 60, again at around 164,500 steps. It is probably the epoch switch, but it breaks the process. Not sure what it tries to save there, but it fills up VRAM completely.

2023-05-02 02:17:29,934 INFO [trainer.py:757] Epoch 1, batch 164100, train_loss[loss=2.378, ArTop10Accuracy=0.7807, over 4777.00 frames. ], tot_loss[loss=2.62, ArTop10Accuracy=0.7615, over 5235.90 frames. ], batch size: 17, lr: 6.21e-03
2023-05-02 02:17:54,606 INFO [trainer.py:757] Epoch 1, batch 164200, train_loss[loss=2.57, ArTop10Accuracy=0.7547, over 4929.00 frames. ], tot_loss[loss=2.63, ArTop10Accuracy=0.7598, over 5227.10 frames. ], batch size: 16, lr: 6.21e-03
2023-05-02 02:18:19,167 INFO [trainer.py:757] Epoch 1, batch 164300, train_loss[loss=2.688, ArTop10Accuracy=0.7569, over 5319.00 frames. ], tot_loss[loss=2.632, ArTop10Accuracy=0.7586, over 5215.61 frames. ], batch size: 13, lr: 6.21e-03
2023-05-02 02:18:43,727 INFO [trainer.py:757] Epoch 1, batch 164400, train_loss[loss=2.85, ArTop10Accuracy=0.7406, over 5625.00 frames. ], tot_loss[loss=2.635, ArTop10Accuracy=0.758, over 5210.18 frames. ], batch size: 12, lr: 6.21e-03
2023-05-02 02:19:08,370 INFO [trainer.py:757] Epoch 1, batch 164500, train_loss[loss=2.717, ArTop10Accuracy=0.737, over 5302.00 frames. ], tot_loss[loss=2.644, ArTop10Accuracy=0.7566, over 5204.26 frames. ], batch size: 13, lr: 6.21e-03
2023-05-02 02:19:19,203 INFO [trainer.py:1081] Saving batch to exp/valle/batch-bdd640fb-0667-1ad1-1c80-317fa3b1799d.pt
Traceback (most recent call last):
  File "/workspace/kajispeech-v2/vall-e/egs/commonvoice/bin/trainer.py", line 1150, in <module>
    main()
  File "/workspace/kajispeech-v2/vall-e/egs/commonvoice/bin/trainer.py", line 1143, in main
    run(rank=0, world_size=1, args=args)
  File "/workspace/kajispeech-v2/vall-e/egs/commonvoice/bin/trainer.py", line 1032, in run
    train_one_epoch(
  File "/workspace/kajispeech-v2/vall-e/egs/commonvoice/bin/trainer.py", line 669, in train_one_epoch
    scaler.scale(loss).backward()
  File "/home/runtimeracer/anaconda3/envs/kajispeech2/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/runtimeracer/anaconda3/envs/kajispeech2/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 580.00 MiB (GPU 0; 23.69 GiB total capacity; 20.86 GiB already allocated; 515.81 MiB free; 22.05 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

@RuntimeRacer
Contributor Author

Might also be related to the Issue I mentioned here in combination with Japanese / Chinese symbols, since my dataset contains these: #94 (comment)

I'll try another iteration on the model later without these languages to see if that makes any difference.

@nivibilla

I think Vall-E X is for multi-language support. Not sure if Vall-E can learn multiple languages.

@RuntimeRacer
Contributor Author

From my testing yesterday it was able to transfer dialect from input sample to output even if it's a different language.
What Vall-E X does is actually getting rid of dialect, or rather, converting the dialect from source language to target language.

I believe the OOM issue comes from the model implementation not being able to handle the symbols. I've seen there is a PR that adds Chinese language support with a giant wordlist and G2PBackend, which I believe is needed to convert the symbol-based script into words that can be properly phonemized internally.

Yesterday I tried inference with something like this:

  1. 这是一个中文句子 ("this is a Chinese sentence") -> Gets OOM error
  2. Zhè shì yīgè zhōngwén jùzi ("this is a Chinese sentence", Latin representation as simpler, phonemizable words) -> Works

I believe this is what G2PBackend actually does internally, but I could be mistaken.

@nivibilla

Ah I see. Makes sense.

@RahulBhalley

From my testing yesterday it was able to transfer dialect from input sample to output even if it's a different language.

Do you mean we can train it on multiple languages already?

What Vall-E X does is actually getting rid of dialect, or rather, converting the dialect from source language to target language.

Let me know if I get it right. Does this mean that input is lang A and output is lang B? I thought that the language id would control the accent instead as in "Learning to Speak Foreign Language Fluently".

@RuntimeRacer
Contributor Author

RuntimeRacer commented May 4, 2023

Do you mean we can train it on multiple languages already?

I have a PR open for the CommonVoice dataset, and it is currently training on 24 different languages on my AI training machine. Unfortunately I still cannot be sure it trains on the full dataset, because I hit this OOM error after ~164,500 steps each time. I have implemented some handling code, though, which will hopefully cope the next time it hits the issue.

Let me know if I get it right. Does this mean that input is lang A and output is lang B? I thought that the language id would control the accent instead as in "Learning to Speak Foreign Language Fluently".

I might have understood that wrong as well. I just revisited their GitHub page, and I think they actually can control the accent explicitly. In most examples they just switch from English to Chinese and back, so it's hard to tell whether it could, for example, do French with a German accent very well. It probably can.

@RuntimeRacer
Contributor Author

Further debugging revealed that the crash seems to happen frequently when there are Cyrillic letters in the batch; processing these takes considerably more time and VRAM.
It's less broken than with Chinese / Japanese symbols, but it behaves similarly, so I believe the model also needs a phonetic conversion backend for these. I'm not sure why it has these issues with Cyrillic letters in particular, though, since they are just another alphabet, not too different from Latin.
For now, however, I will strip all languages that use a non-Latin alphabet from my training dataset and see if that fixes the issue.
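
One way to find transcripts that use a non-Latin alphabet is to look at the Unicode name of each letter. The sketch below uses only the standard library; the function names are hypothetical, and it is meant for raw transcripts rather than phonemized output (IPA symbols such as θ are named GREEK and would be flagged).

```python
import unicodedata


def non_latin_letters(text: str) -> set:
    """Return the alphabetic characters whose Unicode name does not start with LATIN."""
    return {
        ch
        for ch in text
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN")
    }


def is_latin_only(text: str) -> bool:
    """True if every letter in text belongs to the Latin script (digits and punctuation are ignored)."""
    return not non_latin_letters(text)


# Sample sentences of the kind discussed in this thread.
french = "Elle est toujours utilisée par Réseau ferré italien pour le service de l'infrastructure."
tatar = "Әмма уңай тәэсирләр алып килүче мәгълүмат белән бергә, негатив хәбәрләр дә таратыла."
chinese = "这是一个中文句子"
pinyin = "Zhè shì yīgè zhōngwén jùzi"

for sentence in (french, tatar, chinese, pinyin):
    print(is_latin_only(sentence), sentence)
```

Accented Latin letters (é, ī, è) count as Latin here, so French text and pinyin pass while Cyrillic and Chinese text do not.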

I'm also going to open a PR later with exception-handling code that prints verbose output when the error is hit and tries to skip broken batches and continue training.
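
A minimal sketch of the skip-and-continue idea follows. It is not the code from the PR mentioned above. The function `train_skipping_failures` and its parameters are hypothetical, and `MemoryError` stands in for the GPU exception so the example runs without PyTorch.

```python
from typing import Callable, Iterable, List, Tuple, Type


def train_skipping_failures(
    batches: Iterable,
    train_step: Callable,
    skip_on: Tuple[Type[BaseException], ...] = (MemoryError,),
    on_skip: Callable[[], None] = lambda: None,
) -> List[int]:
    """Run train_step on every batch; report and skip batches that raise one of skip_on."""
    skipped = []
    for index, batch in enumerate(batches):
        try:
            train_step(batch)
        except skip_on as error:
            # Verbose output so the offending batch can be inspected later.
            print(f"skipping batch {index}: {type(error).__name__}: {error}")
            skipped.append(index)
            on_skip()  # hook for cleanup, e.g. freeing cached memory
    return skipped


def fake_step(batch):
    # Stand-in for forward/backward; pretends that long batches exhaust memory.
    if len(batch) > 3:
        raise MemoryError("simulated out-of-memory")


skipped = train_skipping_failures([[1, 2], [1, 2, 3, 4, 5], [7]], fake_step)
print(skipped)
```

With PyTorch, one would presumably pass `skip_on=(torch.cuda.OutOfMemoryError,)` (the exception shown in the traceback) and an `on_skip` hook that clears gradients and calls `torch.cuda.empty_cache()`. Whether training stays healthy after a skipped backward pass depends on the rest of the loop, so treat this as a debugging aid rather than a fix.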

@nivibilla

@RuntimeRacer do you think it would be useful to preprocess the data into phonemes and then give that to VALL-E? I feel like this would solve a lot of the OOM errors.

@RuntimeRacer
Contributor Author

@nivibilla I followed the exact dataset preparation process for my CommonVoice training, so I assume the phoneme conversion has already happened. Also, it apparently IS able to process these letters and symbols, but generation performance and memory footprint are far worse than with Latin text.

Elle est toujours utilisée par Réseau ferré italien pour le service de l'infrastructure. -> This takes ~2 seconds in inference and 6.3 GB of VRAM
Әмма уңай тәэсирләр алып килүче мәгълүмат белән бергә, негатив хәбәрләр дә таратыла. -> This takes ~20 seconds in inference and 15 GB of VRAM

@nivibilla

Ah right, in inference we are using a lot of vram for phoneme conversion? That's strange.

@nivibilla

I'm finding it difficult to understand why there is a VRAM difference between languages. After converting to phonemes, I assume it should make no difference what language it is, unless the converted phonemes are much longer when it's not English?

@RuntimeRacer
Contributor Author

I can share my symbols file later. But I don't think these are very different; I had a look at them before starting training

@RuntimeRacer
Contributor Author

Almost forgot I wanted to share my symbols file from phonemization:

<eps> 0
! 1
" 2
( 3
) 4
, 5
. 6
1 7
: 8
; 9
? 10
_ 11
a 12
aɪ 13
aɪə 14
aɪɚ 15
aʊ 16
b 17
bn 18
d 19
dʑ 20
dʒ 21
e 22
enus 23
eɪ 24
f 25
h 26
hi 27
hy 28
i 29
iə 30
iː 31
iːː 32
j 33
k 34
kh 35
ko 36
l 37
m 38
n 39
nʲ 40
o 41
oʊ 42
oː 43
oːɹ 44
p 45
pa 46
q 47
r 48
s 49
t 50
tw 51
tɕ 52
tʃ 53
tʰ 54
u 55
uː 56
v 57
w 58
x 59
z 60
¡ 61
« 62
» 63
¿ 64
æ 65
ææ 66
ç 67
ð 68
ŋ 69
ɐ 70
ɐɐ 71
ɑ 72
ɑː 73
ɑːɹ 74
ɒ 75
ɔ 76
ɔɪ 77
ɔː 78
ɔːɹ 79
ə 80
əl 81
əʊ 82
ɚ 83
ɛ 84
ɛɹ 85
ɛː 86
ɜː 87
ɡ 88
ɡʰ 89
ɡʲ 90
ɪ 91
ɪɹ 92
ɪː 93
ɬ 94
ɯ 95
ɹ 96
ɾ 97
ʁ 98
ʃ 99
ʊ 100
ʊɹ 101
ʌ 102
ʌʌ 103
ʒ 104
ʔ 105
̃ 106
̩ 107
θ 108
ᵻ 109
— 110
“ 111
” 112
… 113

@lifeiteng
Owner

Try lowering --filter-max-duration from 20 (the default value) to 14. You can run python ./bin/display_manifest_statistics.py to get the duration distribution.
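
To get a feel for how much data a lower limit would drop, a rough sketch like the following can be applied to a list of utterance durations. The durations below are made-up values, `kept_fraction` is a hypothetical helper, and the `<=` comparison is an assumption about how the filter treats the boundary.

```python
def kept_fraction(durations, max_duration):
    """Fraction of utterances whose duration in seconds is at most max_duration."""
    if not durations:
        raise ValueError("durations must not be empty")
    kept = sum(1 for duration in durations if duration <= max_duration)
    return kept / len(durations)


# Made-up durations in seconds, for illustration only.
durations = [2.1, 4.8, 6.0, 9.5, 13.9, 15.2, 19.7]

for limit in (20, 14):
    print(f"limit {limit}s keeps {kept_fraction(durations, limit):.0%} of utterances")
```

The real distribution comes from the statistics script named above; this only shows the arithmetic of comparing two candidate limits.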

@RuntimeRacer
Contributor Author

@lifeiteng Based on my observations, I believe it is most likely a charset issue: #110 (comment)
Synthesis in English, French and German seems to work, however, so I don't believe it's an issue with multi-language training in general.

I shared the duration distribution for CV with 24 languages (including languages with Cyrillic, Chinese and Japanese charsets) in my commit here: https://github.com/lifeiteng/vall-e/pull/111/files#diff-aaf4d0ff4603a6956d6a4834fd5df31c65f62e95cee609f435828504c31a82fa

I will share my intermediate training model to allow further testing once CommonVoice epoch 1 has finished.

@chenjiasheng
Collaborator

chenjiasheng commented May 14, 2023

Have you increased the macro NUM_TEXT_TOKENS?
Token ids larger than this macro will cause out-of-bounds memory access.
@lifeiteng How about making this macro configurable from command-line args, leaving its default value at 512?

NUM_TEXT_TOKENS = 512
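
A quick way to test this hypothesis is to compare the largest id in the symbols file with NUM_TEXT_TOKENS. The sketch below assumes the "symbol id" line format of the symbols file shared earlier in this thread and assumes ids are used as indices into a table of NUM_TEXT_TOKENS entries; `load_symbol_ids` and `ids_fit` are hypothetical helpers.

```python
NUM_TEXT_TOKENS = 512  # value quoted in the comment above


def load_symbol_ids(lines):
    """Parse 'symbol id' lines into a dict mapping symbol to integer id."""
    table = {}
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        # Split on the last run of whitespace so the symbol itself is kept intact.
        symbol, token_id = line.rstrip("\n").rsplit(None, 1)
        table[symbol] = int(token_id)
    return table


def ids_fit(table, num_text_tokens=NUM_TEXT_TOKENS):
    """True if every id is a valid index into a table with num_text_tokens entries."""
    return max(table.values()) < num_text_tokens


# A few lines copied from the symbols file shared earlier, including its last entry.
sample = ["<eps> 0", "! 1", "a 12", "… 113"]
table = load_symbol_ids(sample)
print(max(table.values()), ids_fit(table))
```

The symbols file shared earlier ends at id 113, so by this check it stays well below 512. To check a real file, pass an open file handle (opened with `encoding="utf-8"`) instead of the sample list.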

@yonomitt

yonomitt commented Jun 7, 2023

Further debugging the issue revealed that the crash seems to happen frequently if there are Cyrillic letters in the batch; processing these takes considerably more time and VRAM. It's less broken than with Chinese / Japanese symbols, but it behaves similar. So I believe the model also needs a phonetic conversion backend for these. Not sure in the first place though, why it has these issues especially with cyrillic letters, since those are just another alphabet not too different from latin. I will however strip all languages using a non-latin alphabet from my training dataset now, and see if that fixes the issue.

Also going to provide a PR with exception handling code later, which provides some verbosity output if the error is hit and (tries to) skip broken batches and continue training.

I'm running into this issue, as well. I thought I had stripped out the non-latin alphabet characters from my dataset, but I still run into the issue. It passes the:

Sanity check -- see if any of the batches in epoch 1 would cause OOM.

But then fails on a specific batch.

Are you also stripping punctuation? What are you doing to filter out the non-latin alphabet characters?

@RuntimeRacer
Contributor Author

RuntimeRacer commented Jun 13, 2023

Have you increased the macro NUM_TEXT_TOKENS? Token ids larger than this macro will cause out of bound memory access. @lifeiteng How about making this macro configurable from commad line args, leaving its default value to 512.

NUM_TEXT_TOKENS = 512

So I tested this again. I tried values of 1024 and also 4096, but each time training starts breaking as soon as the first Cyrillic sentence appears. I believe this is an encoding-related issue.

EDIT: I will check if I can somehow apply this while reading in the datasets: https://pypi.org/project/anyascii/0.1.6/

@yonomitt

@RuntimeRacer Can you post how the text_tokens_lens and audio_features_lens compare for a Cyrillic sentence?

I think my OOMs were due to text that was way too long compared to the audio. So the text_tokens_lens was way longer than it should have been. To fix this, I've been doing a filter pass, which removes any data where:

(audio_features_lens / text_tokens_lens) < 1.0

Most good (English) data that I've spot checked seems to have a ratio around 6.0-6.5, but I've seen it as low as 4.0, too.
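
The filter pass described above can be sketched as follows. The dictionary keys mirror the names `audio_features_lens` and `text_tokens_lens` used in this comment; the example records and the helper `keep_example` are made up for illustration.

```python
def keep_example(audio_features_len: int, text_tokens_len: int, min_ratio: float = 1.0) -> bool:
    """Keep an example only if it has at least min_ratio audio frames per text token."""
    if text_tokens_len <= 0:
        return False  # no text to align against; treat as unusable
    return audio_features_len / text_tokens_len >= min_ratio


# Made-up examples with ratios of 6.5, 4.0 and 0.8.
examples = [
    {"id": "ok", "audio_features_lens": 650, "text_tokens_lens": 100},
    {"id": "low", "audio_features_lens": 400, "text_tokens_lens": 100},
    {"id": "bad", "audio_features_lens": 80, "text_tokens_lens": 100},
]

kept = [
    example["id"]
    for example in examples
    if keep_example(example["audio_features_lens"], example["text_tokens_lens"])
]
print(kept)
```

With the threshold of 1.0 from the comment, only the 0.8 example is dropped. Raising `min_ratio` toward the 4.0 to 6.5 range observed for good English data would be stricter, but whether that range carries over to other languages is not established in this thread.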
