-
Notifications
You must be signed in to change notification settings - Fork 166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Celeba64 training yielding nan during training #2
Comments
Do you get NAN result from the beginning? In my personal operation, due to the GoogleDrive downloading problem, i get the CelebaA-64 separately, and put them in the same folder(./script/celeba_org/celeba), which contains (identity_CelebA.txt then do as the instruction of README about preprocessing process
` No any other change. And at training process:
--num_channels_enc 64 --num_channels_dec 64 --epochs 1000 --num_postprocess_cells 2 --num_preprocess_cells 2 \ --num_latent_scales 3 --num_latent_per_group 20 --num_cell_per_cond_enc 2 --num_cell_per_cond_dec 2 \ --num_preprocess_blocks 1 --num_postprocess_blocks 1 --weight_decay_norm 1e-1 --num_groups_per_scale 20 \ --batch_size 4 --num_nf 1 --ada_groups --num_process_per_node 1 --use_se --res_dist --fast_adamax Besides of the address of the data that I put change to the corresponding address, the only wrong I countered was the parameter --num_process_per_node 1 means the GPU number that available. Here is some provisional result: And for 1024*1024 pic: 🌟 Hoping that my expression can do any help for you and any other counter problems! |
Thanks @Lukelluke! I am glad that everything worked fine for you. @kaushik333, do you get NaN from the very beginning of the training? Or does it happen after a few epochs? We haven't done anything beyond running that command to generate the LMDB datasets. |
Hello @arash-vahdat and @Lukelluke . I get NaN right from the beginning which is very strange. I am now trying what @Lukelluke did i.e. download celebA myself and follow the same instructions (not sure if this will help, but still giving it a shot). |
I haven't seen NaN at the beginning of training, especially given that the learning rate is linearly increased from a very small value, this is unlikely to happen. Most training instability happens after we anneal the learning rate which happens in |
@arash-vahdat @Lukelluke I seemed to have solved the issue. The training now proceeds as expected. Downloading the dataset myself somehow seemed to do the trick (Strange, I know). But anyways thank you very much :) |
I am very glad that it worked. Thanks @Lukelluke for the hint. I will add a link to this issue to help other folks that may run into the same issue. |
It a great pleasure that i can do any help to this great job. Thank you Dr.Vahdat again for your kindness of releasing this official implement ! Hoping that we can get more inspirations from this work which make VAE great again ! |
Hi, @kaushik333 ,when i ran into epoch 5, i countered the same problem of present 'nan' result. and, i remembered one detail on preprocessing Celeba64: i mentioned that i download the dataset separately, and then, to overcome the problem of As you can see, i tried to over across this original check. I'm afraid that this is the source of the problem. I wanna ask for your help, and wanna know how you solve the problem of data preprocessing. Ps. When I tried to retrain from stored ckpt model, with the command of Sincerely, Luke Huang |
Hi Luke, I remember your batch size is very small. We anneal the learning to 1e-2 in 5 epochs. There is a good chance that you are seeing NaN because of the high learning rate for small batch size. I would recommend reducing the learning rate to smaller values such as 5e-3 or even 1e-3. You can increase the frequency of saving checkpoint by modifying this like: Line 120 in feb9937
If you change save_freq to 1, it will save every epoch. |
@Lukelluke are you sure its epoch 5 and not 50? I run it with batch_size 8 and ~5000 iterations is 1 epoch (I know because train_nelbo gets printed only once per epoch). If youre using batch_size 4, I guess you're at 25th epoch? Also, I had to stop my training ~20th epoch. But I never faced this issue. |
Dr. @arash-vahdat : In order to train it on 256x256 images, are there any additional architectural changes or do I need to just alter the flags according to info provided in the paper? |
Thank you so much! Dr.@arash-vahdat , I changed the frequency as you taught, and modify the '--learning_rate 1e-3' , hoping that everything will be ok. Ps. I'm now trying to apply NVAE in audio field, however i countered some special problem:
Hello, @kaushik333 , I reconfirmed that that 'nan' error happened during epoch 5, and i take the advice of Dr.arash-vahdat , modify the init learning rate to 1e-3, hoping that can run successfully after 5 epoch. And congratulate you that run successfully, pity that my GPU can only support 4 batch size, lol As for 256*256, as i mentioned that i tried to run successfully, and i just follow the README, download tfrecord files and preprocess to lmdb data format. and as for command, all due to my gpu limitation, i reduced the And if you need, my pleasure to share pretrained ckpt model. |
Hello Dr. Vahdat,
This is indeed impressive work !!! However I am struggling with the training process.
Using Pytorch 1.6.0 with cuda 10.1
Training using 4 (Not V-100) GPUs of size ~12GB each. Reduced batch size to 8 to fit memory. No other changes apart from this. Followed the instructions exactly as given in Readme. But the training logs show that I am get "nan" losses
Is there any other pre-processing step I need to do for the dataset? Perhaps any other minor detail which you felt was irrelevant to mention in the readme? Any help you can provide is greatly appreciated.
The text was updated successfully, but these errors were encountered: