error occurs when training transformer-xl by ddp #8494

Closed
ismymajia opened this issue Nov 12, 2020 · 1 comment

My environment is as follows:

  • transformers version: 3.4.0
  • Platform: Ubuntu-18.04
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.6.0+cu101 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

I am training Transformer-XL on one machine with multiple GPUs via DDP.

My script is as follows:

```bash
python -m torch.distributed.launch --nproc_per_node 4 run_language_modeling.py \
    --output_dir ${model_dir} \
    --tokenizer_name $data_dir/wordpiece-custom.json \
    --config_name $data_dir/$config_file \
    --train_data_files "$data_dir/train*.txt" \
    --eval_data_file $data_dir/valid.txt \
    --block_size=128 \
    --do_train \
    --do_eval \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --learning_rate 6e-4 \
    --weight_decay 0.01 \
    --adam_epsilon 1e-6 \
    --adam_beta1 0.9 \
    --adam_beta2 0.98 \
    --max_steps 500_000 \
    --warmup_steps 24_000 \
    --fp16 \
    --logging_dir ${model_dir}/tensorboard \
    --save_steps 5000 \
    --save_total_limit 20 \
    --seed 108 \
    --max_steps -1 \
    --num_train_epochs 20 \
    --dataloader_num_workers 0 \
    --overwrite_output_dir
```
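
For context, `torch.distributed.launch` spawns one worker process per GPU (`--nproc_per_node 4` here) and passes each worker a `--local_rank` argument. A generic sketch of what each worker does with it is below; this is only an illustration of the launcher's contract, not the actual `run_language_modeling.py` internals, which handle this through `TrainingArguments`.

```python
import argparse
import torch

# Generic sketch of how a torch.distributed.launch worker sets up DDP.
# run_language_modeling.py does the equivalent via TrainingArguments;
# this only illustrates what the launcher provides to each process.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()

if args.local_rank != -1:
    torch.cuda.set_device(args.local_rank)  # bind this worker to one GPU
    # rendezvous via the env vars (MASTER_ADDR etc.) set by the launcher
    torch.distributed.init_process_group(backend="nccl")
```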

The following error occurs:

```
[INFO|language_modeling.py:242] 2020-11-11 11:54:46,363 >> Loading features from cached file /opt/ml/input/data/training/kyzhan/huggingface/data/train40G/cached_lm_PreTrainedTokenizerFast_126_train3.txt [took 116.431 s]
Traceback (most recent call last):
    main()
  File "run_hf_train_lm_ti.py", line 338, in main
    trainer.train(model_path=model_path)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 758, in train
    tr_loss += self.training_step(model, inputs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 1056, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 1082, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 511, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_transfo_xl.py", line 1056, in forward
    return_dict=return_dict,
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_transfo_xl.py", line 888, in forward
    word_emb = self.word_emb(input_ids)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_transfo_xl.py", line 448, in forward
    emb_flat.index_copy_(0, indices_i, emb_i)
RuntimeError: Expected object of scalar type Float but got scalar type Half for argument #4 'source' in call to _th_index_copy_
```
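
The error itself is a dtype mismatch: under `--fp16` the embedding output is half precision while the buffer it is copied into stays float32, and `index_copy_` requires both tensors to share a dtype. A minimal standalone sketch (my reading of the traceback, not the actual `modeling_transfo_xl.py` code) reproduces it:

```python
import torch

# Standalone sketch of the dtype clash in the traceback (an assumption
# based on the error message, not the exact transformers code):
# index_copy_ requires the source tensor to match the destination's dtype.
emb_flat = torch.zeros(4, 8)                   # float32 destination buffer
indices_i = torch.tensor([0, 2])
emb_i = torch.zeros(2, 8, dtype=torch.half)    # fp16 source, as under --fp16

try:
    emb_flat.index_copy_(0, indices_i, emb_i)  # RuntimeError: Float vs Half
except RuntimeError as err:
    print(err)

# Casting the source back to the destination's dtype avoids the error:
emb_flat.index_copy_(0, indices_i, emb_i.to(emb_flat.dtype))
```

Dropping `--fp16` from the launch command should sidestep the crash, at the cost of training in full precision.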

@TevenLeScao


stale bot commented Jan 16, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Jan 16, 2021
stale bot closed this as completed Jan 24, 2021