error occurs when training transformer-xl by ddp #8494

Closed
ismymajia opened this issue Nov 12, 2020 · 1 comment

My environment is as follows:

  • transformers version: 3.4.0
  • Platform: Ubuntu-18.04
  • Python version: 3.6.9
  • PyTorch version (GPU?): 1.6.0+cu101 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

I am training Transformer-XL on one machine with multiple GPUs via DDP.

My script is as follows:

```bash
python -m torch.distributed.launch --nproc_per_node 4 run_language_modeling.py \
    --output_dir ${model_dir} \
    --tokenizer_name $data_dir/wordpiece-custom.json \
    --config_name $data_dir/$config_file \
    --train_data_files "$data_dir/train*.txt" \
    --eval_data_file $data_dir/valid.txt \
    --block_size=128 \
    --do_train \
    --do_eval \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --learning_rate 6e-4 \
    --weight_decay 0.01 \
    --adam_epsilon 1e-6 \
    --adam_beta1 0.9 \
    --adam_beta2 0.98 \
    --max_steps 500_000 \
    --warmup_steps 24_000 \
    --fp16 \
    --logging_dir ${model_dir}/tensorboard \
    --save_steps 5000 \
    --save_total_limit 20 \
    --seed 108 \
    --max_steps -1 \
    --num_train_epochs 20 \
    --dataloader_num_workers 0 \
    --overwrite_output_dir
```
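
For context, `torch.distributed.launch` spawns one worker process per GPU (`--nproc_per_node 4` here) and passes each worker a `--local_rank` argument. A generic sketch of what each worker does with it is below; this is only an illustration of the launcher's contract, not the actual `run_language_modeling.py` internals, which handle this through `TrainingArguments`.

```python
import argparse
import torch

# Generic sketch of how a torch.distributed.launch worker sets up DDP.
# run_language_modeling.py does the equivalent via TrainingArguments;
# this only illustrates what the launcher provides to each process.
parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=-1)
args = parser.parse_args()

if args.local_rank != -1:
    torch.cuda.set_device(args.local_rank)  # bind this worker to one GPU
    # rendezvous via the env vars (MASTER_ADDR etc.) set by the launcher
    torch.distributed.init_process_group(backend="nccl")
```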

The following error occurs:

```
[INFO|language_modeling.py:242] 2020-11-11 11:54:46,363 >> Loading features from cached file /opt/ml/input/data/training/kyzhan/huggingface/data/train40G/cached_lm_PreTrainedTokenizerFast_126_train3.txt [took 116.431 s]
Traceback (most recent call last):
    main()
  File "run_hf_train_lm_ti.py", line 338, in main
    trainer.train(model_path=model_path)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 758, in train
    tr_loss += self.training_step(model, inputs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 1056, in training_step
    loss = self.compute_loss(model, inputs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/trainer.py", line 1082, in compute_loss
    outputs = model(**inputs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 511, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_transfo_xl.py", line 1056, in forward
    return_dict=return_dict,
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_transfo_xl.py", line 888, in forward
    word_emb = self.word_emb(input_ids)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/transformers/modeling_transfo_xl.py", line 448, in forward
    emb_flat.index_copy_(0, indices_i, emb_i)
RuntimeError: Expected object of scalar type Float but got scalar type Half for argument #4 'source' in call to _th_index_copy_
```
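
The error itself is a dtype mismatch: under `--fp16` the embedding output is half precision while the buffer it is copied into stays float32, and `index_copy_` requires both tensors to share a dtype. A minimal standalone sketch (my reading of the traceback, not the actual `modeling_transfo_xl.py` code) reproduces it:

```python
import torch

# Standalone sketch of the dtype clash in the traceback (an assumption
# based on the error message, not the exact transformers code):
# index_copy_ requires the source tensor to match the destination's dtype.
emb_flat = torch.zeros(4, 8)                   # float32 destination buffer
indices_i = torch.tensor([0, 2])
emb_i = torch.zeros(2, 8, dtype=torch.half)    # fp16 source, as under --fp16

try:
    emb_flat.index_copy_(0, indices_i, emb_i)  # RuntimeError: Float vs Half
except RuntimeError as err:
    print(err)

# Casting the source back to the destination's dtype avoids the error:
emb_flat.index_copy_(0, indices_i, emb_i.to(emb_flat.dtype))
```

Dropping `--fp16` from the launch command should sidestep the crash, at the cost of training in full precision.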

@TevenLeScao


stale bot commented Jan 16, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the wontfix label Jan 16, 2021
stale bot closed this as completed Jan 24, 2021