Trainer is always using IPEX, even when use_ipex=False #24871

Closed
4 tasks done
dmsuehir opened this issue Jul 18, 2023 · 5 comments · Fixed by #24885

@dmsuehir
Contributor

System Info

  • transformers version: 4.32.0.dev0
  • Platform: Linux-5.15.0-75-generic-x86_64-with-glibc2.35
  • Python version: 3.10.6
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.1
  • Accelerate version: 0.21.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1+cu117 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Steps to reproduce the behavior:

  1. The issue can be reproduced with the text-classification example script (other scripts would have the same issue). I have intel-extension-for-pytorch==2.0.100 installed in my environment and am running the following run_glue.py command without --use_ipex (so it should default to False):
    export MODEL_NAME=distilbert-base-uncased
    export OUTPUT_DIR=/home/dmsuehir/glue_output
    export TASK_NAME=mrpc
    
    python run_glue.py \
     --model_name_or_path $MODEL_NAME \
     --task_name $TASK_NAME \
     --do_train \
     --max_seq_length 128 \
     --per_device_train_batch_size 64 \
     --learning_rate 2e-5 \
     --num_train_epochs 1 \
     --no_cuda \
     --output_dir $OUTPUT_DIR \
     --bf16
    
    The train metrics I see with this run are:
    ***** train metrics *****
      epoch                    =        1.0
      train_loss               =     0.6083
      train_runtime            = 0:00:37.35
      train_samples            =       3668
      train_samples_per_second =     98.191
      train_steps_per_second   =      1.553
    
    Note that we are seeing 98.191 samples/second.
  2. Next, run the same command, this time adding --use_ipex. Note that I am also deleting my output directory between runs.
    python run_glue.py \
      --model_name_or_path $MODEL_NAME \
      --task_name $TASK_NAME \
      --do_train \
      --max_seq_length 128 \
      --per_device_train_batch_size 64 \
      --learning_rate 2e-5 \
      --num_train_epochs 1 \
      --no_cuda \
      --output_dir $OUTPUT_DIR \
      --bf16 \
      --use_ipex
    
    I see a train_samples_per_second similar to step 1, which suggests IPEX was already being applied without the flag:
    ***** train metrics *****
      epoch                    =        1.0
      train_loss               =     0.6083
      train_runtime            = 0:00:37.94
      train_samples            =       3668
      train_samples_per_second =     96.654
      train_steps_per_second   =      1.528
    
  3. Finally, I debugged this issue to see how IPEX is being used in the Trainer. I found that it can be called in two places: (1) from the Trainer here, or (2) from accelerate here. The Trainer properly respects the use_ipex arg; however, accelerate always uses IPEX if it is installed. Digging deeper, I found that accelerate only skips IPEX when ACCELERATE_USE_IPEX is explicitly set to False/0 (a minimal sketch of this gating behavior follows these steps). To confirm this, I manually set ACCELERATE_USE_IPEX=0 and then ran the same script/args from step 1:
    export ACCELERATE_USE_IPEX=0
    
    python run_glue.py \
     --model_name_or_path $MODEL_NAME \
     --task_name $TASK_NAME \
     --do_train \
     --max_seq_length 128 \
     --per_device_train_batch_size 64 \
     --learning_rate 2e-5 \
     --num_train_epochs 1 \
     --no_cuda \
     --output_dir $OUTPUT_DIR \
     --bf16
    
    Now I see these training metrics, with a clear drop in train_samples_per_second, indicating that IPEX has actually been turned off now that the env var is set:
    ***** train metrics *****
      epoch                    =        1.0
      train_loss               =      0.697
      train_runtime            = 0:01:07.74
      train_samples            =       3668
      train_samples_per_second =     54.143
      train_steps_per_second   =      0.856
    
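As a reference for the behavior described in step 3, here is a minimal sketch (my own simplification, not accelerate's actual code) of the gating logic I observed: IPEX is applied whenever intel_extension_for_pytorch is importable, unless ACCELERATE_USE_IPEX is explicitly set to a falsy value.

    # Simplified illustration of the observed gating (hypothetical helper, not accelerate code)
    import importlib.util
    import os

    def should_use_ipex() -> bool:
        flag = os.environ.get("ACCELERATE_USE_IPEX", "").strip().lower()
        if flag in ("0", "false", "no", "off"):
            # an explicit opt-out via the env var is the only thing that disables IPEX
            return False
        # otherwise IPEX is picked up whenever the extension is installed,
        # regardless of the Trainer's use_ipex=False default
        return importlib.util.find_spec("intel_extension_for_pytorch") is not None

This matches what I saw in steps 1 and 2: with the extension installed, throughput is essentially the same with and without --use_ipex.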

Expected behavior

When use_ipex is not given or is set to False, ipex.optimize should not be called.

If it's agreed that this is in fact a bug, I would be happy to work on a PR to fix it. I saw that other accelerate env vars are getting set from training_args.py.
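To make the proposal concrete, here is a rough sketch of the kind of change I have in mind (hypothetical helper name, not the actual patch from #24885): training_args.py could mirror use_ipex into the env var that accelerate reads.

    # Hypothetical sketch of the proposed fix (not the actual code in training_args.py)
    import os

    def sync_ipex_env(use_ipex: bool) -> None:
        # export the Trainer's use_ipex decision so accelerate sees the same setting
        os.environ["ACCELERATE_USE_IPEX"] = "1" if use_ipex else "0"

    # e.g. called once while the training arguments are being resolved:
    # sync_ipex_env(training_args.use_ipex)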

@ydshieh
Collaborator

ydshieh commented Jul 18, 2023

cc @muellerzr (right?)

@muellerzr
Contributor

muellerzr commented Jul 18, 2023

This is a problem that should be solved in Accelerate; I'll work on a PR for this today. Thanks for the flag!

Edit: actually, this can be solved in the training args; PR coming shortly.

@muellerzr muellerzr self-assigned this Jul 18, 2023
@muellerzr
Contributor

@dmsuehir can you try running again with pip install git+https://github.com/huggingface/transformers@muellerzr-ipex and set use_ipex to False? (it's the default)

@muellerzr muellerzr linked a pull request Jul 18, 2023 that will close this issue
@dmsuehir
Contributor Author

@muellerzr Yes, the fix in your branch works. Thanks!

@dmsuehir
Contributor Author

@muellerzr By the way, I think no_cuda and ACCELERATE_USE_CPU may have the same issue, but I don't have a GPU on my machine to verify.
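If someone with a GPU wants to check, a quick way to see whether no_cuda is being respected might look like the snippet below (hypothetical and untested on my side, since I only have a CPU machine):

    # Quick check on a machine with a GPU (I cannot run this myself):
    # if no_cuda is honored end to end, the resolved device should be CPU.
    from transformers import TrainingArguments

    args = TrainingArguments(output_dir="/tmp/out", no_cuda=True)
    print(args.device)  # expected: cpu; a cuda device here would point at the same kind of mismatch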
