Trainer is always using IPEX, even when use_ipex=False #24871

Closed
4 tasks done
dmsuehir opened this issue Jul 18, 2023 · 5 comments · Fixed by #24885

@dmsuehir
Contributor

System Info

  • transformers version: 4.32.0.dev0
  • Platform: Linux-5.15.0-75-generic-x86_64-with-glibc2.35
  • Python version: 3.10.6
  • Huggingface_hub version: 0.16.4
  • Safetensors version: 0.3.1
  • Accelerate version: 0.21.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.0.1+cu117 (False)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: No
  • Using distributed or parallel set-up in script?: No

Who can help?

@sgugger

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Steps to reproduce the behavior:

  1. The issue can be reproduced with the text-classification example script (other scripts would have the same issue). I have intel-extension-for-pytorch==2.0.100 installed in my environment and am running the following run_glue.py command without --use_ipex (so it should default to False):
    export MODEL_NAME=distilbert-base-uncased
    export OUTPUT_DIR=/home/dmsuehir/glue_output
    export TASK_NAME=mrpc
    
    python run_glue.py \
     --model_name_or_path $MODEL_NAME \
     --task_name $TASK_NAME \
     --do_train \
     --max_seq_length 128 \
     --per_device_train_batch_size 64 \
     --learning_rate 2e-5 \
     --num_train_epochs 1 \
     --no_cuda \
     --output_dir $OUTPUT_DIR \
     --bf16
    
    The train metrics I see with this run are:
    ***** train metrics *****
      epoch                    =        1.0
      train_loss               =     0.6083
      train_runtime            = 0:00:37.35
      train_samples            =       3668
      train_samples_per_second =     98.191
      train_steps_per_second   =      1.553
    
    Note that we are seeing 98.191 samples/second.
  2. Next, run the same command, this time adding --use_ipex. Note that I am also deleting my output directory between runs.
    python run_glue.py \
      --model_name_or_path $MODEL_NAME \
      --task_name $TASK_NAME \
      --do_train \
      --max_seq_length 128 \
      --per_device_train_batch_size 64 \
      --learning_rate 2e-5 \
      --num_train_epochs 1 \
      --no_cuda \
      --output_dir $OUTPUT_DIR \
      --bf16 \
      --use_ipex
    
    I see a train_samples_per_second similar to step 1, which suggests IPEX was already being applied without the flag:
    ***** train metrics *****
      epoch                    =        1.0
      train_loss               =     0.6083
      train_runtime            = 0:00:37.94
      train_samples            =       3668
      train_samples_per_second =     96.654
      train_steps_per_second   =      1.528
    
  3. Finally, I debugged this issue to see how IPEX is being used in the Trainer. I found that it can be called in two places: (1) from the Trainer here, or (2) from accelerate here. The Trainer properly respects the use_ipex arg; however, accelerate always uses IPEX if it is installed. Digging deeper, I found that accelerate only skips IPEX when ACCELERATE_USE_IPEX is explicitly set to False/0 (a minimal sketch of this gating behavior follows these steps). To confirm this, I manually set ACCELERATE_USE_IPEX=0 and then ran the same script/args from step 1:
    export ACCELERATE_USE_IPEX=0
    
    python run_glue.py \
     --model_name_or_path $MODEL_NAME \
     --task_name $TASK_NAME \
     --do_train \
     --max_seq_length 128 \
     --per_device_train_batch_size 64 \
     --learning_rate 2e-5 \
     --num_train_epochs 1 \
     --no_cuda \
     --output_dir $OUTPUT_DIR \
     --bf16
    
    Now I see these training metrics, with a clear drop in train_samples_per_second, indicating that IPEX has actually been turned off now that the env var is set:
    ***** train metrics *****
      epoch                    =        1.0
      train_loss               =      0.697
      train_runtime            = 0:01:07.74
      train_samples            =       3668
      train_samples_per_second =     54.143
      train_steps_per_second   =      0.856
    
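As a reference for the behavior described in step 3, here is a minimal sketch (my own simplification, not accelerate's actual code) of the gating logic I observed: IPEX is applied whenever intel_extension_for_pytorch is importable, unless ACCELERATE_USE_IPEX is explicitly set to a falsy value.

    # Simplified illustration of the observed gating (hypothetical helper, not accelerate code)
    import importlib.util
    import os

    def should_use_ipex() -> bool:
        flag = os.environ.get("ACCELERATE_USE_IPEX", "").strip().lower()
        if flag in ("0", "false", "no", "off"):
            # an explicit opt-out via the env var is the only thing that disables IPEX
            return False
        # otherwise IPEX is picked up whenever the extension is installed,
        # regardless of the Trainer's use_ipex=False default
        return importlib.util.find_spec("intel_extension_for_pytorch") is not None

This matches what I saw in steps 1 and 2: with the extension installed, throughput is essentially the same with and without --use_ipex.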

Expected behavior

When use_ipex is not given or is set to False, ipex.optimize should not be called.

If it's agreed that this is in fact a bug, I would be happy to work on a PR to fix it. I saw that other accelerate env vars are getting set from training_args.py.
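To make the proposal concrete, here is a rough sketch of the kind of change I have in mind (hypothetical helper name, not the actual patch from #24885): training_args.py could mirror use_ipex into the env var that accelerate reads.

    # Hypothetical sketch of the proposed fix (not the actual code in training_args.py)
    import os

    def sync_ipex_env(use_ipex: bool) -> None:
        # export the Trainer's use_ipex decision so accelerate sees the same setting
        os.environ["ACCELERATE_USE_IPEX"] = "1" if use_ipex else "0"

    # e.g. called once while the training arguments are being resolved:
    # sync_ipex_env(training_args.use_ipex)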

@ydshieh
Collaborator

ydshieh commented Jul 18, 2023

cc @muellerzr (right?)

@muellerzr
Contributor

muellerzr commented Jul 18, 2023

This is a problem that should be solved in Accelerate; I'll work on a PR for this today. Thanks for the flag!

Edit: actually, this can be solved in the training args; PR coming shortly.

@muellerzr muellerzr self-assigned this Jul 18, 2023
@muellerzr
Contributor

@dmsuehir can you try running again with pip install git+https://github.com/huggingface/transformers@muellerzr-ipex and set use_ipex to False? (it's the default)

@muellerzr muellerzr linked a pull request Jul 18, 2023 that will close this issue
@dmsuehir
Contributor Author

@muellerzr Yes, the fix in your branch works. Thanks!

@dmsuehir
Contributor Author

@muellerzr By the way, I think no_cuda and ACCELERATE_USE_CPU may have the same issue, but I don't have a GPU on my machine to verify.
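If someone with a GPU wants to check, a quick way to see whether no_cuda is being respected might look like the snippet below (hypothetical and untested on my side, since I only have a CPU machine):

    # Quick check on a machine with a GPU (I cannot run this myself):
    # if no_cuda is honored end to end, the resolved device should be CPU.
    from transformers import TrainingArguments

    args = TrainingArguments(output_dir="/tmp/out", no_cuda=True)
    print(args.device)  # expected: cpu; a cuda device here would point at the same kind of mismatch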
