profiling ops on xpu #2249

songhappy · 2025-01-10T19:44:22Z

Context

What is the purpose of this PR? Is it to

add a new feature

Please link to any issues this PR addresses.
https://jira.devtools.intel.com/browse/IPB-2875

Changelog

What are the changes made in this PR?
Added 'xpu' support in _profiler.py and restricted the CUDA-specific memory profiling in the driver scripts under the recipes directory to CUDA devices only.
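
The change described above could be sketched roughly as follows. This is a minimal, torch-free illustration and not the code from this PR: the helper names `select_profiler_activities` and `should_record_memory_history` are hypothetical, and the activity names are plain strings standing in for the members of `torch.profiler.ProfilerActivity` (that mapping is an assumption here).

```python
from typing import List


def select_profiler_activities(cpu: bool, cuda: bool, xpu: bool) -> List[str]:
    """Return the names of the profiler activities to enable.

    Hypothetical helper: the strings stand in for profiler activity
    enum members; 'XPU' is the entry this PR is described as adding.
    """
    activities = []
    if cpu:
        activities.append("CPU")
    if cuda:
        activities.append("CUDA")
    if xpu:
        activities.append("XPU")
    return activities


def should_record_memory_history(device_type: str, profile_memory: bool) -> bool:
    """Gate the CUDA-specific memory profiling path on the device type.

    Hypothetical helper: mirrors the 'and self._device.type == "cuda"'
    condition mentioned in the review discussion.
    """
    return profile_memory and device_type == "cuda"


# On an XPU machine, CPU and XPU activities are traced, and the
# CUDA-only memory path is skipped even if profile_memory is requested.
xpu_activities = select_profiler_activities(cpu=True, cuda=False, xpu=True)
xpu_memory = should_record_memory_history("xpu", profile_memory=True)
```

Keeping the device check in one small predicate makes it easier to verify that every recipe applies the same condition.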

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.

manually run any new or modified recipes with sufficient proof of correctness
Steps:

download the Llama-3.2-3B-Instruct model
modify recipes/configs/llama3_2/3B_full_single_device.yaml: set "device: xpu" and "profiler.enabled: True"
run: tune run full_finetune_single_device --config recipes/configs/llama3_2/3B_full_single_device.yaml
check the profiling results under /tmp/full-llama3.2-finetune/profiling_results
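
As an alternative to editing the YAML file by hand, the same run could probably be launched with command-line overrides. The sketch below only builds the argument list. It assumes that `tune run` accepts `key=value` overrides with dotted keys after `--config`; `build_tune_command` is a hypothetical helper, not part of torchtune.

```python
from typing import Dict, List


def build_tune_command(recipe: str, config: str, overrides: Dict[str, object]) -> List[str]:
    """Build a 'tune run' argument list with key=value config overrides.

    Hypothetical helper; assumes the CLI accepts dotted override keys.
    """
    command = ["tune", "run", recipe, "--config", config]
    command += [f"{key}={value}" for key, value in overrides.items()]
    return command


command = build_tune_command(
    "full_finetune_single_device",
    "recipes/configs/llama3_2/3B_full_single_device.yaml",
    {"device": "xpu", "profiler.enabled": True},
)
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` on a machine where torchtune and an XPU device are available.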

{
  "schemaVersion": 1,
  "deviceProperties": [
  ],
  "with_flops": 1,
  "record_shapes": 1,
  "profile_memory": 1,
  "with_stack": 1,
  "trace_id": "1DB69D6280304432870B411620032DC3",
  "traceEvents": [
  {
    "ph": "X", "cat": "cpu_op", "name": "aten::conv2d", "pid": 1341316, "tid": 1341316,
    "ts": 731713074715.025, "dur": 95265.387,
    "args": {
      "External id": 1, "Record function id": 0,
      "Concrete Inputs": ["", "", "", "[2, 2]", "[3, 3]", "[1, 1]", "1"],
      "Input type": ["float", "float", "", "ScalarList", "ScalarList", "ScalarList", "Scalar"],
      "Input Strides": [[150528, 50176, 224, 1], [147, 49, 7, 1], [], [], [], [], []],
      "Input Dims": [[32, 3, 224, 224], [64, 3, 7, 7], [], [], [], [], []],
      "Ev Idx": 0
    }
  },
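
A trace like the excerpt above can be inspected programmatically. The following is a minimal sketch that assumes the complete file is valid JSON with a top-level "traceEvents" list, as the excerpt suggests; `total_duration_by_op` is a hypothetical helper, and the inline sample reuses only the fields shown in the excerpt.

```python
import json
from collections import defaultdict
from typing import Dict


def total_duration_by_op(trace: dict, category: str = "cpu_op") -> Dict[str, float]:
    """Sum the 'dur' field of complete ('ph' == 'X') events per op name."""
    totals = defaultdict(float)
    for event in trace.get("traceEvents", []):
        if event.get("ph") == "X" and event.get("cat") == category:
            totals[event["name"]] += event.get("dur", 0.0)
    return dict(totals)


# Tiny in-memory sample built from the fields visible in the excerpt.
sample = json.loads(
    '{"traceEvents": [{"ph": "X", "cat": "cpu_op", "name": "aten::conv2d",'
    ' "ts": 731713074715.025, "dur": 95265.387}]}'
)
totals = total_duration_by_op(sample)
print(totals)
```

For a real run, the sample would be replaced by `json.load` on a trace file from the profiling output directory. Passing a different `category` would select other event kinds, if the trace contains them.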

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example

I did not change any public API

pytorch-bot · 2025-01-10T19:44:26Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2249

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 4 Cancelled Jobs

As of commit dd7809f with merge base 27fd3a1:

NEW FAILURES - The following jobs have failed:

GPU tests / gpu_test (3.11, stable) (gh)
Process completed with exit code 1.
Unit Test / unit_tests (3.10) (gh)
Process completed with exit code 1.

CANCELLED JOBS - The following jobs were cancelled. Please retry:

GPU tests / gpu_test (3.10, stable) (gh)
Process completed with exit code 1.
GPU tests / gpu_test (3.9, stable) (gh)
##[error]The operation was canceled.
Unit Test / unit_tests (3.11) (gh)
##[error]The operation was canceled.
Unit Test / unit_tests (3.9) (gh)
##[error]The operation was canceled.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

songhappy · 2025-01-17T19:18:01Z

@SalmanMohammadi Could you please review and approve it?

felipemello1 · 2025-01-17T20:18:13Z

Hey @songhappy, thanks for the PR!

Just two questions before I approve it:

I saw that you added 'and self._device.type == "cuda"' to some recipes, but not all. Is it because the other recipes already have it, or did we forget some?
In your testing, I don't think you set profiler.profile_memory=True. Do you think it's worth checking this option to make sure it's working for xpu?

profiling ops on xpu

c3cf0f5

facebook-github-bot added the CLA Signed label Jan 10, 2025

update

dd7809f
