Add ROCm SDXL testing and benchmarking. #17183
Conversation
For future work in this repo, please either use a username-prefixed branch name or a fork of the repo. Unprefixed branch names in the main repo are (should be) reserved for long-lived, shared branches.
Also, nit: please capitalize the PR title and use sentence style, like "Add ROCm SDXL testing and benchmarking".
linux_x86_64_rocm_models:
  name: Linux (x86_64) AMDGPU Rocm Model Testing + Benchmark
Put MI250 in these names.
Consider these names:
PkgCI / Regression Test AMDGPU-Vulkan / Linux (x86_64) Model Testing
PkgCI / Regression Test AMDGPU-ROCm / Linux (x86_64) AMDGPU Rocm Model Testing + Benchmark
... and on the 'run logs' page, the names are long and have some redundant information.
How about Linux MI250 - Models here, for PkgCI / Regression Test AMDGPU-ROCm / Linux MI250 - Models. Or just MI250 - Models.
"config_name": "gpu_rocm",
"iree_compile_flags" : [
  "--iree-hal-target-backends=rocm",
  "--iree-rocm-target-chip=gfx90a",
ROCm compilation is tied to a specific chip, so we should call out that chip (either by identifier like gfx90a, or by name like MI250). That should be reflected in both the config_name and the file name. If we run model tests on w7900, that will use a different chip (and thus config file).
Makes sense. Multiple machines have the same chip name, so we will go with the chip.
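As an illustration of the suggestion above, a chip-qualified config could carry the chip in both the config_name and the file name (the file name, say config_gpu_rocm_gfx90a.json, and the renamed config_name below are hypothetical; the flag values are from this PR):

```json
{
  "config_name": "gpu_rocm_gfx90a",
  "iree_compile_flags": [
    "--iree-hal-target-backends=rocm",
    "--iree-rocm-target-chip=gfx90a"
  ]
}
```

A w7900 run would then get its own file (e.g. config_gpu_rocm_gfx1100.json) with no name conflict.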
"--iree-opt-const-eval=false",
"--iree-codegen-transform-dialect-library=attention_and_matmul_spec.mlir"
This might be a good time to test both the default flags and some "additional flags" (we can choose a better name for that).
If the default flags fail compilation, that's good to test with XFAIL.
-n 4 \
-rpfE \
-k real_weights \
--no-skip-tests-missing-files \
--capture=no \
--timeout=1200 \
--retries 2 \
--retry-delay 5 \
--durations=0 \
As we're repeating flags in these workflows, we could move some of them to config files: https://docs.pytest.org/en/stable/reference/customize.html
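For example, a hypothetical pytest.ini in the test suite (or a [tool.pytest.ini_options] table in pyproject.toml) could hold the flags shared by every workflow, leaving only job-specific ones like -k on the command line; the subset below is illustrative:

```ini
[pytest]
addopts =
    -rpfE
    --timeout=1200
    --durations=0
    --retries 2
    --retry-delay 5
```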
"--parameters=model=real_weights.irpa",
"--module=sdxl_scheduled_unet_pipeline_fp16_cpu.vmfb",
"--input=1x4x128x128xf16=@inference_input.0.bin",
"--input=2x64x2048xf16=@inference_input.1.bin",
"--input=2x1280xf16=@inference_input.2.bin",
"--input=1xf16=@inference_input.3.bin",
"--expected_output=1x4x128x128xf16=@inference_output.0.bin",
"--expected_f16_threshold=0.8f"
These flags usually live with the model itself and not as part of the config file, but the "scheduled" part of this model / test case is a bit special. Let's keep an eye on that and aim to refactor later. Could file an issue in the test suite repo.
"config_name": "gpu_rocm",
"iree_compile_flags" : [
  "--iree-hal-target-backends=rocm",
  "--iree-rocm-target-chip=gfx90a",
MI250? (anything specific to a particular device should call that out in the file name, so we can add other devices without name conflicts)
"--iree-global-opt-propagate-transposes=true",
"--iree-opt-outer-dim-concat=true",
"--iree-vm-target-truncate-unsupported-floats",
"--iree-llvmgpu-enable-prefetch=true",
"--iree-opt-data-tiling=false",
"--iree-codegen-gpu-native-math-precision=true",
"--iree-codegen-llvmgpu-use-vector-distribution",
"--iree-preprocessing-pass-pipeline=builtin.module(iree-preprocessing-transpose-convolution-pipeline, util.func(iree-preprocessing-pad-to-intrinsics))"
Default flags (with XFAIL as needed) and "additional" flags would be good to cover here too.
- name: "Running SDXL rocm pipeline benchmark"
  run: |
    source ${VENV_DIR}/bin/activate
    bash SHARK-TestSuite/iree_tests/benchmarks/benchmark_sdxl_rocm.sh
Let's keep an eye on this and explore ways to improve the ergonomics.
Brainstorming:
- Document what specifically is being measured (what device, what program, what flags)
- Documentation can be in a comment here, included in logs, hosted on wiki / developer doc / readme file, etc.
- We could plug in to IREE's existing benchmarking infrastructure to get data into https://perf.iree.dev/ and PR comments, but setup there is tricky
- A lighter weight option than posting a PR comment is adding a job summary: https://docs.github.com/en/actions/using-workflows/workflow-commands-for-github-actions#adding-a-job-summary. We use that for e.g. https://github.com/iree-org/iree/actions/runs/8840713851?pr=17183#summary-24276534845
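As a sketch of the job-summary option, a follow-up workflow step could append the benchmark output to the file GitHub Actions exposes via $GITHUB_STEP_SUMMARY (the step name and results file name below are hypothetical; the environment variable itself is provided by Actions):

```yaml
- name: "Add benchmark results to job summary"
  run: |
    echo "## SDXL ROCm benchmark results" >> "$GITHUB_STEP_SUMMARY"
    cat benchmark-results.md >> "$GITHUB_STEP_SUMMARY"
```

Anything written there renders as Markdown on the run page, with no PR-comment permissions needed.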
Yeah, sure, there are definitely improvements to be made here; it should be good to go for now.
It doesn't have to be in this PR, but let's also keep an eye on config file naming and consider creating some subfolders
We should pick a naming scheme like
+1 to naming things better going forward (our current situation is bad, so not hypothetical :)
Yeah, sure, let's come up with a universal way to name these config files in another PR.
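One hypothetical shape for such a scheme, encoding backend and chip in both subfolder and file names (every name below is made up, purely to illustrate the idea):

```
iree_tests/configs/
  cpu/config_cpu_llvm_task.json
  gpu_rocm_gfx90a/config_gpu_rocm_gfx90a.json
  gpu_vulkan/config_gpu_vulkan.json
```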
ci-exactly: build_packages, regression_test_cpu, regression_test_amdgpu_vulkan, regression_test_amdgpu_rocm, regression_test_nvidiagpu_vulkan, regression_test_nvidiagpu_cuda

Signed-off-by: Lubo Litchev <[email protected]>