Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an initial test_linux_packages sub-workflow to our CI #123

Merged
merged 30 commits into from
Mar 3, 2025

Conversation

amd-chrissosa
Copy link
Contributor

@amd-chrissosa amd-chrissosa commented Feb 28, 2025

This PR introduces testing on our MI300 fleet focused on installing rocminfo and its dependencies that were built and uploaded into s3 from the previous build_linux_packages phase.

A few callouts here:

  1. Hard-coded dependencies for rocminfo, we should theoretically script this out and be smarter about it.
  2. Not running within a container on the MI300, ran into jank and didn't seem worth it. The only thing I need from the build docker is aws so I install it.
  3. Have to be careful about extracting the tarball on /. Initially it was overwriting /bin and killing sudo/readlink until I added the --keep-directory-symlink.

A few other notes, running this fast is a bit challenging. We may want to add a couple other github actions to make debugging easier. What I ended up doing was debugging with only the test part and a hardcoded prior CI run on an ubuntu machine first. Switched to an MI300. Then finally switched to depending on the build_linux_packages workflow.

@amd-chrissosa amd-chrissosa changed the title Initial test script for running rocminfo Add an initial test_linux_packages sub-workflow to our CI Feb 28, 2025
@amd-chrissosa amd-chrissosa marked this pull request as ready for review February 28, 2025 07:46
@amd-chrissosa amd-chrissosa self-assigned this Feb 28, 2025
@amd-chrissosa
Copy link
Contributor Author

About ready - fixing a couple last things possibly with the rebase but ready for inspection!

Copy link
Member

@marbre marbre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mysteriously the pipeline failed even though I was able to download the file that is logged to be non-existing. I re-triggered the pipeline step to see where it takes us.

Also left some comments. I think we should use lower privileges for download and maybe even for unpacking.

Copy link
Contributor

@geomin12 geomin12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good for the initial tests

i wonder if we could pair build_linux_packages with this to avoid having to re-download the artifacts, although the file sizes are so small it should be golden. may be something to consider when we start looking into larger artifacts

@marbre
Copy link
Member

marbre commented Feb 28, 2025

i wonder if we could pair build_linux_packages with this to avoid having to re-download the artifacts, although the file sizes are so small it should be golden. may be something to consider when we start looking into larger artifacts

That would mean to merge the two jobs as files can only be shared within a job. However, we're planing to break the break_linux_packages into multiple jobs to allow parallel runes. Thus I don't think there is any way that allows to avoid downloading the artifacts.

@amd-chrissosa
Copy link
Contributor Author

i wonder if we could pair build_linux_packages with this to avoid having to re-download the artifacts, although the file sizes are so small it should be golden. may be something to consider when we start looking into larger artifacts

That would mean to merge the two jobs as files can only be shared within a job. However, we're planing to break the break_linux_packages into multiple jobs to allow parallel runes. Thus I don't think there is any way that allows to avoid downloading the artifacts.

We also don't want to any needless compute on MI300s since they are a very hot commodity. We don't want to compile/build on MI300s.

- name: Run rocminfo
run: |
echo "Running rocminfo"
/bin/rocminfo
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the artifacts stored for run 13583436756, the recent rework of the layering and with (slightly adjusted) steps above, rocminfo isn't unpacked to /tmp/bin/rocminfo fior me locally. Are you running something that is already pre-existing on the system?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've swapped it around to fix the issue you mentioned above. I did run into an issue where I wasn't putting it in the right place but tweaked a couple things and made an rsync with --verbose to ensure I'm copying. I'll play around with moving to /tmp maybe monday or could do in a follow-up. I could see a reasonable argument to doing not in / so we won't need sudo.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for now you can just skip moving files at all :) This also makes sure we don't pollute test machines if test are not running containerized. Commented as part of my last review.

@amd-chrissosa
Copy link
Contributor Author

Ready for another review. Addressed all comments and incorporated fileset_tool in order to unpack paths.

@amd-chrissosa amd-chrissosa requested a review from marbre March 1, 2025 07:11
Comment on lines +11 to +13
push:
branches:
- ADHOCBUILD
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leftofter? Or should this rather be an input passed to workflow_dispatch and workflow_call, similar as in the build_linux_packages workflow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stanza is required even if empty if the workflow can be called by another workflow.

pushd "${BUILD_ARTIFACTS_DIR}"
mkdir output_dir
python ${{ github.workspace }}/build_tools/fileset_tool.py artifact-flatten *.tar.xz -o output_dir --verbose
sudo rsync --archive --keep-dirlinks --verbose output_dir/* /
Copy link
Member

@marbre marbre Mar 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think with that you have now you can just drop this line

Suggested change
sudo rsync --archive --keep-dirlinks --verbose output_dir/* /

in Run rocminfo you can then run

${{ github.workspace }}/${BUILD_ARTIFACTS_DIR}/output_dir/bin/rocminfo

- name: Run rocminfo
run: |
echo "Running rocminfo"
/bin/rocminfo
Copy link
Member

@marbre marbre Mar 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/bin/rocminfo
${{ github.workspace }}/${BUILD_ARTIFACTS_DIR}/output_dir/bin/rocminfo

@amd-chrissosa
Copy link
Contributor Author

amd-chrissosa commented Mar 3, 2025

Going to land this as is and then address comments in follow-up to make it a bit easier to debug with the workflow checked in on main! That ok on you @marbre ?

Copy link
Member

@marbre marbre left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, let's land this and improve in-tree :)

@amd-chrissosa amd-chrissosa merged commit 766eca8 into main Mar 3, 2025
3 checks passed
@amd-chrissosa amd-chrissosa deleted the add-testing-with-ossci branch March 3, 2025 18:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants