Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARM docker build with 24.12 pytorch fw image #581

Merged
merged 4 commits into from
Jan 10, 2025

Conversation

trvachov
Copy link
Collaborator

@trvachov trvachov commented Jan 8, 2025

Description

Updates ARM Dockerfile to work with 24.12 pytorch FW image

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

ARM build is not covered by pre-merge CI, is covered nightly. Having said that, this change might not work with CI at all because the blossom runners have a kernel/CUDA mismatch on their Grace systems.... more info TBD

Usage

docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -it build-image-name:tag /bin/bash

Pre-submit Checklist

  • I have tested these changes locally
  • [N/A] I have updated the documentation accordingly
  • [N/A] I have added/updated tests as needed
  • All existing tests pass successfully

@trvachov trvachov requested a review from pstjohn January 8, 2025 17:27
@trvachov trvachov force-pushed the trvachov/25.01-arm-docker-build branch from fcf8372 to 5918b45 Compare January 8, 2025 17:28
Copy link
Collaborator

@pstjohn pstjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there tests for this anywhere ATM? Do we need to be testing ARM platforms as part of our pre-merge testing generally? That would definitely need to be something we discuss with the azure runner folks if so

@trvachov trvachov force-pushed the trvachov/25.01-arm-docker-build branch from 5918b45 to 1bc2f6d Compare January 8, 2025 19:32
Dockerfile.arm Outdated
tensorstore==0.1.45

# For some reason, we do not need to do the tensorstore verson package hack on arm64
# RUN sed -i 's/^Version: 0\.0\.0$/Version: 0.1.45/' /usr/local/lib/python3.12/dist-packages/tensorstore-0.0.0.dist-info/METADATA && mv /usr/local/lib/python3.12/dist-packages/tensorstore-0.0.0.dist-info /usr/local/lib/python3.12/dist-packages/tensorstore-0.1.45.dist-info
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo remove

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Dockerfile.arm Outdated

cd ${WORKDIR} && \
RUN cd ${WORKDIR} && \
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pstjohn I broke up this run command into two -- my understanding is i'm doing this in an earlier stage of the docker build therefore this doesn't have negative image size consequences -- am I correct?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think all these layers end up in our release image, groan. I was thinking I would work a bit on trying to clean that up. But I think if you want to split these, that's fine; but you should move the

rm -rf llvm-project

line up to this block (if you can)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cant move the rm -rf llvm-project up, because that's where the static libraries are against which triton compiles. I will just merge them back. I split these commands for debugging purposes.

Copy link
Collaborator

@gagank1 gagank1 Jan 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to merging them back. I put it together because the LLVM build cache is huge and must be deleted in the same RUN command to actually reduce size

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep -- makes total sense -- thanks for the feedback.
Merged the two commands again. @gagank1 can you re-review?

@trvachov trvachov force-pushed the trvachov/25.01-arm-docker-build branch from 1bc2f6d to 597c408 Compare January 8, 2025 23:47
@trvachov
Copy link
Collaborator Author

trvachov commented Jan 8, 2025

/build-ci

Copy link
Collaborator

@pstjohn pstjohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure LGTM if the tests pass 😅 , but tbh I'm not 100% clear on what's going on here with the ffmpeg stuff

@trvachov trvachov force-pushed the trvachov/25.01-arm-docker-build branch from 597c408 to b11b2fc Compare January 9, 2025 15:55
@trvachov trvachov requested a review from ohadmo as a code owner January 9, 2025 15:55
@trvachov
Copy link
Collaborator Author

trvachov commented Jan 9, 2025

/build-ci

@trvachov trvachov requested a review from gagank1 January 9, 2025 19:44
rm -rf decord
apt-get install -y ffmpeg libavcodec-dev libavfilter-dev libavformat-dev libavutil-dev
# && cp /usr/lib/aarch64-linux-gnu/libnvcuvid* /usr/local/cuda/
RUN --mount=type=bind,source=./arm_build/decord_ffmpeg6_fix.patch,target=/decord_ffmpeg6_fix.patch \
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for my own knowledge - what is this for?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decord doesn't build againt the latest ffmpeg. Rather than downgrading ffmpeg (which caused issues for me), someone provided a patch file to make decord compatible with the newest ffmpeg. The link in the comment has a description: dmlc/decord#186 (comment)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting, i wonder why ffmpeg started causing issues now

@trvachov
Copy link
Collaborator Author

trvachov commented Jan 9, 2025

/build-ci

@codecov-commenter
Copy link

codecov-commenter commented Jan 10, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 87.73%. Comparing base (43d2ca3) to head (335758f).

✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #581   +/-   ##
=======================================
  Coverage   87.73%   87.73%           
=======================================
  Files          89       89           
  Lines        5758     5758           
=======================================
  Hits         5052     5052           
  Misses        706      706           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@trvachov trvachov merged commit d36c18e into main Jan 10, 2025
7 of 8 checks passed
@trvachov trvachov deleted the trvachov/25.01-arm-docker-build branch January 10, 2025 04:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants