
[auto_scheduler] Part.1 metal default hardware params #7022

Merged
1 commit merged into apache:main on Dec 3, 2020

Conversation

antinucleon
Contributor

This is the first in a series of PRs to enable the auto-scheduler on Metal desktop GPUs (AMD / M1).

To fully enable auto-scheduler on Metal, the following PRs are required:

  • Default hardware params
  • Register buffers in measure
  • Specialized local runner/builder without dependency on Python multiprocessing.
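As a rough illustration of the first item, the kind of per-device constants involved looks like the following. This is a sketch only: the field names mirror `tvm.auto_scheduler.HardwareParams`, but the concrete values here are illustrative assumptions, not the actual Metal defaults this PR registers.

```python
# Illustrative sketch: field names follow tvm.auto_scheduler.HardwareParams,
# but these values are assumptions, not this PR's actual defaults.
metal_default_hardware_params = {
    "max_shared_memory_per_block": 32 * 1024,  # bytes of threadgroup memory
    "max_threads_per_block": 1024,             # threads per threadgroup
    "warp_size": 32,                           # SIMD-group width on Apple GPUs
    "max_vthread_extent": 4,                   # virtual-thread split limit
}
```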

@FrozenGene
Member

@antinucleon I heard from @merrymercy that you are trying the auto-scheduler (CPU / GPU) on the new Apple Silicon Mac. For the CPU, I heard LLVM cannot recognize the target parameter. I tried a little, referring to https://reviews.llvm.org/D82699: we could set -mtriple=arm64-apple-macos -mcpu=apple-a12 using the latest LLVM from git (clang version 12.0.0, https://github.com/llvm/llvm-project.git 61a06c071dd16a9725d3b7bfac806520dc1b95aa). Would you try this a bit? Maybe we could get a performance improvement if we pass the correct target to LLVM. Thanks!
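For reference, a minimal sketch of composing those flags into a TVM "llvm" target string. The helper name is hypothetical; TVM's llvm target passes -mtriple / -mcpu through to LLVM, but verify the option spelling against your TVM version.

```python
def apple_silicon_target(mtriple="arm64-apple-macos", mcpu="apple-a12"):
    # Compose a TVM "llvm" target string carrying the triple/CPU
    # suggested in https://reviews.llvm.org/D82699.
    return f"llvm -mtriple={mtriple} -mcpu={mcpu}"

print(apple_silicon_target())
# llvm -mtriple=arm64-apple-macos -mcpu=apple-a12
```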

@antinucleon
Contributor Author

antinucleon commented Dec 3, 2020

@FrozenGene I tried; it is tricky, lol. I need to cross-compile an ARM64 LLVM. I did it once, but it seems the result was still an x86-64 binary, which then caused ARM conda to segfault. If I don't use conda, the tricky part is SciPy, which requires gcc. I probably need to cross-compile an ARM gcc, then compile LAPACK + SciPy. Or I can eliminate the SciPy dependency in TVM (which I have done in one version).

Last time I didn't have enough time to figure out how to correctly cross-compile ARM64 LLVM; maybe I can try again this week.

A few tips for people who want to compile LLVM on a Mac mini directly:

  • Use make -j2 at the beginning
  • When make is killed because of OOM, reboot
  • Run make -j2 again and it will succeed :D

@FrozenGene
Member


When you built it on the Apple Silicon machine, could LLVM not recognize the correct host target, so it still produced an x86 binary? For the gcc part, could we build LLVM together with Clang, then make a symlink gcc pointing to clang, like Apple does on the Mac?

@antinucleon
Contributor Author


> When you built it on the Apple Silicon machine, could LLVM not recognize the correct host target, so it still produced an x86 binary?

I need to double-check. I was hit by a segfault but didn't dive into it.

> For the gcc part, could we build LLVM together with Clang, then make a symlink gcc pointing to clang, like Apple does on the Mac?

There is a gcc port, which conda is using; we may use that.

My guess is: we don't know whether the M1 uses some new vector / FMA instructions. If so, we can only get good performance once Apple contributes its support back to upstream LLVM. But I hope we will have a better sense this week.

@FrozenGene
Member


Right, that was the point of my previous reply. Referring to https://reviews.llvm.org/D82699, we could set -mtriple=arm64-apple-macos -mcpu=apple-a12. I guess this could be better; for example, we could just tune one matmul. However, your guess could also be right.

@antinucleon
Contributor Author

antinucleon commented Dec 3, 2020


FYI: for a 2k-by-2k matmul, Apple's Accelerate library is able to achieve around 600 GFLOPS, which is 1/3 of the peak FLOPS of the M1 GPU. Running Ansor + LLVM 11 with the -mcpu=apple-latest -target=arm64-apple-darwin20.1.0 flags gets approximately 250 GFLOPS, so roughly 1/2 the speed of Apple Accelerate.
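For context, the GFLOPS figures above follow from the standard 2·N³ FLOP count for an N×N matmul divided by wall-clock time. A small sketch of the arithmetic; the timings below are back-solved from the quoted numbers, not measurements:

```python
def matmul_gflops(n, seconds):
    # An n x n x n matmul performs 2 * n**3 floating-point operations
    # (one multiply and one add per inner-loop step).
    return 2 * n**3 / seconds / 1e9

# Back-solving: ~600 GFLOPS at n=2048 implies ~28.6 ms per matmul,
# and ~250 GFLOPS implies ~68.7 ms.
print(round(matmul_gflops(2048, 0.0286)))  # ~601
print(round(matmul_gflops(2048, 0.0687)))  # ~250
```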

@FrozenGene
Member


@antinucleon Could you use lib.save("a.ll") to inspect the attributes? I find that when I use arm64-apple-darwin20.1.0, the LLVM IR attributes differ from arm64-apple-macos when using the latest LLVM built from git source.
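A small sketch of what one might grep for in the saved a.ll. The helper and the sample line are hypothetical, but "target-cpu" / "target-features" are real function attributes that LLVM attaches, and they are where output for the two triples tends to differ:

```python
import re

def llvm_target_attrs(ll_text):
    """Extract target-cpu / target-features attribute values from LLVM IR text."""
    return dict(re.findall(r'"(target-cpu|target-features)"="([^"]*)"', ll_text))

# Hypothetical attribute line of the kind found in a saved a.ll:
sample = 'attributes #0 = { "target-cpu"="apple-a12" "target-features"="+fp-armv8,+neon" }'
print(llvm_target_attrs(sample))
# {'target-cpu': 'apple-a12', 'target-features': '+fp-armv8,+neon'}
```

Diffing these two values between the arm64-apple-darwin20.1.0 and arm64-apple-macos builds would show exactly which CPU features each triple enables.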

@merrymercy merrymercy merged commit 965a67e into apache:main Dec 3, 2020
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 3, 2020
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 4, 2020
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Dec 4, 2020
@antinucleon antinucleon deleted the metal branch December 7, 2020 11:15