
[auto_scheduler] Part.1 metal default hardware params #7022

Merged
1 commit merged into apache:main on Dec 3, 2020

Conversation

antinucleon
Contributor

This is the first in a series of PRs to enable the auto-scheduler on Metal desktop GPUs (AMD / M1).

To fully enable auto-scheduler on Metal, the following PRs are required:

  • Default hardware params
  • Register buffers in measure
  • Specialized local runner/builder without dependency on Python multiprocessing.
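As a rough illustration of the first item, the kind of per-device constants involved looks like the following. This is a sketch only: the field names mirror `tvm.auto_scheduler.HardwareParams`, but the concrete values here are illustrative assumptions, not the actual Metal defaults this PR registers.

```python
# Illustrative sketch: field names follow tvm.auto_scheduler.HardwareParams,
# but these values are assumptions, not this PR's actual defaults.
metal_default_hardware_params = {
    "max_shared_memory_per_block": 32 * 1024,  # bytes of threadgroup memory
    "max_threads_per_block": 1024,             # threads per threadgroup
    "warp_size": 32,                           # SIMD-group width on Apple GPUs
    "max_vthread_extent": 4,                   # virtual-thread split limit
}
```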

@FrozenGene
Member

@antinucleon I heard from @merrymercy that you are trying the auto-scheduler (CPU / GPU) on the new Apple Silicon Mac. For the CPU, I heard LLVM cannot recognize the target parameter. I tried a little, referring to https://reviews.llvm.org/D82699: we could set -mtriple=arm64-apple-macos -mcpu=apple-a12 using the latest LLVM from git (clang version 12.0.0, https://github.com/llvm/llvm-project.git 61a06c071dd16a9725d3b7bfac806520dc1b95aa). Would you try this a bit? Maybe we could get a performance improvement if we pass the correct target to LLVM. Thanks!
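For reference, a minimal sketch of composing those flags into a TVM "llvm" target string. The helper name is hypothetical; TVM's llvm target passes -mtriple / -mcpu through to LLVM, but verify the option spelling against your TVM version.

```python
def apple_silicon_target(mtriple="arm64-apple-macos", mcpu="apple-a12"):
    # Compose a TVM "llvm" target string carrying the triple/CPU
    # suggested in https://reviews.llvm.org/D82699.
    return f"llvm -mtriple={mtriple} -mcpu={mcpu}"

print(apple_silicon_target())
# llvm -mtriple=arm64-apple-macos -mcpu=apple-a12
```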

@antinucleon
Contributor Author

antinucleon commented Dec 3, 2020

@FrozenGene I tried; it is tricky, lol. I need to cross-compile an ARM64 LLVM. I did it once, but it seems the result was still an x86-64 binary, which then caused ARM conda to segfault. If I don't use conda, the tricky part is SciPy, which requires gcc. I probably need to cross-compile an ARM gcc, then compile LAPACK + SciPy. Or I can eliminate the SciPy dependency in TVM (which I have done in one version).

Last time I didn't have enough time to figure out how to correctly cross-compile ARM64 LLVM; maybe I can try again this week.

A few tips for people who want to compile LLVM on a Mac mini directly:

  • Use make -j2 at the beginning
  • When make is killed because of OOM, reboot
  • Run make -j2 again and it will succeed :D

@FrozenGene
Member


When you built it on the Apple Silicon machine, could LLVM not recognize the correct host target, so it still produced an x86 binary? For the gcc part, could we build LLVM together with Clang, then make a symlink gcc pointing to clang, like Apple does on the Mac?

@antinucleon
Contributor Author


> When you built it on the Apple Silicon machine, could LLVM not recognize the correct host target, so it still produced an x86 binary?

I need to double-check. I was hit by a segfault but didn't dive into it.

> For the gcc part, could we build LLVM together with Clang, then make a symlink gcc pointing to clang, like Apple does on the Mac?

There is a gcc port, which conda is using; we may use that.

My guess is: we don't know whether the M1 uses some new vector / FMA instructions. If so, we can only get good performance once Apple contributes its support back to upstream LLVM. But I hope we will have a better sense this week.

@FrozenGene
Member


Right, that was the point of my previous reply. Referring to https://reviews.llvm.org/D82699, we could set -mtriple=arm64-apple-macos -mcpu=apple-a12. I guess this could be better; for example, we could just tune one matmul. However, your guess could also be right.

@antinucleon
Contributor Author

antinucleon commented Dec 3, 2020


FYI: for a 2k-by-2k matmul, Apple's Accelerate library is able to achieve around 600 GFLOPS, which is 1/3 of the peak FLOPS of the M1 GPU. Running Ansor + LLVM 11 with the -mcpu=apple-latest -target=arm64-apple-darwin20.1.0 flags gets approximately 250 GFLOPS, so roughly 1/2 the speed of Apple Accelerate.
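For context, the GFLOPS figures above follow from the standard 2·N³ FLOP count for an N×N matmul divided by wall-clock time. A small sketch of the arithmetic; the timings below are back-solved from the quoted numbers, not measurements:

```python
def matmul_gflops(n, seconds):
    # An n x n x n matmul performs 2 * n**3 floating-point operations
    # (one multiply and one add per inner-loop step).
    return 2 * n**3 / seconds / 1e9

# Back-solving: ~600 GFLOPS at n=2048 implies ~28.6 ms per matmul,
# and ~250 GFLOPS implies ~68.7 ms.
print(round(matmul_gflops(2048, 0.0286)))  # ~601
print(round(matmul_gflops(2048, 0.0687)))  # ~250
```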

@FrozenGene
Member


@antinucleon Could you use lib.save("a.ll") to inspect the attributes? I find that when I use arm64-apple-darwin20.1.0, the LLVM IR attributes differ from arm64-apple-macos when using the latest LLVM built from git source.
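A small sketch of what one might grep for in the saved a.ll. The helper and the sample line are hypothetical, but "target-cpu" / "target-features" are real function attributes that LLVM attaches, and they are where output for the two triples tends to differ:

```python
import re

def llvm_target_attrs(ll_text):
    """Extract target-cpu / target-features attribute values from LLVM IR text."""
    return dict(re.findall(r'"(target-cpu|target-features)"="([^"]*)"', ll_text))

# Hypothetical attribute line of the kind found in a saved a.ll:
sample = 'attributes #0 = { "target-cpu"="apple-a12" "target-features"="+fp-armv8,+neon" }'
print(llvm_target_attrs(sample))
# {'target-cpu': 'apple-a12', 'target-features': '+fp-armv8,+neon'}
```

Diffing these two values between the arm64-apple-darwin20.1.0 and arm64-apple-macos builds would show exactly which CPU features each triple enables.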

@merrymercy merrymercy merged commit 965a67e into apache:main Dec 3, 2020
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 3, 2020
trevor-m pushed a commit to trevor-m/tvm that referenced this pull request Dec 4, 2020
trevor-m pushed a commit to neo-ai/tvm that referenced this pull request Dec 4, 2020
@antinucleon antinucleon deleted the metal branch December 7, 2020 11:15