Containers for Apple Silicon Macs work with GPU-accelerated Vulkan #8042

AndreasKunar · 2024-06-20T19:16:03Z

I just came across a very interesting post by Sergio López on how to enable GPU acceleration for containers on Apple Silicon macOS (https://sinrega.org/2024-03-06-enabling-containers-gpu-macos/), and I was able to reproduce the acceleration results. Basically, it routes Vulkan API calls out of the container, via the virtual machine monitor, to a Vulkan-to-Metal layer on the host.

With Phi-3 on my M2 Max, I got approx. 78 tokens/s for token generation (TG) natively (-ngl 99), ~63 in a container (-ngl 99), ~34 natively (-ngl 0), and ~20 in a container (-ngl 0). The prompt-processing (PP) numbers look strange, though; this needs investigation and better benchmarking.
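To put the four TG figures above in relation to each other, here is a minimal Python sketch. It only uses the approximate numbers quoted in this post; the dictionary layout and the function names are made up for illustration.

```python
# Approximate token-generation rates (tokens/s) quoted above for Phi-3 on an M2 Max.
# Keys are (environment, value passed to -ngl).
tg = {
    ("native", 99): 78.0,
    ("container", 99): 63.0,
    ("native", 0): 34.0,
    ("container", 0): 20.0,
}


def container_ratio(ngl: int) -> float:
    """Container throughput as a fraction of native throughput."""
    return tg[("container", ngl)] / tg[("native", ngl)]


def gpu_speedup(env: str) -> float:
    """Throughput with full GPU offload (-ngl 99) relative to CPU-only (-ngl 0)."""
    return tg[(env, 99)] / tg[(env, 0)]


for ngl in (99, 0):
    print(f"-ngl {ngl}: container reaches {container_ratio(ngl):.0%} of native")
for env in ("native", "container"):
    print(f"{env}: GPU offload is {gpu_speedup(env):.2f}x the CPU-only rate")
```

With these rough inputs, the container keeps about four fifths of native TG speed with GPU offload, but only about three fifths on CPU only.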

I wrote a quick-and-dirty Medium article about the details (https://medium.com/@andreask_75652/gpu-accelerated-containers-for-m1-m2-m3-macs-237556e5fe0b), but I need to analyze this further once I have more time. I also plan to produce real benchmark numbers comparable to #4167.

To me, the lack of GPU-accelerated containers has always been a major drawback of Macs. This might be a way to get easy, safe, and fast installation on Macs as well.

Ideas/Feedback very welcome.

AndreasKunar · 2024-06-24T14:45:24Z

Update: Standard #4167 benchmark results for an M2 Max (38 GPU cores, 96 GB RAM) running macOS, with a Fedora container in Podman 4.9 and 8 CPUs + 32 GB allocated to the VM. This DOES support GPU acceleration via the Vulkan driver. It is faster than pure virtualization (8 CPUs, see below), but much slower than native macOS (see #4167). I am not sure what caused the extremely slow Vulkan F16 TG result. PP is supposed to depend largely on compute performance, and TG largely on memory bandwidth (with a bit of compute for quantization).

Vulkan0: Virtio-GPU Venus (Apple M2 Max) | uma: 1 | fp16: 1 | warp size: 32

model	size	params	backend	ngl	test	t/s
llama 7B F16	12.55 GiB	6.74 B	Vulkan	99	pp 512	80.40 ± 0.43
llama 7B F16	12.55 GiB	6.74 B	Vulkan	99	tg 128	2.02 ± 0.00
llama 7B Q8_0	6.67 GiB	6.74 B	Vulkan	99	pp 512	79.79 ± 0.07
llama 7B Q8_0	6.67 GiB	6.74 B	Vulkan	99	tg 128	19.79 ± 0.01
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	pp 512	79.94 ± 0.13
llama 7B Q4_0	3.56 GiB	6.74 B	Vulkan	99	tg 128	24.07 ± 0.01
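The memory-bandwidth remark above can be turned into a rough sanity check: if every generated token has to read all model weights once, TG cannot exceed bandwidth divided by model size. The sketch below assumes a peak of about 400 GB/s for the M2 Max (Apple's published figure; sustained bandwidth is lower, and even less may be reachable from inside a VM). The function name is made up for illustration, and this is a back-of-the-envelope bound, not a model of what llama.cpp actually does.

```python
GIB = 2**30  # bytes per GiB

# Assumption: ~400 GB/s peak unified-memory bandwidth on the M2 Max.
PEAK_BANDWIDTH_BYTES_PER_S = 400e9

# (model size in GiB, measured tg 128 in tokens/s) from the Vulkan table above.
models = {
    "F16": (12.55, 2.02),
    "Q8_0": (6.67, 19.79),
    "Q4_0": (3.56, 24.07),
}


def tg_upper_bound(size_gib: float, bandwidth: float = PEAK_BANDWIDTH_BYTES_PER_S) -> float:
    """Tokens/s if each token requires streaming all weights from memory once."""
    return bandwidth / (size_gib * GIB)


for name, (size_gib, measured) in models.items():
    bound = tg_upper_bound(size_gib)
    print(f"{name}: bound ~{bound:.0f} t/s, measured {measured} t/s "
          f"({measured / bound:.0%} of the bound)")
```

Under that assumption, the F16 result sits far below its bound compared with Q8_0 and Q4_0, which is consistent with it being an outlier rather than a bandwidth limit.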

For comparison: the same M2 Max (38 GPU cores, 96 GB RAM) running macOS, with Ubuntu 24.04 in Parallels 19.4.0 and 8 CPUs + 32 GB allocated to the VM (pure CPU execution).

model	size	params	backend	threads	test	t/s
llama 7B F16	12.55 GiB	6.74 B	CPU	8	pp512	26.18 ± 0.06
llama 7B F16	12.55 GiB	6.74 B	CPU	8	tg128	8.19 ± 0.09
llama 7B Q8_0	6.67 GiB	6.74 B	CPU	8	pp512	52.83 ± 0.36
llama 7B Q8_0	6.67 GiB	6.74 B	CPU	8	tg128	13.79 ± 0.49
llama 7B Q4_0	3.56 GiB	6.74 B	CPU	8	pp512	49.04 ± 0.06
llama 7B Q4_0	3.56 GiB	6.74 B	CPU	8	tg128	22.86 ± 0.16
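A small Python sketch of the ratios between the two tables (Vulkan in the Podman container vs. CPU in the Parallels VM). The values are copied from the tables; the names are made up for illustration. Note that the two setups use different VMMs, so this is only an indicative comparison.

```python
# tokens/s copied from the two tables above, as (pp512, tg128) per quantization.
vulkan_container = {
    "F16": (80.40, 2.02),
    "Q8_0": (79.79, 19.79),
    "Q4_0": (79.94, 24.07),
}
cpu_vm = {
    "F16": (26.18, 8.19),
    "Q8_0": (52.83, 13.79),
    "Q4_0": (49.04, 22.86),
}


def speedups(quant: str) -> tuple[float, float]:
    """Return the (pp, tg) ratio of Vulkan-in-container over CPU-in-VM."""
    v_pp, v_tg = vulkan_container[quant]
    c_pp, c_tg = cpu_vm[quant]
    return v_pp / c_pp, v_tg / c_tg


for quant in vulkan_container:
    pp, tg = speedups(quant)
    print(f"{quant}: pp {pp:.2f}x, tg {tg:.2f}x")
```

A ratio below 1 means the CPU-only VM was faster, which is the case for F16 token generation.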

1 reply

AndreasKunar Jul 14, 2024
Author

Update Jul 14, 2024 - CPU with new Q4_0_4_4 quantization on Apple is faster than Vulkan-to-Metal from VM/Container, and much easier to implement.

The changes in #5780 enable very fast 4-bit quantized CPU inference in llama.cpp on ARM CPUs. Here are the results from within a VM (which also apply to containers). Unlike the results above, these are from an M2 (4 cores only); I still need to run it on the M2 Max.

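Prompt-processing (pp512) speed-up over plain Q4_0 on CPU, as computed from the tables:

Vulkan (M2 Max, container vs. VM, see above): 1.6x
Q4_0_4_4 vs Q4_0 on CPU (M2, table below): 2.4x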

model	size	params	backend	threads	test	t/s
llama 7B Q4_0	3.56 GiB	6.74 B	CPU	4	pp512	21.45 ± 0.39
llama 7B Q4_0	3.56 GiB	6.74 B	CPU	4	tg128	12.17 ± 0.15
llama 7B Q4_0_4_4	3.56 GiB	6.74 B	CPU	4	pp512	50.63 ± 1.46
llama 7B Q4_0_4_4	3.56 GiB	6.74 B	CPU	4	tg128	14.01 ± 1.17
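The Q4_0_4_4 gain can be recomputed from the table above with a few lines of Python. The values are copied from the table; the names are made up for illustration.

```python
# tokens/s on the M2 (4 threads, inside a VM), copied from the table above.
q4_0 = {"pp512": 21.45, "tg128": 12.17}
q4_0_4_4 = {"pp512": 50.63, "tg128": 14.01}


def speedup(test: str) -> float:
    """Ratio of Q4_0_4_4 throughput to Q4_0 throughput for one test."""
    return q4_0_4_4[test] / q4_0[test]


for test in ("pp512", "tg128"):
    print(f"{test}: {speedup(test):.2f}x")
```

The gain is mostly in prompt processing. The token-generation difference is small, and the reported ± 1.17 on the Q4_0_4_4 tg128 result is of a similar size to that difference.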

With this, I'm giving up on Vulkan GPU-acceleration for Mac containers for now.

hybra · 2024-07-09T07:39:49Z

Interesting. I just read your article, but I haven't tried the setup myself yet. Do you have any updates after two weeks? I'm looking for a containerized Ollama solution that could take advantage of Apple Silicon's GPUs.
Does it work consistently or is it buggy? I'm also wondering why it doesn't work with Podman 5 when there are no problems with 4.9.

2 replies

AndreasKunar Jul 9, 2024
Author

Sorry, I could not spend more time on it (I have currently moved on to seeing how llama.cpp can best run on the Snapdragon X).

It seemed to work consistently with 4.9 when strictly following Sergio López's write-up. Apparently Podman changed its architecture a bit with 5.0, and the slp/krunkit VMM no longer seems to work and transport the Vulkan calls from the container to Metal on the host. I also tried to get krunkit to work in a full VM (qemu), but could not. So I have parked it for now.

I still want ARM GPU/NPU acceleration inside containers/VMs, and I think Vulkan could be a great way to get it.

AndreasKunar Jul 14, 2024
Author

@hybra you might look into the CPU acceleration via the new Q4_0_4_4 in pull request #5780. See my benchmark results in VMs (probably identical to containers) in my Jul 14 update above; its roughly 2.4x acceleration is more than I achieved via the Vulkan pass-through. I'm giving up on Vulkan pass-through on Apple for now.

Apple's Virtualization really seems slow. I got a new Snapdragon X machine (the cheapest Surface), and there the speed of native Windows vs. containerized/VM Linux is more or less the same. I have not yet been able to get GPU/NPU acceleration to work on it, but pure CPU looks promising for container-based or "safe" LLM deployment on low-power/low-cost machines.
