[Feature Request] Dawn C++ WebGPU backend #837

Closed
loretoparisi opened this issue Apr 7, 2023 · 10 comments
Labels
enhancement (New feature or request), performance (Speed related topics), stale

Comments

@loretoparisi

Today Chrome released WebGPU support in Chrome Beta.
Google's Dawn project is a standalone C++ implementation of WebGPU. It enables WebGPU support in other libraries; for example, there are WIP Node.js bindings to Dawn that would, in theory, enable WebGPU in Node.
So it should be possible to add Dawn as a GPU backend for the Llama/GGML C++ math operations.

@LiliumSancta

Here is an implementation of Stable Diffusion using WebGPU in Chrome, not Dawn. I found it interesting, but I don't know if it's useful for llama.cpp: https://github.com/mlc-ai/web-stable-diffusion

@jon-chuang
Contributor

jon-chuang commented Apr 11, 2023

Agreed that Chrome makes more sense. If you want to run on a GPU locally, you should just run PyTorch. The whole point of llama.cpp is that it has no dependencies.

I think that running in the user's browser is a very interesting idea, but in practice it may be slow. Also, the WebGPU API is constrained compared with CUDA, so I wonder whether you would get good performance.

@kadogo

kadogo commented Apr 17, 2023

Hello,

I ran a test with their implementation, https://github.com/mlc-ai/web-llm, and my impression is that it is maybe a little slower than llama.cpp, but I only have an Iris Xe.

What is interesting is that it recognizes my Intel card; I don't think that is easy to do with stock PyTorch.

It would be interesting to see whether llama.cpp could use the GPU and CPU together. Even as an option, it would be nice if it gained a few tokens per second.

@loretoparisi
Author

I have tested it locally as well. It works pretty fast with a 4 GB, 4-bit quantized Vicuna 7B model. web-llm uses Apache TVM Unity, based on the IRTensor (IRModule), and the SentencePiece tokenizer is compiled to WASM with Emscripten. This will natively support WebGPU on different devices, but it is technologically challenging; keep in mind that web-llm comes from developers involved in TVM Unity development. Many stack layers and components are involved, and it is far from being as simple as the GGML idea was.
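
As a rough sanity check on the "4 GB, 4-bit, 7B" figure, the raw weight storage can be estimated from the parameter count and the bits per weight. This is only a back-of-the-envelope sketch: the function name is hypothetical, the parameter count is rounded to 7 billion, and real model files are larger than this estimate because they also carry quantization scales, metadata, and some unquantized tensors.

```python
def quantized_weight_bytes(n_params: int, bits_per_weight: float) -> int:
    """Raw storage for the quantized weights only, in bytes.

    Ignores per-block scales, metadata, and tensors kept at higher
    precision, so an actual model file will be somewhat larger.
    """
    return int(n_params * bits_per_weight / 8)


GB = 10**9  # decimal gigabytes

# Assumption: "7B" is rounded to exactly 7 billion parameters.
raw = quantized_weight_bytes(7_000_000_000, 4)
print(f"{raw / GB:.1f} GB")  # 3.5 GB of raw 4-bit weights
```

The estimate lands at about 3.5 GB before overhead, which is consistent with a file of roughly 4 GB.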

@loretoparisi
Author

It's worth noting this naive GPT implementation in vanilla JavaScript that supports WebGPU:
https://github.com/0hq/WebGPT

@sw added the enhancement (New feature or request) and performance (Speed related topics) labels Apr 23, 2023
@sw
Contributor

sw commented Apr 23, 2023

Llama.cpp specifically targets the CPU, so it's unlikely such a dependency will be added, but see the discussion in #915.

@audiovention

I've done a small first step towards that:
ggerganov/ggml#585

@ei-grad

ei-grad commented Oct 30, 2023

Would WebGPU solve the 32-bit memory issue (#97), since most of the layers and computations would move to GPU memory?

@github-actions bot added the stale label Mar 25, 2024
Contributor

This issue was closed because it has been inactive for 14 days since being marked as stale.

@loretoparisi
Author

@ggerganov Hello. Thanks to your GGUF release of Llama-3.2-1B-Instruct-Q4_K_M-GGUF, which is just 800 MB and can easily be sharded into a few chunks, there is no need for WASM64. Could it be worth attempting to load it with WASM SIMD in the browser? Most browsers now in fact support SIMD and threads.
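
To make the sharding idea concrete, here is a minimal sketch of how an 800 MB file could be split into fixed-size byte ranges and checked against the 4 GiB ceiling of a 32-bit WebAssembly linear memory. The helper names, the 256 MiB shard size, and the headroom parameter are all assumptions chosen for illustration; nothing here is taken from llama.cpp or from an existing loader.

```python
WASM32_MAX_BYTES = 2**32  # upper bound of a 32-bit wasm linear memory (4 GiB)


def plan_shards(total_bytes: int, shard_bytes: int) -> list[tuple[int, int]]:
    """Split a file into (offset, length) byte ranges of at most shard_bytes."""
    if total_bytes < 0 or shard_bytes <= 0:
        raise ValueError("total_bytes must be >= 0 and shard_bytes must be > 0")
    shards = []
    offset = 0
    while offset < total_bytes:
        length = min(shard_bytes, total_bytes - offset)
        shards.append((offset, length))
        offset += length
    return shards


def fits_wasm32(total_bytes: int, headroom_bytes: int = 0) -> bool:
    """True if the model plus working memory stays within 32-bit addressing."""
    return total_bytes + headroom_bytes <= WASM32_MAX_BYTES


# Assumption: an 800 MB model file, fetched in 256 MiB shards.
model_bytes = 800_000_000
shards = plan_shards(model_bytes, 256 * 1024 * 1024)
print(len(shards), fits_wasm32(model_bytes))
```

Each range could then be fetched separately (for example with HTTP range requests) and copied into memory in order. How much headroom the KV cache and scratch buffers need is not estimated here.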
