WebAssembly and emscripten headers #97
Without https://github.com/WebAssembly/memory64 implemented in WebAssembly, you are going to run into show-stopping memory issues with the current 4GB limit due to 32-bit addressing. Do you have a plan for getting around that? |
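For scale, the 4GB ceiling is simply the size of a 32-bit address space, 2^32 bytes. The following is a minimal sketch of that arithmetic, assuming a nominal 7 billion parameters and counting only the weights (no KV cache, activations, or quantization metadata). The helper names are made up for illustration and are not part of llama.cpp or Emscripten.

```python
# Rough check of weight size against a 32-bit WebAssembly address space.
# Assumption: parameter counts are nominal and only weights are counted.

WASM32_LIMIT_BYTES = 2 ** 32  # 4 GiB addressable with 32-bit pointers


def weights_bytes(n_params: int, bits_per_weight: float) -> int:
    """Approximate storage for the weights alone, in bytes."""
    return int(n_params * bits_per_weight / 8)


def fits_in_wasm32(n_params: int, bits_per_weight: float) -> bool:
    """True if the weights alone are smaller than the 32-bit address space."""
    return weights_bytes(n_params, bits_per_weight) < WASM32_LIMIT_BYTES


if __name__ == "__main__":
    seven_b = 7_000_000_000  # nominal "7B"
    for bits in (16, 4, 3):
        size = weights_bytes(seven_b, bits)
        print(f"7B @ {bits}-bit: {size / 1e9:.2f} GB, fits: {fits_in_wasm32(seven_b, bits)}")
```

Even when the weights fit, the remaining headroom has to cover everything else the runtime allocates, so this is an upper bound on what is feasible rather than a guarantee.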
If you quantized the 7B model to a mixture of 3-bit and 4-bit quantization using https://github.com/qwopqwop200/GPTQ-for-LLaMa then you could stay within that memory envelope. |
I think that's a reasonable proposal @Dicklesworthstone. A purely 3-bit implementation of llama.cpp using GPTQ could retain acceptable performance and solve the same memory issues. There's an open issue for implementing GPTQ quantization in 3-bit and 4-bit: GPTQ Quantization (3-bit and 4-bit) #9. Other use cases could benefit from the same enhancement, such as getting 65B under 32GB and 30B under 16GB, to further extend access to (perhaps slightly weaker versions of) the larger models. |
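The 65B-under-32GB and 30B-under-16GB targets can be sanity-checked with the same kind of back-of-the-envelope arithmetic. This is only a sketch: it assumes nominal parameter counts, decimal gigabytes, and zero overhead for scales or other quantization metadata, so real files would be somewhat larger. The function name is hypothetical.

```python
# Estimated weight size at a given bit width, compared with a memory budget.
# Assumptions: nominal parameter counts, decimal GB, no metadata overhead.

GB = 10 ** 9


def quantized_gb(n_params: int, bits: int) -> float:
    """Approximate weight size in decimal gigabytes."""
    return n_params * bits / 8 / GB


if __name__ == "__main__":
    cases = [("30B", 30_000_000_000, 16), ("65B", 65_000_000_000, 32)]
    for name, n_params, budget_gb in cases:
        for bits in (3, 4):
            size = quantized_gb(n_params, bits)
            verdict = "fits" if size < budget_gb else "does not fit"
            print(f"{name} @ {bits}-bit: {size:.2f} GB, {verdict} in {budget_gb} GB")
```

Under these assumptions 3-bit leaves clear headroom for both sizes, while 4-bit is borderline for the nominal 65B case.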
https://twitter.com/nJoyneer/status/1637863946383155220 I was able to run Following the Emscripten version used:
Following the compile flags:
Following the minimal patch:
|
The |
So given the limits of WASM 64, I think you have to go for 3- and 4-bit quantization using GPTQ. |
It's already quantized to 4 bits when converting. |
@thypon apparently memory64 is available in Firefox Nightly; did you check it? |
The new RedPajama-3B seems like a nice tiny model that could probably fit without memory64. |
@thypon @loretoparisi I'm curious, what sort of performance drop did you notice running in browser from running natively? How many toks/sec were you getting? |
@IsaacRe I did not make a performance comparison, since it was not 100% stable and needed to be refined. As mentioned, it was single-core, since multithreading and memory64 were not working properly together on Firefox Nightly and were crashing the experiment. @okpatil4u It is already running with experimental memory64. |
Hey @thypon, did you make any progress on this experiment? |
I'm not actively working on this at the current stage. |
@okpatil4u |
I've tried the approach suggested by @lukestanley and @loretoparisi and got starcoder.cpp to run in the browser. I tried the tiny_starcoder_py model, since its weights are small enough to fit without mem64, to see the performance and accuracy. The output of the model without mem64 seems to be gibberish, while the mem64 version produces meaningful output. I'm not sure whether 32-bit vs 64-bit memory addressing has anything to do with it. |
How about WebGPU? Probably better to run it off-CPU where possible anyhow? (full disclosure: I have no idea what I'm talking about.) |
The implementation status is complete for emscripten: |
This issue was closed because it has been inactive for 14 days since being marked as stale. |
I'm not sure what the progress is here; apparently there are overlapping or related open issues. |
There is this project that might be relevant: https://github.com/ngxson/wllama |
@ggerganov Thanks for sharing that. I'm already using https://github.com/tangledgroup/llama-cpp-wasm as the basis of a big project. So far llama-cpp-wasm has allowed me to run pretty much any .gguf that is less than 2GB in size in the browser (and that limitation seems to be related to the caching mechanism of that project, so I suspect the real limit would be 4GB). People talk about bringing AI to the masses, but the best way to do that is with browser-based technology. My mom is never going to install Ollama and the like. |
But would your mom be OK with her browser downloading gigabytes of weights upon page visit? To me this seems like the biggest obstacle for browser-based LLM inference. |
Agreed. The best example so far is the web version of MLC LLM: you can see that it downloads about 4GB in roughly 20 shards for the Llama-2 7B weights, 4-bit quantized. Of course this means you may wait from tens of seconds to a few minutes before inference starts. And this is not going to change soon unless quantization at 3 or 2 bits improves to the point where accuracy is as good as 4-bit... For example, if we take Llama-2, 8B we have 108 shards, and it took 114 seconds to complete on my fiber connection: on Mac M1 Pro I get
|
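To put the 4GB, roughly 20-shard download in perspective, here is a small sketch of the arithmetic. The bandwidth figures are hypothetical examples rather than measurements from this thread, and the estimate ignores latency, caching, and per-request overhead.

```python
# Back-of-the-envelope download estimates for sharded model weights.
# Assumptions: decimal units, sustained bandwidth, no protocol overhead.


def shard_size_bytes(total_bytes: int, n_shards: int) -> float:
    """Average shard size if the weights are split evenly."""
    return total_bytes / n_shards


def download_seconds(total_bytes: int, mbit_per_s: float) -> float:
    """Time to transfer total_bytes at a sustained rate in megabits per second."""
    return total_bytes * 8 / (mbit_per_s * 1_000_000)


if __name__ == "__main__":
    total = 4_000_000_000  # about 4 GB of 4-bit 7B weights, as described above
    print(f"average shard: {shard_size_bytes(total, 20) / 1e6:.0f} MB")
    for rate in (50, 100, 1000):  # hypothetical connection speeds in Mbit/s
        print(f"{rate} Mbit/s: {download_seconds(total, rate):.0f} s")
```

This lines up with the "tens of seconds to a few minutes" range quoted above for a first visit. A cached revisit would skip most of it.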
Hugging Face has recently released a streaming option for GGUF, where you can already start inference even though the model is not fully loaded yet. At least, that's my understanding from a recent YouTube video by Yannic Kilcher. For my project I'm trying to use a less than 2GB quant of Phi 2 with 128K context. I think that model will become the best model for browser-based use for a while. |
You may be thinking of a library that Hugging Face released that can read GGUF metadata without downloading the whole file. You wouldn't gain much from streaming the model for inference; generally the entire model is necessary to generate every token. |
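A toy illustration of why a partially downloaded model cannot usefully generate tokens: in a sequential stack of layers, every forward pass walks through all of them. This is a deliberately simplified sketch with made-up scalar "layers", not llama.cpp's actual evaluation code.

```python
# Toy model: each "layer" is a function on a single number, and a forward
# pass applies every layer in order. A missing layer blocks the whole pass.
from typing import Callable, List, Optional

Layer = Callable[[float], float]


def forward(layers: List[Optional[Layer]], x: float) -> float:
    """Apply every layer in order; fail if any layer has not arrived yet."""
    for i, layer in enumerate(layers):
        if layer is None:
            raise RuntimeError(f"layer {i} not downloaded yet")
        x = layer(x)
    return x


# Hypothetical three-layer model and a copy whose last layer is still missing.
full_model: List[Optional[Layer]] = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
partial_model: List[Optional[Layer]] = full_model[:2] + [None]

if __name__ == "__main__":
    print(forward(full_model, 1.0))
    try:
        forward(partial_model, 1.0)
    except RuntimeError as err:
        print("cannot produce a token:", err)
```

Since each new token repeats the full pass, having the first layers early does not let generation begin before the last layer arrives.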
@slaren Ah, thanks for clarifying that. It sounded a little too good to be true :-) |
Hello, I have tried adding minimal Emscripten support to the Makefile by adding
It compiles OK with both em++ and emcc. At this stage the problem is that main.cpp and quantize.cpp do not expose a proper header file, and I cannot call main as a module, or as a function export using Emscripten's EMSCRIPTEN_KEEPALIVE on main, for example.
In fact, a simple C++ header could be compiled as a Node module and then called like
and then executed in Node scripts like