Multi-GPU has been broken for me recently. ggml-cuda.cu:7068: invalid argument #3930
Comments
I'm encountering the same issue. Llama 2 70B, 8-bit quantized, 2x A100. Compiled with:
Command:
Fails with:
By contrast, setting -ngl 0 and running entirely on CPU works fine (if slowly).
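For reference, a typical multi-GPU build and invocation of this kind would look roughly like the sketch below. This is purely illustrative and not the exact commands from this report; the model path and tensor split are placeholders.

```bash
# Hypothetical build with cuBLAS enabled (build flag of this era of llama.cpp)
LLAMA_CUBLAS=1 make -j

# Hypothetical run: offload all layers and split the model across two GPUs
./main -m ./models/llama-2-70b.Q8_0.gguf -ngl 99 --tensor-split 1,1 -p "Hello"
```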
I assume it could be related to my changes for CUDA memory pools. Once #3931 is merged, try recompiling with GGML_CUDA_FORCE_CUSTOM_MEMORY_POOL and double-check.
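A minimal sketch of how that define could be passed at build time, assuming the CMake build of this era; injecting it via CMAKE_CUDA_FLAGS is an assumption, not a documented switch.

```bash
# Assumed: inject the define into the CUDA compile flags of a cuBLAS build
cmake -B build -DLLAMA_CUBLAS=ON \
      -DCMAKE_CUDA_FLAGS="-DGGML_CUDA_FORCE_CUSTOM_MEMORY_POOL"
cmake --build build --config Release
```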
@Ph0rk0z Can you bisect at which commit the failure occurs?
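For anyone attempting the bisect, the standard git workflow is sketched below; the commit hash, model path, and test command are placeholders.

```bash
git bisect start
git bisect bad HEAD                 # current commit reproduces the failure
git bisect good <last-known-good>   # a commit from before the regression
# git now checks out a candidate commit; rebuild and test it, then report:
LLAMA_CUBLAS=1 make -j && ./main -m <model.gguf> -ngl 99 -p "test"
git bisect good    # or `git bisect bad`, depending on the outcome
# repeat until git prints the first bad commit, then clean up:
git bisect reset
```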
@ggerganov I am seeing the same error. PS: As per #2470 (comment) I am compiling with
Mine has been broken since #2268. At first it would crash out when loading the model the way setting too high an n_batch does, i.e., trying to allocate massive amounts of system RAM. After the memory pool commits it gives the error above. The memory pool PR does not fix it, but it at least avoids the crash.
For what it's worth, I am seeing this in a fresh build of llama.cpp as well. I am building via the llama_cpp_python package!
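For context, a from-source rebuild of llama-cpp-python with CUDA enabled is usually done like this; a sketch based on the llama-cpp-python docs of that period, so the exact flags may differ by version.

```bash
# Force llama-cpp-python to rebuild its bundled llama.cpp with cuBLAS
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 \
  pip install llama-cpp-python --upgrade --force-reinstall --no-cache-dir
```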
Same error with CUDA 12.3: ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no ... CUDA error 1 at /tmp/pip-install-5bufkrrh/llama-cpp-python_9a816a9490ba42a78dfd85cdba57cabf/vendor/llama.cpp/ggml-cuda.cu:7036: invalid argument
Same error here with 2x T4s using the Python package. It happened when I redeployed my production Kubernetes environment, and I had to quickly downgrade to 1 GPU to get the environment back up. I really need this fixed ASAP, as 1 GPU won't handle peak load very well.
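As a stopgap, the process can be pinned to a single GPU without touching the hardware, e.g. via CUDA_VISIBLE_DEVICES; the server binary and model path below are illustrative.

```bash
# Expose only the first GPU; llama.cpp then runs in single-GPU mode
CUDA_VISIBLE_DEVICES=0 ./server -m ./models/model.Q8_0.gguf -ngl 99
```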
Please test changes from #3931. CUDA pools are optional now.
After reverting the CUDA pool stuff, it appears to be working again.
I am getting the same error; should I install a specific version? CUDA error 1 at /tmp/pip-install-1ypw1658/llama-cpp-python_1c1bc0be5c7249408c254fa56f97252b/vendor/llama.cpp/ggml-cuda.cu:7036: invalid argument
@RachelShalom Try retesting with the latest version.
I installed llama.cpp a few hours ago and got this error. I assume I installed the latest, unless the fix mentioned here is not in a release.
Did you install llama.cpp or llama-cpp-python? I really don't know how quickly llama.cpp changes propagate to llama-cpp-python.
llama-cpp-python, and I am using langchain to load the model. I updated langchain and now I have a new error: CUDA error 222 at /tmp/pip-install-qcfy69x9/llama-cpp-python_d60a2a3fe09943d5b39a16dab77b98a7/vendor/llama.cpp/ggml-cuda.cu:7043: the provided PTX was compiled with an unsupported toolchain. nvcc --version
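CUDA error 222 ("the provided PTX was compiled with an unsupported toolchain") usually means the binary was built with a newer CUDA toolkit than the installed driver supports. A quick way to compare the two; outputs will differ per machine.

```bash
nvcc --version   # CUDA toolkit version used to compile
nvidia-smi       # header shows the highest CUDA version the driver supports
# If the driver's CUDA version is older than the toolkit's, update the driver
# or rebuild against the older toolkit.
```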
I used llama-cpp-python with langchain and got the same error; here is the output:
I have two different cards and they worked well with the compiled llama.cpp, but I got the error when I tried with llama-cpp-python. :(
I'm using llama.cpp Python too, and I just git pull instead of using his cherry-picked revision. Sometimes that's good and sometimes that's bad.
Same issue for me on a 2x A100 80GB PCIe setup with #3586. Running with
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Git llama.cpp with Python bindings.
Expected Behavior
Inference works like before.
Current Behavior
Inference fails and llama.cpp crashes.
Environment and Context
Python 3.10 / CUDA 11.8
Failure Information (for bugs)
Relevant Code
I have some printf's for NVLink, as you can see, so the line numbers are a little off, but here is the snippet that sets it off.
One of the args to cudaMemcpyAsync is invalid; I haven't checked yet which one. The day before, it was trying to allocate 5 TB of system RAM after loading the model, but subsequent commits fixed that up. I waited a little to see if that would happen with this, since the code is so new, and I can't access GitHub from that machine, so I have to bring the logs here.
It happens with both P40s and 3090s, and it is independent of whether I force MMQ or not.
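One way to narrow down which argument of the cudaMemcpyAsync call is invalid, without adding more printf's, is to run the binary under NVIDIA's compute-sanitizer, which reports the failing CUDA API call; the command lines below are a sketch with placeholder model paths.

```bash
# Recent CUDA toolkits ship compute-sanitizer; it flags invalid API arguments
compute-sanitizer --tool memcheck ./main -m <model.gguf> -ngl 99 -p "test"

# Older toolkits use cuda-memcheck instead
cuda-memcheck ./main -m <model.gguf> -ngl 99 -p "test"
```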