feat: Enable GPU acceleration #425
Conversation
I did notice the duplicate in README.md. Will correct that.
That looks great. I have a few questions though.
Ok, that is more than a few questions, but I really do like what you have done.
In your README.md section, it looks like you need a:
That could be done, but it'd also add yet another requirement to the project, which I imagine is not ideal for users who won't be utilising their GPU.
That's a good point, I'll be looking into foolproofing that.
If the user wants to run a GPU environment, running GPT4All is simply pointless, so maybe I should even move the check to the beginning of the file to prevent losing time loading embeddings etc. It's more than a warning: it will make the script fail, to get the user's attention.
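For illustration, a minimal sketch of such a fail-fast check, assuming a hypothetical USE_GPU environment flag (not what the PR actually ships):

```python
# Hypothetical fail-fast check near the top of privateGPT.py: stop before embeddings
# are loaded if the user asked for GPU inference but no CUDA device is visible.
import os
import sys

import torch

USE_GPU = os.environ.get("USE_GPU", "false").lower() == "true"  # illustrative env flag

if USE_GPU and not torch.cuda.is_available():
    sys.exit(
        "USE_GPU is set but PyTorch cannot see a CUDA device. "
        "Fix the CUDA/torch installation (or unset USE_GPU) before continuing."
    )
```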
That is also a good idea, I'm thinking of adding
I don't think there is any other feasible way of accomplishing it, or of calculating and adjusting the GPU layers, without rewriting the
I have tested with: All of the models here that have
It makes almost no difference.
Without CUDA enabled:
This might be down to my cute little GTX 965M, though. I've implemented it in 76f042a regardless. Further testing is welcome.
I avoided adding that because people who are running on low resources would be affected. On low-end computers like mine (with an i7-6700HQ) the device can get nearly unusable (my laptop crashes when it's at 100% for too long, and no, it's not a cooling issue). Plus, since this PR is mostly about GPU acceleration/utilisation, I doubt this would be the place to implement that?
PyTorch currently supports only CUDA 11.7 and 11.8. In order not to break anything (like user environments etc.), 11.8 was the pick.
Apparently I cd'ed back one directory too many... Thanks for the feedback.
@maozdemir, thanks for responding. Looks like you might be in a different time zone.
If the project is to use GPUs generally or well, pycuda is inevitable, even if it's not strictly required today. In my LTH opinion, calling a library function is always preferable to invoking a shell: invoking a shell has a long history of introducing both security vulnerabilities and failure opportunities.
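As an example of the library-call route, the same information that a shelled-out `nvidia-smi` provides can be read through torch.cuda, which is already a dependency here; pycuda exposes similar queries:

```python
# Query the GPU through a library call instead of shelling out to nvidia-smi.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GiB")
else:
    print("No CUDA device detected; falling back to CPU.")
```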
I mention the two preference sets, CPU and GPU, later. Until we get somewhere like that, the user should be able to switch from GPU back to CPU by just changing the model name, while the fail-it strategy will require two modifications each time, with the usual reduction in the likelihood of success.
Since the thread count setting proposed is tied to the capabilities of the machine (and set conservatively), it should not choke a machine that is capable of running an LLM with same-day service. Note that this is set to the real core count, not the virtual thread count; thus, on your machine it would use 4 threads. You could also add an optional env value for max threads... And it is a huge win on the machines that I run on. BTW, I have an idea for accurate determination of memory requirements on the GPU for setting the layer counts; I will let you know if it actually works. Thanks for listening to my questions and considering my suggestions.
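A rough sketch of that conservative thread-count idea, using physical cores rather than logical threads, plus an optional override (MAX_THREADS is an illustrative name, not something defined by this PR):

```python
# Pick a conservative thread count: physical cores, optionally capped by MAX_THREADS.
import os

import psutil  # third-party; logical=False returns the real core count

physical_cores = psutil.cpu_count(logical=False) or 1
n_threads = min(physical_cores, int(os.environ.get("MAX_THREADS", physical_cores)))
print(f"Using {n_threads} of {physical_cores} physical cores")
```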
I would love further instructions on how to exactly specify the model for GPU usage. When trying to run the GPU version, the ingest works fine, but this does not:
I have a feeling that there needs to be clear documentation for that.
@Kaszanas Check the repo's README. https://github.com/maozdemir/privateGPT/tree/gpu
I encountered another issue:
Unfortunately the README doesn't explain that very well, sorry.
@Kaszanas probably something went wrong during the compilation of llama-cpp-python; can you try uninstalling and installing it again?
@johnbrisbin can you use this wizard? https://pytorch.org/get-started/locally/ Also, I'll read your comment when I have time, I'm not ignoring it. :)
@maozdemir Compilation ran successfully, and GPU ingest works as intended. This issue is only present when trying to run the privateGPT script. I could try and show you step by step, but I don't know if I will be able to find the time. Will let you know if I do.
@Kaszanas well, the only time I saw that error was when I cloned the llama.cpp repo into the wrong directory... I'll be waiting for your feedback. GPU ingesting is not related to the llama-cpp-python package, or llama.cpp; it uses HuggingFace's CUDA implementation. llama.cpp uses cuBLAS, which is run on
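For context, this is roughly what the HuggingFace/CUDA path for ingestion looks like. A sketch assuming LangChain's HuggingFaceEmbeddings wrapper; the model name below is privateGPT's default at the time and may differ from your .env:

```python
# Place the sentence-transformers embedding model on the CUDA device for ingestion.
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="all-MiniLM-L6-v2",    # default embeddings model; adjust to your .env
    model_kwargs={"device": "cuda"},  # use "cpu" to stay off the GPU
)
```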
I ran the commands straight from the README.
First of all, great contribution, I was looking out for this and was excited to see someone put it together so quickly. Unfortunately I haven't got it to use my GPU. I've deleted and pulled everything so many times. Made sure to make adjustments to the env and the script, made sure to pull and build following your instructions. Everything goes smoothly but it still uses my CPU instead of my GPU.
Are you on an NVIDIA GPU?
Yes I am, currently a 12 GB 3060. I know you had to ask because there will always be someone who will try to run it on a Radeon graphics card lol.
You are right, I should add that. I am still investigating the issue you are having, testing on fresh Windows installs. @StephenDWright: when you launch privateGPT.py, do you see CUBLAS=1 or CUBLAS=0 at the bottom of the model properties?
@maozdemir I see Blas = 0. I am assuming you are referring to that. This is the output to the terminal. Thanks for taking the time to troubleshoot btw.
Using embedded DuckDB with persistence: data will be stored in: db
@StephenDWright you're welcome, this will help me with writing a better README too :) so thanks for your feedback. The possible cause is that your llama-cpp-python was not compiled with CUBLAS. Can you try uninstalling the existing package and then reinstalling with the current instructions (with those environment variables etc.)? I am not sure why people are having trouble; I have actually run it on a clean Windows install successfully, and also on several Linux machines...
Before I do that: I did it again yesterday, and this was some of the output while building after running this command. I took this output to mean it was compiling with CUBLAS. Extract of terminal output:
Not searching for unused variables given on the command line.
copying llama_cpp\llama.py -> _skbuild\win-amd64-3.11\cmake-install\llama_cpp\llama.py
running install
@StephenDWright alright, that doesn't seem to be the issue. Assuming that you already have CUDA drivers installed, the only thing that comes to my mind is torch
Yes, I used that prior to commenting, and it worked. I was just pointing out an implicit requirement on top of the current privateGPT, i.e. a pre-2 PyTorch worked, but for GPU support and CUDA 11.8 you need the 2+ version.
@StephenDWright I worked through a similar problem yesterday. The CUBLAS lines will show even when the GPU is not active with the right version of LlamaCpp running.
Those three things bit my hindquarters yesterday.
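Related to the point about the CUBLAS lines showing even when the GPU is not active: a cuBLAS build only offloads work when a non-zero layer count is requested. A hedged sketch, assuming a LangChain version whose LlamaCpp wrapper exposes n_gpu_layers; the path and layer count are illustrative, not the PR's actual values:

```python
# BLAS output can appear even with zero offloaded layers; n_gpu_layers must be > 0.
from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="models/ggml-model.bin",  # illustrative path
    n_ctx=1024,
    n_gpu_layers=20,  # 0 keeps everything on the CPU even with a cuBLAS build
)
```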
Clean Windows? That is the definition of an oxymoron.
@johnbrisbin Thank you for the feedback. I am also trying to run it in VS Code, in a venv. I have deleted the folder and environment and cloned so many times to start the process over. 😤 I will try what you suggested regarding the cache. At least I know what I am looking for if it ever works. So you are saying that using python3 and pip3 sometimes and then using python and pip can actually cause problems. Interesting. Thanks again.
@StephenDWright, I would suggest you try 'where python' and 'where python3' in the venv terminal to check that. But for me, an active virtual environment seems to disable the where command, so it outputs nothing. I had to run a simple script that imports sys and prints sys.argv[0] to find where the pythons are really located. And they were different.
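The kind of one-liner described above, for the record (sys.executable added here as it is often the more telling value):

```python
# Print which script path and interpreter are actually in use inside the venv.
import sys

print(sys.argv[0])      # path of the script being run
print(sys.executable)   # path of the interpreter itself
```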
This contribution is massive, the community has been asking for it. Thanks a lot! Please take a look at my comments and let me know if you feel it is ready to merge @maozdemir
Really brilliant. Even though I am about to give up on getting the GPU to work for now after an evening of trying, it is still a great addition. 👍👍
Some really good news. Just turning on the CUDA option made a huge improvement for me. I have a collection of 1900+ epub books, which I have ingested more than once. It took 15 hours straight to ingest 1500 of them on a 16-core/32-thread 64 GB machine, at about 100 per hour. It looks like your very short test was dominated by initialization time. With a real load (the whole 1900 books amount to 3.75 million chunks) the benefits are huge: 7x faster. Since the machine I have is very fast for CPU ops, the benefits for people with less capable main processors will be even better, assuming a normal video card. Congratulations, @maozdemir
Would this work with AMD GPUs if PyTorch is configured with ROCm?
I looked into this recently and... indications are not good. There is not a one-to-one relationship between the CUDA and ROCm APIs, so it looks like a simple translation is right out.
@maozdemir and @johnbrisbin I finally got it to work with the GPU. Sharing so it can hopefully help with troubleshooting in the future. I encountered the following issue while setting up a virtual environment in VS Code:
Problem: Despite manually preventing llama.cpp installation from the requirements file, installing version 0.1.54, and detecting CUDA references during the compilation process, the GPU was not being utilized. The NVIDIA toolkit was detected, and the compilation seemed successful, but Blas was at 0 and there were no indications of GPU offloading.
Action: I found two folders in my environment's site packages: "llama_cpp" and "llama_cpp_python-0.1.54-py3.11-win-amd64.egg (0.1.54)". I deleted the "llama_cpp" folder and replaced it with the same folder from the "..win-amd..(0.1.54)" directory. It's unclear if copying the folder was necessary; deleting the folder might have resolved the issue alone. The GPU is now being utilized. Regardless of the exact fix, it's evident that the problem stemmed from using the incorrect version of "llama_cpp", despite my attempts to manually install the correct one.
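A quick way to confirm which llama_cpp build Python actually resolves, which would have surfaced this mismatch earlier (sketch; the version attribute may not be present in every release):

```python
# Check which llama_cpp folder is imported and, if available, its version.
import llama_cpp

print(llama_cpp.__file__)
print(getattr(llama_cpp, "__version__", "version attribute not present"))
```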
Thanks! @imartinez, I'll have to rewrite a good README with clearer instructions to enable GPU, then it'll be ready to merge :)
Thanks @StephenDWright for sharing your experience. And thanks @maozdemir! Let me know when you are done for a final review and merge. This first GPU support could be explicitly marked as experimental in the README, and only for experienced users, given the complexity of the installation.
I made it work with my AMD GPU, an RX 6950 XT.
Getting this message when attempting to enable GPU support:
Ubuntu 20.04. What am I missing?
EDIT: With CMAKE_ARGS="-DLLAMA_CUBLAS=on" and FORCE_CMAKE=1 on separate lines, it appears to build successfully.
EDIT 2: Not sure if the above worked; I get the following when starting privateGPT.py:
Would you like to review the current state of this PR? (Sorry for the force push...)
If possible, could you verify whether GPU-accelerated inference works with GPT4All? If it does not, adding additional information to the README might be needed. Last time I ran this and compiled stuff by hand, the embedding ran fine with the GPU, but inference failed. Don't remember why.
@Kaszanas, the thing is, there is no way of making it work with GPT4All, at least not that I know of.
This pull request enables the GPU to be used with privateGPT (along with some markdownlint enhancements).
Fixes #121
Fixes #306