Vigogne on CPU #31

Open
scenaristeur opened this issue Sep 22, 2023 · 12 comments

Comments

@scenaristeur

Hi, is it possible to use Vigogne on a CPU, for example with LocalAI?

When I try to install Vigogne, I get this message:

pip install --no-build-isolation flash-attn
Defaulting to user installation because normal site-packages is not writeable
Collecting flash-attn
  Downloading flash_attn-2.2.4.post1.tar.gz (2.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2.3/2.3 MB 8.2 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error
  
  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [20 lines of output]
      fatal: not a git repository (or any of the parent directories): .git
      /tmp/pip-install-psod8lyo/flash-attn_9087f455708048fcb436e3488114a325/setup.py:79: UserWarning: flash_attn was requested, but nvcc was not found.  Are you sure your environment has nvcc available?  If you're installing within a container from https://hub.docker.com/r/pytorch/pytorch, only images whose names contain 'devel' will provide nvcc.
        warnings.warn(
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/tmp/pip-install-psod8lyo/flash-attn_9087f455708048fcb436e3488114a325/setup.py", line 136, in <module>
          CUDAExtension(
        File "/home/smag/.local/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1048, in CUDAExtension
          library_dirs += library_paths(cuda=True)
        File "/home/smag/.local/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1179, in library_paths
          if (not os.path.exists(_join_cuda_home(lib_dir)) and
        File "/home/smag/.local/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2223, in _join_cuda_home
          raise EnvironmentError('CUDA_HOME environment variable is not set. '
      OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
      
      
      torch.__version__  = 2.0.1+cu117
      
      
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

@loranger

Same issue here. I'm trying to run Vigogne on a Mac with an M1 Max.

@LeMoussel

See the flash-attention repository on GitHub.
Requirements:
CUDA 11.6 and above.

=> flash-attn requires CUDA, so it cannot run on a CPU.
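
A quick way to check this before attempting the install is sketched below. It is only a heuristic, not part of Vigogne or flash-attn: the function name `cuda_toolchain_available` is made up for illustration, and it assumes that having `nvcc` on the PATH or a valid `CUDA_HOME`/`CUDA_PATH` directory is a reasonable sign that a CUDA toolkit is installed.

```python
import os
import shutil


def cuda_toolchain_available() -> bool:
    """Return True if a CUDA toolkit looks reachable from this environment.

    Hypothetical helper: a heuristic only, not an official check.
    """
    # The flash-attn build warns when nvcc cannot be found.
    has_nvcc = shutil.which("nvcc") is not None
    # The traceback above comes from CUDA_HOME not being set.
    cuda_home = os.environ.get("CUDA_HOME") or os.environ.get("CUDA_PATH")
    has_cuda_home = bool(cuda_home) and os.path.isdir(cuda_home)
    return has_nvcc or has_cuda_home


if __name__ == "__main__":
    if cuda_toolchain_available():
        print("CUDA toolkit found: building flash-attn may work.")
    else:
        print("No CUDA toolkit found: skip flash-attn and use a CPU path.")
```

On a machine without CUDA (such as the ones in the reports above), this should print the second message.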

@loranger

That makes sense…
Though the README says:

💡 The screencast below shows the current 🦙 Vigogne-7B-Chat model running on Apple M1 Pro using 4GB of weights (no sped up).

There must be a way, but only @bofenghuang knows it.

@tpaviot

tpaviot commented Oct 12, 2023

You can use llama.cpp (https://github.com/ggerganov/llama.cpp) to run a GGML/GGUF model from TheBloke's Hugging Face account (see https://huggingface.co/TheBloke), for instance https://huggingface.co/TheBloke/Vigogne-2-7B-Chat-GGML. This works quite well. Use Q5_K_M quantization for the best balance between performance, memory consumption, and computation time.

BTW, I just tested the Mistral AI model (https://huggingface.co/TheBloke/Kimiko-Mistral-7B-GGUF), which gives very good results in French.
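
For the llama.cpp route, a minimal CPU-only sketch using the llama-cpp-python bindings might look like the following. It assumes llama-cpp-python is installed and that a quantized model file has already been downloaded. The function name `generate`, the context size, and the file name in the usage comment are illustrative assumptions, not values taken from the Vigogne repository.

```python
from pathlib import Path


def generate(model_path: str, prompt: str, max_tokens: int = 128,
             n_gpu_layers: int = 0) -> str:
    """Run one completion against a local quantized model file."""
    if not Path(model_path).is_file():
        raise FileNotFoundError(f"Model file not found: {model_path}")

    # Imported lazily so this module still loads where llama-cpp-python
    # is not installed.
    from llama_cpp import Llama

    # n_gpu_layers=0 keeps every layer on the CPU.
    llm = Llama(model_path=model_path, n_ctx=2048, n_gpu_layers=n_gpu_layers)
    result = llm(prompt, max_tokens=max_tokens)
    return result["choices"][0]["text"]


# Usage (hypothetical local file name, downloaded beforehand):
# print(generate("./vigogne-2-7b-chat.Q5_K_M.gguf", "Bonjour, qui es-tu ?"))
```

The prompt here is a plain string; chat-tuned models usually expect a specific prompt template, which this sketch does not attempt to reproduce.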

@bofenghuang
Owner

Hi,

Sorry for my late response.

Flash Attention is a powerful tool for accelerating training and inference and for reducing memory usage when working with GPUs in PyTorch. Please note that it targets Ampere and newer GPU architectures, and its support for older architectures is limited (see the doc and this issue).

If you're planning to perform inference with llama.cpp, especially on a Mac, there's no need to install Flash Attention. As @tpaviot pointed out, you can find ready-to-use quantized versions of specific models on the Hugging Face Hub, thanks to TheBloke. Otherwise, you may need to quantize them yourself (see the doc here).

PS: We've also released a Mistral-7B-based model here. Feel free to give it a try and share your feedback with us :)

@LeMoussel

@tpaviot Did you test TheBloke/Kimiko-Mistral-7B-GGUF on Colab?
Do you have an example with results in French?

@tpaviot

tpaviot commented Oct 12, 2023

@LeMoussel yep, although llama.cpp can run on CPU, it supports GPUs as well. It works very well on Colab at an affordable cost (if you use a T4) and integrates smoothly with langchain or llama_index. What exactly do you mean by "an example"?

@LeMoussel

I haven't been able to run llama_index LlamaCPP with Mistral-7B-v0.1-GGUF/mistral-7b-v0.1.Q4_K_M.gguf on Colab with a T4 GPU: I still get BLAS = 0 instead of BLAS = 1.
If you have example code, I'm interested.
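
To check the BLAS flag programmatically rather than by eye, a small parser along these lines could help. It assumes the system-info line is a `|`-separated list of `NAME = VALUE` pairs, as the `BLAS = 0` / `BLAS = 1` output above suggests. The sample line is illustrative only, and the function names are hypothetical.

```python
import re


def parse_system_info(line: str) -> dict:
    """Parse 'NAME = VALUE | NAME = VALUE' pairs into a dict of ints."""
    pairs = re.findall(r"(\w+)\s*=\s*(\d+)", line)
    return {name: int(value) for name, value in pairs}


def blas_enabled(line: str) -> bool:
    """Return True only if the line reports BLAS = 1."""
    return parse_system_info(line).get("BLAS", 0) == 1


# Illustrative line only; the exact set of fields varies between builds.
sample = "AVX = 1 | AVX2 = 1 | FMA = 1 | BLAS = 0 | SSE3 = 1"
print(blas_enabled(sample))
```

A result of `False` means the installed build was compiled without BLAS or GPU support, which usually points to the build flags not reaching the compile step.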

@tpaviot

tpaviot commented Oct 13, 2023

This is what I use to install the latest llama-cpp-python with pip on Colab:

if HAVE_CUDA:
    print("cuda available, build llama-cpp-python on GPU using CUBLAS")
    os.environ["CMAKE_ARGS"] = "-DLLAMA_CUBLAS=ON"
else:
    print("Build llama-cpp-python on CPU using OPENBLAS")
    os.environ["CMAKE_ARGS"] = "-DLLAMA_QKK_64=1 -DCMAKE_BUILD_TYPE=Release -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"
!CMAKE_ARGS=${CMAKE_ARGS} FORCE_CMAKE=1 pip install -v llama-cpp-python

@LeMoussel

On Colab, HAVE_CUDA is not defined.
I replaced it with this:

import os
import torch

if torch.cuda.is_available():
    print("cuda available, build llama-cpp-python on GPU using CUBLAS")
    os.environ["CMAKE_ARGS"] = "-DLLAMA_CUBLAS=ON"
else:
    print("Build llama-cpp-python on CPU using OPENBLAS")
    os.environ["CMAKE_ARGS"] = "-DLLAMA_QKK_64=1 -DCMAKE_BUILD_TYPE=Release -DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"

!CMAKE_ARGS=${CMAKE_ARGS} FORCE_CMAKE=1 pip install -v llama-cpp-python
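
Outside a notebook, where the `!` shell syntax is not available, the same logic can be sketched in plain Python. The CMake flags are copied from the snippets above and may differ in newer llama-cpp-python releases. The helper names `build_install` and `install` are hypothetical, and the caller decides how CUDA is detected (for example with `torch.cuda.is_available()`).

```python
import os
import subprocess
import sys

# Flags copied from the snippets in this thread.
GPU_CMAKE_ARGS = "-DLLAMA_CUBLAS=ON"
CPU_CMAKE_ARGS = (
    "-DLLAMA_QKK_64=1 -DCMAKE_BUILD_TYPE=Release "
    "-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS"
)


def build_install(have_cuda: bool):
    """Return (command, env) for a source build of llama-cpp-python."""
    env = dict(os.environ)
    env["CMAKE_ARGS"] = GPU_CMAKE_ARGS if have_cuda else CPU_CMAKE_ARGS
    env["FORCE_CMAKE"] = "1"
    command = [sys.executable, "-m", "pip", "install", "-v", "llama-cpp-python"]
    return command, env


def install(have_cuda: bool) -> None:
    """Run the pip build with the selected flags in the environment."""
    command, env = build_install(have_cuda)
    subprocess.run(command, env=env, check=True)
```

Passing the flags through `env` instead of a shell string avoids quoting problems with the space-separated CPU flags.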

@mauryaland

PS: We've also released a Mistral-7B-based model here. Feel free to give it a try and share your feedback with us :)

Hi @bofenghuang, excellent news, thanks for your great work! I have a question about this model. Should I fine-tune on my own French instruction dataset starting directly from Vigostral, or should I start from the pre-trained Mistral-7B? Thanks in advance for your reply.

@bofenghuang
Owner

Hello @mauryaland, I would recommend beginning with the pre-trained Mistral-7B model if you have sufficient instruction data. Otherwise, you can consider using Vigostral, which has already undergone fine-tuning on French instructions.
