[Bug]: loading a GPTQ-INT4 model on windows with a P40 #847

sorasoras opened this issue Nov 27, 2024 · 0 comments
Labels: bug

Your current environment

The output of `python env.py`:

```text
Collecting environment information...
PyTorch version: 2.4.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Pro
GCC version: (MinGW-W64 x86_64-ucrt-posix-seh, built by Brecht Sanders, r8) 13.2.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: N/A

Python version: 3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22631-SP0
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla P40
Nvidia driver version: 551.78
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture=9
CurrentClockSpeed=4201
DeviceID=CPU0
Family=107
L2CacheSize=16384
L2CacheSpeed=
Manufacturer=AuthenticAMD
MaxClockSpeed=4201
Name=AMD Ryzen 9 7950X3D 16-Core Processor
ProcessorType=3
Revision=24834

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.1+cu124
[pip3] torchaudio==2.4.1
[pip3] torchvision==0.20.1+cu124
[pip3] transformers==4.45.2
[pip3] triton==3.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
Aphrodite Version: 0.6.4
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect
```

🐛 Describe the bug

```text
aphrodite run .\SakuraLLM.Sakura-14B-Qwen2.5-v1.0-GPTQ-Int4\ --dtype=float16 --host http://127.0.0.1 --port 8888 --gpu-memory-utilization0.8
```
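Two details of this invocation look off as transcribed (assuming the standard Aphrodite CLI, which follows uvicorn conventions): `--host` normally takes a bare address rather than a URL, and `--gpu-memory-utilization0.8` is missing a space before its value — as written, argument parsing should have rejected it, so that part is likely a paste artifact, since the log below shows the engine did start. A corrected form would be:

```text
aphrodite run .\SakuraLLM.Sakura-14B-Qwen2.5-v1.0-GPTQ-Int4\ --dtype=float16 --host 127.0.0.1 --port 8888 --gpu-memory-utilization 0.8
```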

```text
WARNING:  gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO:     Multiprocessing frontend to use tcp://127.0.0.1:51657 for RPC Path.
INFO:     Started engine process with PID 83388
W:\windows_cuda\aphrodite-engine\venv\lib\site-packages\zmq\_future.py:724: RuntimeWarning: Proactor event loop does not implement add_reader family of methods required for zmq. Registering an additional selector thread for add_reader support via tornado. Use `asyncio.set_event_loop_policy(WindowsSelectorEventLoopPolicy())` to avoid this warning.
  self._get_loop()
WARNING:  Casting torch.bfloat16 to torch.float16.
WARNING:  gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO:     -------------------------------------------------------------------------------------
INFO:     Initializing Aphrodite Engine (v0.6.3.post1 commit f0e00f1b) with the following config:
INFO:     Model = '.\\SakuraLLM.Sakura-14B-Qwen2.5-v1.0-GPTQ-Int4\\'
INFO:     DataType = torch.float16
INFO:     Tensor Parallel Size = 1
INFO:     Pipeline Parallel Size = 1
INFO:     Disable Custom All-Reduce = False
INFO:     Quantization Format = 'gptq'
INFO:     Context Length = 32768
INFO:     Enforce Eager Mode = False
INFO:     Prefix Caching = False
INFO:     Device = device(type='cuda')
INFO:     Guided Decoding Backend = DecodingConfig(guided_decoding_backend='lm-format-enforcer')
INFO:     -------------------------------------------------------------------------------------
INFO:     Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO:     Using XFormers backend.
[W1128 01:03:25.000000000 socket.cpp:697] [c10d] The client socket has failed to connect to [SORANET]:51675 (system error: 10049 - The requested address is not valid in its context.).
INFO:     Loading model .\SakuraLLM.Sakura-14B-Qwen2.5-v1.0-GPTQ-Int4\...
INFO:     Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO:     Using XFormers backend.
⠏ Loading model weights... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸ 100% 9.31/9.31 GiB 0:00:07
INFO:     Model weights loaded in 8.86 seconds.
INFO:     Total model weights memory usage: 9.38 GiB
INFO:     Profiling peak memory usage...
```
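The zmq RuntimeWarning above is benign on its own, but it states its own workaround. A minimal sketch of applying it before the server starts (exactly where to hook this depends on Aphrodite's entry point, so treat the placement as an assumption):

```python
import sys
import asyncio

# On Windows, the default Proactor event loop lacks the add_reader/add_writer
# methods that zmq needs; the selector-based loop avoids the extra tornado
# selector thread mentioned in the warning.
if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
```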

It gets stuck at "Profiling peak memory usage..." and never progresses.
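For context on where this hangs: in vLLM-derived engines such as Aphrodite, the "Profiling peak memory usage" step typically runs a single dummy forward pass at the maximum configured batch/sequence size to measure peak activation memory before sizing the KV cache. A hang at this step usually means a kernel launched during that pass never completes, which would be consistent with a GPTQ kernel misbehaving on a Pascal-class GPU (the P40 is compute capability 6.1). A rough sketch of the phase, assuming Aphrodite mirrors vLLM's structure (the names `model_runner` and `profile_run` here are illustrative, not the actual internals):

```python
import torch

def profile_available_kv_cache_bytes(model_runner, gpu_memory_utilization: float) -> int:
    """Illustrative version of the 'Profiling peak memory usage' phase."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    # One forward pass with dummy inputs at the largest supported shape;
    # if a quantized-matmul kernel hangs on this GPU, it hangs here.
    model_runner.profile_run()
    torch.cuda.synchronize()

    free_bytes, total_bytes = torch.cuda.mem_get_info()
    peak_bytes = torch.cuda.max_memory_allocated()

    # Whatever fits under the utilization cap after weights and peak
    # activations is handed to the KV cache.
    return int(total_bytes * gpu_memory_utilization) - peak_bytes
```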
