[Bug]: loading a GPTQ-INT4 model on windows with a P40 #847

sorasoras opened this issue Nov 27, 2024 · 0 comments
Labels: bug

Your current environment

The output of `python env.py`:

```text
Collecting environment information...
PyTorch version: 2.4.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Microsoft Windows 11 Pro
GCC version: (MinGW-W64 x86_64-ucrt-posix-seh, built by Brecht Sanders, r8) 13.2.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: N/A

Python version: 3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22631-SP0
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla P40
Nvidia driver version: 551.78
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture=9
CurrentClockSpeed=4201
DeviceID=CPU0
Family=107
L2CacheSize=16384
L2CacheSpeed=
Manufacturer=AuthenticAMD
MaxClockSpeed=4201
Name=AMD Ryzen 9 7950X3D 16-Core Processor
ProcessorType=3
Revision=24834

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.1+cu124
[pip3] torchaudio==2.4.1
[pip3] torchvision==0.20.1+cu124
[pip3] transformers==4.45.2
[pip3] triton==3.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
Aphrodite Version: 0.6.4
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect
```

🐛 Describe the bug

```text
aphrodite run .\SakuraLLM.Sakura-14B-Qwen2.5-v1.0-GPTQ-Int4\ --dtype=float16 --host http://127.0.0.1 --port 8888 --gpu-memory-utilization0.8
```
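Two details of this invocation look off as transcribed (assuming the standard Aphrodite CLI, which follows uvicorn conventions): `--host` normally takes a bare address rather than a URL, and `--gpu-memory-utilization0.8` is missing a space before its value — as written, argument parsing should have rejected it, so that part is likely a paste artifact, since the log below shows the engine did start. A corrected form would be:

```text
aphrodite run .\SakuraLLM.Sakura-14B-Qwen2.5-v1.0-GPTQ-Int4\ --dtype=float16 --host 127.0.0.1 --port 8888 --gpu-memory-utilization 0.8
```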

```text
WARNING:  gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO:     Multiprocessing frontend to use tcp://127.0.0.1:51657 for RPC Path.
INFO:     Started engine process with PID 83388
W:\windows_cuda\aphrodite-engine\venv\lib\site-packages\zmq\_future.py:724: RuntimeWarning: Proactor event loop does not implement add_reader family of methods required for zmq. Registering an additional selector thread for add_reader support via tornado. Use `asyncio.set_event_loop_policy(WindowsSelectorEventLoopPolicy())` to avoid this warning.
  self._get_loop()
WARNING:  Casting torch.bfloat16 to torch.float16.
WARNING:  gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO:     -------------------------------------------------------------------------------------
INFO:     Initializing Aphrodite Engine (v0.6.3.post1 commit f0e00f1b) with the following config:
INFO:     Model = '.\\SakuraLLM.Sakura-14B-Qwen2.5-v1.0-GPTQ-Int4\\'
INFO:     DataType = torch.float16
INFO:     Tensor Parallel Size = 1
INFO:     Pipeline Parallel Size = 1
INFO:     Disable Custom All-Reduce = False
INFO:     Quantization Format = 'gptq'
INFO:     Context Length = 32768
INFO:     Enforce Eager Mode = False
INFO:     Prefix Caching = False
INFO:     Device = device(type='cuda')
INFO:     Guided Decoding Backend = DecodingConfig(guided_decoding_backend='lm-format-enforcer')
INFO:     -------------------------------------------------------------------------------------
INFO:     Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO:     Using XFormers backend.
[W1128 01:03:25.000000000 socket.cpp:697] [c10d] The client socket has failed to connect to [SORANET]:51675 (system error: 10049 - The requested address is not valid in its context.).
INFO:     Loading model .\SakuraLLM.Sakura-14B-Qwen2.5-v1.0-GPTQ-Int4\...
INFO:     Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO:     Using XFormers backend.
⠏ Loading model weights... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸ 100% 9.31/9.31 GiB 0:00:07
INFO:     Model weights loaded in 8.86 seconds.
INFO:     Total model weights memory usage: 9.38 GiB
INFO:     Profiling peak memory usage...
```
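The zmq RuntimeWarning above is benign on its own, but it states its own workaround. A minimal sketch of applying it before the server starts (exactly where to hook this depends on Aphrodite's entry point, so treat the placement as an assumption):

```python
import sys
import asyncio

# On Windows, the default Proactor event loop lacks the add_reader/add_writer
# methods that zmq needs; the selector-based loop avoids the extra tornado
# selector thread mentioned in the warning.
if sys.platform == "win32":
    asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
```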

It gets stuck at "Profiling peak memory usage..." and never progresses.
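For context on where this hangs: in vLLM-derived engines such as Aphrodite, the "Profiling peak memory usage" step typically runs a single dummy forward pass at the maximum configured batch/sequence size to measure peak activation memory before sizing the KV cache. A hang at this step usually means a kernel launched during that pass never completes, which would be consistent with a GPTQ kernel misbehaving on a Pascal-class GPU (the P40 is compute capability 6.1). A rough sketch of the phase, assuming Aphrodite mirrors vLLM's structure (the names `model_runner` and `profile_run` here are illustrative, not the actual internals):

```python
import torch

def profile_available_kv_cache_bytes(model_runner, gpu_memory_utilization: float) -> int:
    """Illustrative version of the 'Profiling peak memory usage' phase."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()

    # One forward pass with dummy inputs at the largest supported shape;
    # if a quantized-matmul kernel hangs on this GPU, it hangs here.
    model_runner.profile_run()
    torch.cuda.synchronize()

    free_bytes, total_bytes = torch.cuda.mem_get_info()
    peak_bytes = torch.cuda.max_memory_allocated()

    # Whatever fits under the utilization cap after weights and peak
    # activations is handed to the KV cache.
    return int(total_bytes * gpu_memory_utilization) - peak_bytes
```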
