The output of `python env.py`
```text
python env.py
Collecting environment information...
PyTorch version: 2.4.1+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A
OS: Microsoft Windows 11 Pro
GCC version: (MinGW-W64 x86_64-ucrt-posix-seh, built by Brecht Sanders, r8) 13.2.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: N/A
Python version: 3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)] (64-bit runtime)
Python platform: Windows-10-10.0.22631-SP0
Is CUDA available: True
CUDA runtime version: 12.4.131
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla P40
Nvidia driver version: 551.78
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture=9
CurrentClockSpeed=4201
DeviceID=CPU0
Family=107
L2CacheSize=16384
L2CacheSpeed=
Manufacturer=AuthenticAMD
MaxClockSpeed=4201
Name=AMD Ryzen 9 7950X3D 16-Core Processor
ProcessorType=3
Revision=24834
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.1+cu124
[pip3] torchaudio==2.4.1
[pip3] torchvision==0.20.1+cu124
[pip3] transformers==4.45.2
[pip3] triton==3.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
Aphrodite Version: 0.6.4
Aphrodite Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
Could not collect
```
🐛 Describe the bug
```text
aphrodite run .\SakuraLLM.Sakura-14B-Qwen2.5-v1.0-GPTQ-Int4\ --dtype=float16 --host http://127.0.0.1 --port 8888 --gpu-memory-utilization 0.8
WARNING: gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO: Multiprocessing frontend to use tcp://127.0.0.1:51657 for RPC Path.
INFO: Started engine process with PID 83388
W:\windows_cuda\aphrodite-engine\venv\lib\site-packages\zmq\_future.py:724: RuntimeWarning: Proactor event loop does not implement add_reader family of methods required for zmq. Registering an additional selector thread for add_reader support via tornado. Use `asyncio.set_event_loop_policy(WindowsSelectorEventLoopPolicy())` to avoid this warning.
  self._get_loop()
WARNING: Casting torch.bfloat16 to torch.float16.
WARNING: gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO: -------------------------------------------------------------------------------------
INFO: Initializing Aphrodite Engine (v0.6.3.post1 commit f0e00f1b) with the following config:
INFO: Model = '.\\SakuraLLM.Sakura-14B-Qwen2.5-v1.0-GPTQ-Int4\\'
INFO: DataType = torch.float16
INFO: Tensor Parallel Size = 1
INFO: Pipeline Parallel Size = 1
INFO: Disable Custom All-Reduce = False
INFO: Quantization Format = 'gptq'
INFO: Context Length = 32768
INFO: Enforce Eager Mode = False
INFO: Prefix Caching = False
INFO: Device = device(type='cuda')
INFO: Guided Decoding Backend = DecodingConfig(guided_decoding_backend='lm-format-enforcer')
INFO: -------------------------------------------------------------------------------------
INFO: Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO: Using XFormers backend.
[W1128 01:03:25.000000000 socket.cpp:697] [c10d] The client socket has failed to connect to [SORANET]:51675 (system error: 10049 - The requested address is not valid in its context.).
INFO: Loading model .\SakuraLLM.Sakura-14B-Qwen2.5-v1.0-GPTQ-Int4\...
INFO: Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO: Using XFormers backend.
Loading model weights... 100% 9.31/9.31 GiB 0:00:07
INFO: Model weights loaded in 8.86 seconds.
INFO: Total model weights memory usage: 9.38 GiB
INFO: Profiling peak memory usage...
```
The process gets stuck at the "Profiling peak memory usage..." step and never progresses.
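For context, here is a rough back-of-the-envelope budget for what the profiling step is trying to reserve. The VRAM figure assumes the Tesla P40's 24 GiB; the weight figure is taken from the log above. This is only illustrative arithmetic, not a description of Aphrodite's internal accounting:

```python
# Rough VRAM budget for the reported run.
# Assumed: Tesla P40 exposes 24 GiB of VRAM.
total_vram_gib = 24.0   # Tesla P40 (assumed)
gpu_mem_util = 0.8      # from --gpu-memory-utilization 0.8
weights_gib = 9.38      # "Total model weights memory usage" from the log

# Fraction of VRAM the engine is allowed to claim overall.
budget_gib = total_vram_gib * gpu_mem_util

# What remains for KV cache and activations after the weights are loaded.
headroom_gib = budget_gib - weights_gib

print(f"budget: {budget_gib:.2f} GiB, headroom after weights: {headroom_gib:.2f} GiB")
# → budget: 19.20 GiB, headroom after weights: 9.82 GiB
```

So the profiling pass should still have roughly 9.8 GiB to work with, which suggests the hang is not a simple out-of-memory condition.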