runner-llm-1 | 2024-12-20 20:30:05,626 INFO: Started server process [1]
runner-llm-1 | 2024-12-20 20:30:05,626 INFO: Waiting for application startup.
runner-llm-1 | 2024-12-20 20:30:05,636 - app.utils.hardware - INFO - NVML initialized successfully.
runner-llm-1 | 2024-12-20 20:30:10,160 - app.pipelines.llm - INFO - Initializing LLM pipeline
runner-llm-1 | 2024-12-20 20:30:10,162 - app.pipelines.llm - INFO - Model has 32 attention heads and 32 layers
runner-llm-1 | 2024-12-20 20:30:10,162 - app.pipelines.llm - INFO - Using tensor parallel size: 1
runner-llm-1 | 2024-12-20 20:30:10,162 - app.pipelines.llm - INFO - Using pipeline parallel size: 4
runner-llm-1 | 2024-12-20 20:30:10,162 - app.pipelines.llm - INFO - Total GPUs used: 4
runner-llm-1 | 2024-12-20 20:30:10,169 - app.pipelines.llm - INFO - Using BFloat16 precision
runner-llm-1 | INFO 12-20 20:30:15 config.py:478] This model supports multiple tasks: {'classify', 'score', 'reward', 'embed', 'generate'}. Defaulting to 'generate'.
runner-llm-1 | INFO 12-20 20:30:15 config.py:1216] Defaulting to use mp for distributed inference
runner-llm-1 | WARNING 12-20 20:30:15 config.py:596] Async output processing can not be enabled with pipeline parallel
runner-llm-1 | INFO 12-20 20:30:15 llm_engine.py:249] Initializing an LLM engine (v0.6.5) with config: model='/models/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659', speculative_config=None, tokenizer='/models/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=4, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/models/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=False, mm_cache_preprocessor=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":128}, use_cached_outputs=False,
runner-llm-1 | WARNING 12-20 20:30:15 multiproc_worker_utils.py:312] Reducing Torch parallelism from 16 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
runner-llm-1 | INFO 12-20 20:30:15 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
runner-llm-1 | INFO 12-20 20:30:16 selector.py:120] Using Flash Attention backend.
runner-llm-1 | (VllmWorkerProcess pid=260) INFO 12-20 20:30:20 selector.py:120] Using Flash Attention backend.
runner-llm-1 | (VllmWorkerProcess pid=260) INFO 12-20 20:30:20 multiproc_worker_utils.py:222] Worker ready; awaiting tasks
runner-llm-1 | (VllmWorkerProcess pid=259) INFO 12-20 20:30:20 selector.py:120] Using Flash Attention backend.
runner-llm-1 | (VllmWorkerProcess pid=259) INFO 12-20 20:30:20 multiproc_worker_utils.py:222] Worker ready; awaiting tasks
runner-llm-1 | (VllmWorkerProcess pid=258) INFO 12-20 20:30:20 selector.py:120] Using Flash Attention backend.
runner-llm-1 | (VllmWorkerProcess pid=258) INFO 12-20 20:30:20 multiproc_worker_utils.py:222] Worker ready; awaiting tasks
runner-llm-1 | INFO 12-20 20:30:20 utils.py:922] Found nccl from library libnccl.so.2
runner-llm-1 | (VllmWorkerProcess pid=260) INFO 12-20 20:30:20 utils.py:922] Found nccl from library libnccl.so.2
runner-llm-1 | (VllmWorkerProcess pid=259) INFO 12-20 20:30:20 utils.py:922] Found nccl from library libnccl.so.2
runner-llm-1 | (VllmWorkerProcess pid=258) INFO 12-20 20:30:20 utils.py:922] Found nccl from library libnccl.so.2
runner-llm-1 | INFO 12-20 20:30:20 pynccl.py:69] vLLM is using nccl==2.21.5
runner-llm-1 | (VllmWorkerProcess pid=259) INFO 12-20 20:30:20 pynccl.py:69] vLLM is using nccl==2.21.5
runner-llm-1 | (VllmWorkerProcess pid=260) INFO 12-20 20:30:20 pynccl.py:69] vLLM is using nccl==2.21.5
runner-llm-1 | (VllmWorkerProcess pid=258) INFO 12-20 20:30:20 pynccl.py:69] vLLM is using nccl==2.21.5
runner-llm-1 | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] Exception in worker VllmWorkerProcess while processing method init_device.
runner-llm-1 | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] Traceback (most recent call last):
runner-llm-1 | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/executor/multiproc_worker_utils.py", line 230, in _run_worker_process
runner-llm-1 | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] output = executor(*args, **kwargs)
runner-llm-1 | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] ^^^^^^^^^^^^^^^^^^^^^^^^^
runner-llm-1 | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/worker/worker.py", line 148, in init_device
runner-llm-1 | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] init_worker_distributed_environment(self.vllm_config, self.rank,
runner-llm-1 | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/worker/worker.py", line 460, in init_worker_distributed_environment
runner-llm-1 | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
runner-llm-1 | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 1101, in ensure_model_parallel_initialized
runner-llm-1 | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] initialize_model_parallel(tensor_model_parallel_size,
runner-llm-1 | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 1062, in initialize_model_parallel
runner-llm-1 | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] _PP = init_model_parallel_group(group_ranks,
runner-llm-1 | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
runner-llm-1 | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 876, in init_model_parallel_group
runner-llm-1 | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] return GroupCoordinator(
runner-llm-1 | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] ^^^^^^^^^^^^^^^^^
runner-llm-1 | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 216, in __init__
runner-llm-1 | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] self.pynccl_comm = PyNcclCommunicator(
runner-llm-1 | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] ^^^^^^^^^^^^^^^^^^^
runner-llm-1 | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl.py", line 99, in __init__
runner-llm-1 | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
runner-llm-1 | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
runner-llm-1 | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 275, in ncclCommInitRank
runner-llm-1 | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
runner-llm-1 | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 254, in NCCL_CHECK
runner-llm-1 | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] raise RuntimeError(f"NCCL error: {error_str}")
runner-llm-1 | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
runner-llm-1 | 2024-12-20 20:30:20,913 ERROR: Traceback (most recent call last):
runner-llm-1 | File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/starlette/routing.py", line 693, in lifespan
runner-llm-1 | async with self.lifespan_context(app) as maybe_state:
runner-llm-1 | File "/root/.pyenv/versions/3.11.11/lib/python3.11/contextlib.py", line 210, in __aenter__
runner-llm-1 | return await anext(self.gen)
runner-llm-1 | ^^^^^^^^^^^^^^^^^^^^^
runner-llm-1 | File "/app/app/main.py", line 26, in lifespan
runner-llm-1 | app.pipeline = load_pipeline(pipeline, model_id)
runner-llm-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
runner-llm-1 | File "/app/app/main.py", line 68, in load_pipeline
runner-llm-1 | return LLMPipeline(model_id)
runner-llm-1 | ^^^^^^^^^^^^^^^^^^^^^
runner-llm-1 | File "/app/app/pipelines/llm.py", line 146, in __init__
runner-llm-1 | self.engine = AsyncLLMEngine.from_engine_args(engine_args)
runner-llm-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
runner-llm-1 | File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 707, in from_engine_args
runner-llm-1 | engine = cls(
runner-llm-1 | ^^^^
runner-llm-1 | File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 594, in __init__
runner-llm-1 | self.engine = self._engine_class(*args, **kwargs)
runner-llm-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
runner-llm-1 | File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 267, in __init__
runner-llm-1 | super().__init__(*args, **kwargs)
runner-llm-1 | File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 288, in __init__
runner-llm-1 | self.model_executor = executor_class(vllm_config=vllm_config, )
runner-llm-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
runner-llm-1 | File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 180, in __init__
runner-llm-1 | super().__init__(*args, **kwargs)
runner-llm-1 | File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
runner-llm-1 | super().__init__(*args, **kwargs)
runner-llm-1 | File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 36, in __init__
runner-llm-1 | self._init_executor()
runner-llm-1 | File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 82, in _init_executor
runner-llm-1 | self._run_workers("init_device")
runner-llm-1 | File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 157, in _run_workers
runner-llm-1 | driver_worker_output = driver_worker_method(*args, **kwargs)
runner-llm-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
runner-llm-1 | File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/worker/worker.py", line 148, in init_device
runner-llm-1 | init_worker_distributed_environment(self.vllm_config, self.rank,
runner-llm-1 | File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/worker/worker.py", line 460, in init_worker_distributed_environment
runner-llm-1 | ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
runner-llm-1 | File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 1101, in ensure_model_parallel_initialized
runner-llm-1 | initialize_model_parallel(tensor_model_parallel_size,
runner-llm-1 | File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 1062, in initialize_model_parallel
runner-llm-1 | _PP = init_model_parallel_group(group_ranks,
runner-llm-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
runner-llm-1 | File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 876, in init_model_parallel_group
runner-llm-1 | return GroupCoordinator(
runner-llm-1 | ^^^^^^^^^^^^^^^^^
runner-llm-1 | File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 216, in __init__
runner-llm-1 | self.pynccl_comm = PyNcclCommunicator(
runner-llm-1 | ^^^^^^^^^^^^^^^^^^^
runner-llm-1 | File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl.py", line 99, in __init__
runner-llm-1 | self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
runner-llm-1 | ^^^^^^^^^^^^^^^^^^^^^^^^^^^
runner-llm-1 | File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 275, in ncclCommInitRank
runner-llm-1 | self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
runner-llm-1 | File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 254, in NCCL_CHECK
runner-llm-1 | raise RuntimeError(f"NCCL error: {error_str}")
runner-llm-1 | RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
runner-llm-1 |
runner-llm-1 | INFO 12-20 20:30:20 multiproc_worker_utils.py:140] Terminating local vLLM worker processes
runner-llm-1 | (VllmWorkerProcess pid=259) INFO 12-20 20:30:20 multiproc_worker_utils.py:247] Worker exiting
runner-llm-1 | 2024-12-20 20:30:20,931 ERROR: Application startup failed. Exiting.
runner-llm-1 | 2024-12-20 20:30:20,946 - app.utils.hardware - INFO - NVML shutdown successfully.
runner-llm-1 exited with code 0