runner-llm-1  | 2024-12-20 20:30:05,626 INFO:     Started server process [1]
runner-llm-1  | 2024-12-20 20:30:05,626 INFO:     Waiting for application startup.
runner-llm-1  | 2024-12-20 20:30:05,636 - app.utils.hardware - INFO - NVML initialized successfully.
runner-llm-1  | 2024-12-20 20:30:10,160 - app.pipelines.llm - INFO - Initializing LLM pipeline
runner-llm-1  | 2024-12-20 20:30:10,162 - app.pipelines.llm - INFO - Model has 32 attention heads and 32 layers
runner-llm-1  | 2024-12-20 20:30:10,162 - app.pipelines.llm - INFO - Using tensor parallel size: 1
runner-llm-1  | 2024-12-20 20:30:10,162 - app.pipelines.llm - INFO - Using pipeline parallel size: 4
runner-llm-1  | 2024-12-20 20:30:10,162 - app.pipelines.llm - INFO - Total GPUs used: 4
runner-llm-1  | 2024-12-20 20:30:10,169 - app.pipelines.llm - INFO - Using BFloat16 precision
runner-llm-1  | INFO 12-20 20:30:15 config.py:478] This model supports multiple tasks: {'classify', 'score', 'reward', 'embed', 'generate'}. Defaulting to 'generate'.
runner-llm-1  | INFO 12-20 20:30:15 config.py:1216] Defaulting to use mp for distributed inference
runner-llm-1  | WARNING 12-20 20:30:15 config.py:596] Async output processing can not be enabled with pipeline parallel
runner-llm-1  | INFO 12-20 20:30:15 llm_engine.py:249] Initializing an LLM engine (v0.6.5) with config: model='/models/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659', speculative_config=None, tokenizer='/models/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=4, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/models/models--meta-llama--Meta-Llama-3.1-8B-Instruct/snapshots/0e9e39f249a16976918f6564b8830bc894c89659, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=True, chunked_prefill_enabled=False, use_async_output_proc=False, mm_cache_preprocessor=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output"],"candidate_compile_sizes":[],"compile_sizes":[],"capture_sizes":[128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"max_capture_size":128}, use_cached_outputs=False,
runner-llm-1  | WARNING 12-20 20:30:15 multiproc_worker_utils.py:312] Reducing Torch parallelism from 16 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
runner-llm-1  | INFO 12-20 20:30:15 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
runner-llm-1  | INFO 12-20 20:30:16 selector.py:120] Using Flash Attention backend.
runner-llm-1  | (VllmWorkerProcess pid=260) INFO 12-20 20:30:20 selector.py:120] Using Flash Attention backend.
runner-llm-1  | (VllmWorkerProcess pid=260) INFO 12-20 20:30:20 multiproc_worker_utils.py:222] Worker ready; awaiting tasks
runner-llm-1  | (VllmWorkerProcess pid=259) INFO 12-20 20:30:20 selector.py:120] Using Flash Attention backend.
runner-llm-1  | (VllmWorkerProcess pid=259) INFO 12-20 20:30:20 multiproc_worker_utils.py:222] Worker ready; awaiting tasks
runner-llm-1  | (VllmWorkerProcess pid=258) INFO 12-20 20:30:20 selector.py:120] Using Flash Attention backend.
runner-llm-1  | (VllmWorkerProcess pid=258) INFO 12-20 20:30:20 multiproc_worker_utils.py:222] Worker ready; awaiting tasks
runner-llm-1  | INFO 12-20 20:30:20 utils.py:922] Found nccl from library libnccl.so.2
runner-llm-1  | (VllmWorkerProcess pid=260) INFO 12-20 20:30:20 utils.py:922] Found nccl from library libnccl.so.2
runner-llm-1  | (VllmWorkerProcess pid=259) INFO 12-20 20:30:20 utils.py:922] Found nccl from library libnccl.so.2
runner-llm-1  | (VllmWorkerProcess pid=258) INFO 12-20 20:30:20 utils.py:922] Found nccl from library libnccl.so.2
runner-llm-1  | INFO 12-20 20:30:20 pynccl.py:69] vLLM is using nccl==2.21.5
runner-llm-1  | (VllmWorkerProcess pid=259) INFO 12-20 20:30:20 pynccl.py:69] vLLM is using nccl==2.21.5
runner-llm-1  | (VllmWorkerProcess pid=260) INFO 12-20 20:30:20 pynccl.py:69] vLLM is using nccl==2.21.5
runner-llm-1  | (VllmWorkerProcess pid=258) INFO 12-20 20:30:20 pynccl.py:69] vLLM is using nccl==2.21.5
runner-llm-1  | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] Exception in worker VllmWorkerProcess while processing method init_device.
runner-llm-1  | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] Traceback (most recent call last):
runner-llm-1  | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236]   File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/executor/multiproc_worker_utils.py", line 230, in _run_worker_process
runner-llm-1  | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236]     output = executor(*args, **kwargs)
runner-llm-1  | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236]              ^^^^^^^^^^^^^^^^^^^^^^^^^
runner-llm-1  | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236]   File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/worker/worker.py", line 148, in init_device
runner-llm-1  | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236]     init_worker_distributed_environment(self.vllm_config, self.rank,
runner-llm-1  | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236]   File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/worker/worker.py", line 460, in init_worker_distributed_environment
runner-llm-1  | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236]     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
runner-llm-1  | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236]   File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 1101, in ensure_model_parallel_initialized
runner-llm-1  | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236]     initialize_model_parallel(tensor_model_parallel_size,
runner-llm-1  | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236]   File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 1062, in initialize_model_parallel
runner-llm-1  | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236]     _PP = init_model_parallel_group(group_ranks,
runner-llm-1  | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236]           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
runner-llm-1  | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236]   File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 876, in init_model_parallel_group
runner-llm-1  | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236]     return GroupCoordinator(
runner-llm-1  | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236]            ^^^^^^^^^^^^^^^^^
runner-llm-1  | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236]   File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 216, in __init__
runner-llm-1  | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236]     self.pynccl_comm = PyNcclCommunicator(
runner-llm-1  | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236]                        ^^^^^^^^^^^^^^^^^^^
runner-llm-1  | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236]   File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl.py", line 99, in __init__
runner-llm-1  | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236]     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
runner-llm-1  | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236]                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
runner-llm-1  | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236]   File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 275, in ncclCommInitRank
runner-llm-1  | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236]     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
runner-llm-1  | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236]   File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 254, in NCCL_CHECK
runner-llm-1  | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236]     raise RuntimeError(f"NCCL error: {error_str}")
runner-llm-1  | (VllmWorkerProcess pid=259) ERROR 12-20 20:30:20 multiproc_worker_utils.py:236] RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
runner-llm-1  | 2024-12-20 20:30:20,913 ERROR:    Traceback (most recent call last):
runner-llm-1  |   File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/starlette/routing.py", line 693, in lifespan
runner-llm-1  |     async with self.lifespan_context(app) as maybe_state:
runner-llm-1  |   File "/root/.pyenv/versions/3.11.11/lib/python3.11/contextlib.py", line 210, in __aenter__
runner-llm-1  |     return await anext(self.gen)
runner-llm-1  |            ^^^^^^^^^^^^^^^^^^^^^
runner-llm-1  |   File "/app/app/main.py", line 26, in lifespan
runner-llm-1  |     app.pipeline = load_pipeline(pipeline, model_id)
runner-llm-1  |                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
runner-llm-1  |   File "/app/app/main.py", line 68, in load_pipeline
runner-llm-1  |     return LLMPipeline(model_id)
runner-llm-1  |            ^^^^^^^^^^^^^^^^^^^^^
runner-llm-1  |   File "/app/app/pipelines/llm.py", line 146, in __init__
runner-llm-1  |     self.engine = AsyncLLMEngine.from_engine_args(engine_args)
runner-llm-1  |                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
runner-llm-1  |   File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 707, in from_engine_args
runner-llm-1  |     engine = cls(
runner-llm-1  |              ^^^^
runner-llm-1  |   File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 594, in __init__
runner-llm-1  |     self.engine = self._engine_class(*args, **kwargs)
runner-llm-1  |                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
runner-llm-1  |   File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 267, in __init__
runner-llm-1  |     super().__init__(*args, **kwargs)
runner-llm-1  |   File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 288, in __init__
runner-llm-1  |     self.model_executor = executor_class(vllm_config=vllm_config, )
runner-llm-1  |                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
runner-llm-1  |   File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 180, in __init__
runner-llm-1  |     super().__init__(*args, **kwargs)
runner-llm-1  |   File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 26, in __init__
runner-llm-1  |     super().__init__(*args, **kwargs)
runner-llm-1  |   File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 36, in __init__
runner-llm-1  |     self._init_executor()
runner-llm-1  |   File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 82, in _init_executor
runner-llm-1  |     self._run_workers("init_device")
runner-llm-1  |   File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 157, in _run_workers
runner-llm-1  |     driver_worker_output = driver_worker_method(*args, **kwargs)
runner-llm-1  |                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
runner-llm-1  |   File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/worker/worker.py", line 148, in init_device
runner-llm-1  |     init_worker_distributed_environment(self.vllm_config, self.rank,
runner-llm-1  |   File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/worker/worker.py", line 460, in init_worker_distributed_environment
runner-llm-1  |     ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
runner-llm-1  |   File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 1101, in ensure_model_parallel_initialized
runner-llm-1  |     initialize_model_parallel(tensor_model_parallel_size,
runner-llm-1  |   File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 1062, in initialize_model_parallel
runner-llm-1  |     _PP = init_model_parallel_group(group_ranks,
runner-llm-1  |           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
runner-llm-1  |   File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 876, in init_model_parallel_group
runner-llm-1  |     return GroupCoordinator(
runner-llm-1  |            ^^^^^^^^^^^^^^^^^
runner-llm-1  |   File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 216, in __init__
runner-llm-1  |     self.pynccl_comm = PyNcclCommunicator(
runner-llm-1  |                        ^^^^^^^^^^^^^^^^^^^
runner-llm-1  |   File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl.py", line 99, in __init__
runner-llm-1  |     self.comm: ncclComm_t = self.nccl.ncclCommInitRank(
runner-llm-1  |                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
runner-llm-1  |   File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 275, in ncclCommInitRank
runner-llm-1  |     self.NCCL_CHECK(self._funcs["ncclCommInitRank"](ctypes.byref(comm),
runner-llm-1  |   File "/root/.pyenv/versions/3.11.11/lib/python3.11/site-packages/vllm/distributed/device_communicators/pynccl_wrapper.py", line 254, in NCCL_CHECK
runner-llm-1  |     raise RuntimeError(f"NCCL error: {error_str}")
runner-llm-1  | RuntimeError: NCCL error: unhandled system error (run with NCCL_DEBUG=INFO for details)
runner-llm-1  |
runner-llm-1  | INFO 12-20 20:30:20 multiproc_worker_utils.py:140] Terminating local vLLM worker processes
runner-llm-1  | (VllmWorkerProcess pid=259) INFO 12-20 20:30:20 multiproc_worker_utils.py:247] Worker exiting
runner-llm-1  | 2024-12-20 20:30:20,931 ERROR:    Application startup failed. Exiting.
runner-llm-1  | 2024-12-20 20:30:20,946 - app.utils.hardware - INFO - NVML shutdown successfully.
runner-llm-1 exited with code 0