Error with vLLM docker container vllm/vllm-openai:v0.3.0 #2773

Closed
sarahwooders opened this issue Feb 5, 2024 · 2 comments

sarahwooders commented Feb 5, 2024

I am trying to deploy vLLM on Kubernetes with the following deployment YAML:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
spec:
  replicas: 1 # You can scale this up to 10
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      tolerations:
       - key: cloud.google.com/gke-spot
         operator: Equal
         value: "true"
         effect: NoSchedule
      nodeSelector:
        cloud.google.com/gke-spot: "true"
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      containers:
      - name: vllm-container
        image: vllm/vllm-openai:latest
        args: ["--model", "ehartford/dolphin-2.5-mixtral-8x7b", "--host", "0.0.0.0", "--tensor-parallel-size", "8"]
        # Check container health by probing the /health endpoint on port 8000
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 1
        ports:
        - containerPort: 8000
          name: vllm-port
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          value: ${HUGGING_FACE_HUB_TOKEN}
        resources:
          limits:
            nvidia.com/gpu: 8 # Requesting eight GPUs per pod
        volumeMounts:
          - mountPath: /dev/shm
            name: dshm

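Note that plain kubectl apply does not expand ${HUGGING_FACE_HUB_TOKEN}-style placeholders, so the token has to be substituted into the manifest first. A minimal sketch, assuming the manifest above is saved as vllm-deployment.yaml (the filename and token value are illustrative):

export HUGGING_FACE_HUB_TOKEN=hf_xxx                   # placeholder token value
envsubst < vllm-deployment.yaml | kubectl apply -f -   # substitute the token, then apply

# Watch the pod come up and tail the container logs
kubectl get pods -l app=vllm
kubectl logs -l app=vllm -f
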
Although this deployment worked fine with previous versions of the vLLM Docker container, on the latest version the following error occurs after the model finishes loading:

File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main0:00, 92.7MB/s]
    return _run_code(code, main_globals, None,   | 1.43G/4.22G [00:39<01:01, 45.1MB/s]
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code2G [01:21<00:00, 71.5MB/s]
    exec(code, run_globals)
  File "/workspace/vllm/entrypoints/openai/api_server.py", line 217, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/workspace/vllm/engine/async_llm_engine.py", line 623, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/workspace/vllm/engine/async_llm_engine.py", line 319, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/workspace/vllm/engine/async_llm_engine.py", line 364, in _init_engine
    return engine_class(*args, **kwargs)
  File "/workspace/vllm/engine/llm_engine.py", line 114, in __init__
    self._init_cache()
  File "/workspace/vllm/engine/llm_engine.py", line 308, in _init_cache
    num_blocks = self._run_workers(
  File "/workspace/vllm/engine/llm_engine.py", line 983, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/vllm/worker/worker.py", line 116, in profile_num_available_blocks
    self.model_runner.profile_run()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/vllm/worker/model_runner.py", line 599, in profile_run
    self.execute_model(seqs, kv_caches)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/vllm/worker/model_runner.py", line 534, in execute_model
    hidden_states = model_executable(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/vllm/model_executor/models/mixtral.py", line 347, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/vllm/model_executor/models/mixtral.py", line 319, in forward
    hidden_states, residual = layer(positions, hidden_states,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/vllm/model_executor/models/mixtral.py", line 283, in forward
    hidden_states = self.block_sparse_moe(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/vllm/model_executor/models/mixtral.py", line 137, in forward
    final_hidden_states = fused_moe(hidden_states,
  File "/workspace/vllm/model_executor/layers/fused_moe.py", line 270, in fused_moe
    invoke_fused_moe_kernel(hidden_states, w1, intermediate_cache1,
  File "/workspace/vllm/model_executor/layers/fused_moe.py", line 187, in invoke_fused_moe_kernel
    fused_moe_kernel[grid](
  File "<string>", line 63, in fused_moe_kernel
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 425, in compile
    so_path = make_stub(name, signature, constants)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/make_launcher.py", line 39, in make_stub
    so = _build(name, src_path, tmpdir)
  File "/usr/local/lib/python3.10/dist-packages/triton/common/build.py", line 61, in _build
    cuda_lib_dirs = libcuda_dirs()
  File "/usr/local/lib/python3.10/dist-packages/triton/common/build.py", line 30, in libcuda_dirs
    assert any(os.path.exists(os.path.join(path, 'libcuda.so')) for path in dirs), msg
AssertionError: libcuda.so cannot found!
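
The assertion is raised by Triton's libcuda_dirs, which searches for libcuda.so (typically via ldconfig). A quick way to check whether the driver library is visible inside the container (a diagnostic sketch; the commands are illustrative and assume the NVIDIA container runtime is enabled):

docker run --rm --gpus all --entrypoint bash vllm/vllm-openai:v0.3.0 \
    -c "ldconfig -p | grep libcuda; find / -name 'libcuda.so*' 2>/dev/null"
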
@simon-mo simon-mo self-assigned this Feb 5, 2024

alsichcan commented Feb 13, 2024

I am also having this issue with the Docker deployment of vLLM.

I pulled the v0.3.0 image from Docker Hub and created the container with the following options:

docker run -d --name graph_llm \
    --runtime nvidia \
    --gpus '"device=1,2"' \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 2 \
    --engine-use-ray

It worked just fine with vLLM v0.2.7, with the following Docker logs:

INFO 02-13 13:10:55 api_server.py:727] args: Namespace(host=None, port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, model='mistralai/Mistral-7B-Instruct-v0.2', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, engine_use_ray=True, disable_log_requests=False, max_log_len=None)
2024-02-13 13:10:57,002 INFO worker.py:1724 -- Started a local Ray instance.
(_AsyncLLMEngine pid=3644) Using blocking ray.get inside async actor. This blocks the event loop. Please use `await` on object ref with asyncio.gather if you want to yield execution to the event loop instead.
(_AsyncLLMEngine pid=3644) [W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
(_AsyncLLMEngine pid=3644) INFO 02-13 13:10:59 llm_engine.py:70] Initializing an LLM engine with config: model='mistralai/Mistral-7B-Instruct-v0.2', tokenizer='mistralai/Mistral-7B-Instruct-v0.2', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=None, enforce_eager=False, seed=0)
(_AsyncLLMEngine pid=3644) INFO 02-13 13:11:10 llm_engine.py:275] # GPU blocks: 33269, # CPU blocks: 4096
(_AsyncLLMEngine pid=3644) INFO 02-13 13:11:11 model_runner.py:501] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(_AsyncLLMEngine pid=3644) INFO 02-13 13:11:11 model_runner.py:505] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
(_AsyncLLMEngine pid=3644) INFO 02-13 13:11:46 model_runner.py:547] Graph capturing finished in 35 secs.
(RayWorkerVllm pid=3764) INFO 02-13 13:11:11 model_runner.py:501] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(RayWorkerVllm pid=3764) INFO 02-13 13:11:11 model_runner.py:505] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
INFO 02-13 13:11:46 api_server.py:121] Using default chat template:
INFO 02-13 13:11:46 api_server.py:121] {{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

However, in the same environment, v0.3.0 raises a CUDA error with the following Docker logs:

2024-02-13 13:15:48,622 INFO worker.py:1724 -- Started a local Ray instance.
(_AsyncLLMEngine pid=3642) Using blocking ray.get inside async actor. This blocks the event loop. Please use `await` on object ref with asyncio.gather if you want to yield execution to the event loop instead.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/vllm/entrypoints/openai/api_server.py", line 218, in <module>
    openai_serving_chat = OpenAIServingChat(engine, served_model,
  File "/workspace/vllm/entrypoints/openai/serving_chat.py", line 26, in __init__
    super().__init__(engine=engine, served_model=served_model)
  File "/workspace/vllm/entrypoints/openai/serving_engine.py", line 34, in __init__
    asyncio.run(self._post_init())
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/workspace/vllm/entrypoints/openai/serving_engine.py", line 37, in _post_init
    engine_model_config = await self.engine.get_model_config()
  File "/workspace/vllm/engine/async_llm_engine.py", line 607, in get_model_config
    return await self.engine.get_model_config.remote()
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::_AsyncLLMEngine.__init__() (pid=3642, ip=172.17.0.3, actor_id=7c13edc0c104d61a0fff650901000000, repr=<vllm.engine.async_llm_engine._AsyncLLMEngine object at 0x7fb6b4440c10>)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/workspace/vllm/engine/llm_engine.py", line 114, in __init__
    self._init_cache()
  File "/workspace/vllm/engine/llm_engine.py", line 345, in _init_cache
    self._run_workers("warm_up_model")
  File "/workspace/vllm/engine/llm_engine.py", line 983, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/workspace/vllm/worker/worker.py", line 148, in warm_up_model
    self.model_runner.capture_model(self.gpu_cache)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/vllm/worker/model_runner.py", line 685, in capture_model
    graph_runner.capture(
  File "/workspace/vllm/worker/model_runner.py", line 732, in capture
    hidden_states = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/vllm/model_executor/models/mistral.py", line 303, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/vllm/model_executor/models/mistral.py", line 256, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/vllm/model_executor/models/mistral.py", line 214, in forward
    hidden_states = self.mlp(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/vllm/model_executor/models/mistral.py", line 77, in forward
    gate_up, _ = self.gate_up_proj(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/vllm/model_executor/layers/linear.py", line 211, in forward
    output_parallel = self.linear_method.apply_weights(
  File "/workspace/vllm/model_executor/layers/linear.py", line 72, in apply_weights
    return F.linear(x, weight, bias)
RuntimeError: CUDA error: invalid device function
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

My server has an RTX 6000 Ada Generation (48 GB), which is supported by CUDA 11.8 and 12.0–12.4, so I don't think the GPU itself is the issue.
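
For what it's worth, "invalid device function" usually means a CUDA kernel was compiled for a different compute capability than the running GPU. A quick check (illustrative commands, run against the published image) compares the GPU's compute capability with the architectures the bundled PyTorch was built for:

docker run --rm --gpus all --entrypoint python3 vllm/vllm-openai:v0.3.0 \
    -c "import torch; print(torch.cuda.get_device_capability(0), torch.cuda.get_arch_list())"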

I suspect this is a compatibility issue with Ray, since the Ray version requirement was updated on Jan 29, 2024 in commit 7d64841 following issue #2636.
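
If it helps narrow this down, the Ray version bundled in each image tag can be compared directly (a sketch; both tags are the ones published on Docker Hub):

docker run --rm --entrypoint python3 vllm/vllm-openai:v0.2.7 -c "import ray; print(ray.__version__)"
docker run --rm --entrypoint python3 vllm/vllm-openai:v0.3.0 -c "import ray; print(ray.__version__)"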

I'd really appreciate it if you could take a look at this issue along with @sarahwooders's issue.
Thanks a bunch for your help!


hmellor commented Aug 28, 2024

Solved by #2845

@hmellor hmellor closed this as completed Aug 28, 2024