Error with vLLM docker container vllm/vllm-openai:v0.3.0 #2773

Closed
sarahwooders opened this issue Feb 5, 2024 · 2 comments

sarahwooders commented Feb 5, 2024

I am trying to deploy vLLM on Kubernetes with the following deployment YAML:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-deployment
spec:
  replicas: 1 # You can scale this up to 10
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      tolerations:
       - key: cloud.google.com/gke-spot
         operator: Equal
         value: "true"
         effect: NoSchedule
      nodeSelector:
        cloud.google.com/gke-spot: "true"
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      containers:
      - name: vllm-container
        image: vllm/vllm-openai:latest
        args: ["--model", "ehartford/dolphin-2.5-mixtral-8x7b", "--host", "0.0.0.0", "--tensor-parallel-size", "8"]
        # Check container health by probing the /health endpoint on port 8000
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 5
          periodSeconds: 1
        ports:
        - containerPort: 8000
          name: vllm-port
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          value: ${HUGGING_FACE_HUB_TOKEN}
        resources:
          limits:
            nvidia.com/gpu: 8 # Requesting eight GPUs per pod
        volumeMounts:
          - mountPath: /dev/shm
            name: dshm

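Note that plain kubectl apply does not expand ${HUGGING_FACE_HUB_TOKEN}-style placeholders, so the token has to be substituted into the manifest first. A minimal sketch, assuming the manifest above is saved as vllm-deployment.yaml (the filename and token value are illustrative):

export HUGGING_FACE_HUB_TOKEN=hf_xxx                   # placeholder token value
envsubst < vllm-deployment.yaml | kubectl apply -f -   # substitute the token, then apply

# Watch the pod come up and tail the container logs
kubectl get pods -l app=vllm
kubectl logs -l app=vllm -f
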
Although this deployment worked fine with previous versions of the vLLM Docker container, on the latest version the following error occurs after the model finishes loading:

File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main0:00, 92.7MB/s]
    return _run_code(code, main_globals, None,   | 1.43G/4.22G [00:39<01:01, 45.1MB/s]
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code2G [01:21<00:00, 71.5MB/s]
    exec(code, run_globals)
  File "/workspace/vllm/entrypoints/openai/api_server.py", line 217, in <module>
    engine = AsyncLLMEngine.from_engine_args(engine_args)
  File "/workspace/vllm/engine/async_llm_engine.py", line 623, in from_engine_args
    engine = cls(parallel_config.worker_use_ray,
  File "/workspace/vllm/engine/async_llm_engine.py", line 319, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/workspace/vllm/engine/async_llm_engine.py", line 364, in _init_engine
    return engine_class(*args, **kwargs)
  File "/workspace/vllm/engine/llm_engine.py", line 114, in __init__
    self._init_cache()
  File "/workspace/vllm/engine/llm_engine.py", line 308, in _init_cache
    num_blocks = self._run_workers(
  File "/workspace/vllm/engine/llm_engine.py", line 983, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/vllm/worker/worker.py", line 116, in profile_num_available_blocks
    self.model_runner.profile_run()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/vllm/worker/model_runner.py", line 599, in profile_run
    self.execute_model(seqs, kv_caches)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/vllm/worker/model_runner.py", line 534, in execute_model
    hidden_states = model_executable(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/vllm/model_executor/models/mixtral.py", line 347, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/vllm/model_executor/models/mixtral.py", line 319, in forward
    hidden_states, residual = layer(positions, hidden_states,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/vllm/model_executor/models/mixtral.py", line 283, in forward
    hidden_states = self.block_sparse_moe(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/vllm/model_executor/models/mixtral.py", line 137, in forward
    final_hidden_states = fused_moe(hidden_states,
  File "/workspace/vllm/model_executor/layers/fused_moe.py", line 270, in fused_moe
    invoke_fused_moe_kernel(hidden_states, w1, intermediate_cache1,
  File "/workspace/vllm/model_executor/layers/fused_moe.py", line 187, in invoke_fused_moe_kernel
    fused_moe_kernel[grid](
  File "<string>", line 63, in fused_moe_kernel
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/compiler.py", line 425, in compile
    so_path = make_stub(name, signature, constants)
  File "/usr/local/lib/python3.10/dist-packages/triton/compiler/make_launcher.py", line 39, in make_stub
    so = _build(name, src_path, tmpdir)
  File "/usr/local/lib/python3.10/dist-packages/triton/common/build.py", line 61, in _build
    cuda_lib_dirs = libcuda_dirs()
  File "/usr/local/lib/python3.10/dist-packages/triton/common/build.py", line 30, in libcuda_dirs
    assert any(os.path.exists(os.path.join(path, 'libcuda.so')) for path in dirs), msg
AssertionError: libcuda.so cannot found!
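
The assertion is raised by Triton's libcuda_dirs, which searches for libcuda.so (typically via ldconfig). A quick way to check whether the driver library is visible inside the container (a diagnostic sketch; the commands are illustrative and assume the NVIDIA container runtime is enabled):

docker run --rm --gpus all --entrypoint bash vllm/vllm-openai:v0.3.0 \
    -c "ldconfig -p | grep libcuda; find / -name 'libcuda.so*' 2>/dev/null"
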
@simon-mo simon-mo self-assigned this Feb 5, 2024

alsichcan commented Feb 13, 2024

I am also having this issue with the Docker deployment of vLLM.

I pulled the v0.3.0 image from Docker Hub and created the container with the following options:

docker run -d --name graph_llm \
    --runtime nvidia \
    --gpus '"device=1,2"' \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model mistralai/Mistral-7B-Instruct-v0.2 \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 2 \
    --engine-use-ray

It worked just fine with vLLM v0.2.7, with the following Docker logs:

INFO 02-13 13:10:55 api_server.py:727] args: Namespace(host=None, port=8000, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, model='mistralai/Mistral-7B-Instruct-v0.2', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=False, max_context_len_to_capture=8192, engine_use_ray=True, disable_log_requests=False, max_log_len=None)
2024-02-13 13:10:57,002 INFO worker.py:1724 -- Started a local Ray instance.
(_AsyncLLMEngine pid=3644) Using blocking ray.get inside async actor. This blocks the event loop. Please use `await` on object ref with asyncio.gather if you want to yield execution to the event loop instead.
(_AsyncLLMEngine pid=3644) [W CUDAGraph.cpp:145] Warning: Waiting for pending NCCL work to finish before starting graph capture. (function operator())
(_AsyncLLMEngine pid=3644) INFO 02-13 13:10:59 llm_engine.py:70] Initializing an LLM engine with config: model='mistralai/Mistral-7B-Instruct-v0.2', tokenizer='mistralai/Mistral-7B-Instruct-v0.2', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=None, enforce_eager=False, seed=0)
(_AsyncLLMEngine pid=3644) INFO 02-13 13:11:10 llm_engine.py:275] # GPU blocks: 33269, # CPU blocks: 4096
(_AsyncLLMEngine pid=3644) INFO 02-13 13:11:11 model_runner.py:501] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(_AsyncLLMEngine pid=3644) INFO 02-13 13:11:11 model_runner.py:505] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
(_AsyncLLMEngine pid=3644) INFO 02-13 13:11:46 model_runner.py:547] Graph capturing finished in 35 secs.
(RayWorkerVllm pid=3764) INFO 02-13 13:11:11 model_runner.py:501] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(RayWorkerVllm pid=3764) INFO 02-13 13:11:11 model_runner.py:505] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode.
INFO 02-13 13:11:46 api_server.py:121] Using default chat template:
INFO 02-13 13:11:46 api_server.py:121] {{ bos_token }}{% for message in messages %}{% if (message['role'] == 'user') != (loop.index0 % 2 == 0) %}{{ raise_exception('Conversation roles must alternate user/assistant/user/assistant/...') }}{% endif %}{% if message['role'] == 'user' %}{{ '[INST] ' + message['content'] + ' [/INST]' }}{% elif message['role'] == 'assistant' %}{{ message['content'] + eos_token}}{% else %}{{ raise_exception('Only user and assistant roles are supported!') }}{% endif %}{% endfor %}
INFO:     Started server process [1]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)

However, in the same environment, v0.3.0 raises a CUDA error with the following Docker logs:

2024-02-13 13:15:48,622 INFO worker.py:1724 -- Started a local Ray instance.
(_AsyncLLMEngine pid=3642) Using blocking ray.get inside async actor. This blocks the event loop. Please use `await` on object ref with asyncio.gather if you want to yield execution to the event loop instead.
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/workspace/vllm/entrypoints/openai/api_server.py", line 218, in <module>
    openai_serving_chat = OpenAIServingChat(engine, served_model,
  File "/workspace/vllm/entrypoints/openai/serving_chat.py", line 26, in __init__
    super().__init__(engine=engine, served_model=served_model)
  File "/workspace/vllm/entrypoints/openai/serving_engine.py", line 34, in __init__
    asyncio.run(self._post_init())
  File "/usr/lib/python3.10/asyncio/runners.py", line 44, in run
    return loop.run_until_complete(main)
  File "/usr/lib/python3.10/asyncio/base_events.py", line 649, in run_until_complete
    return future.result()
  File "/workspace/vllm/entrypoints/openai/serving_engine.py", line 37, in _post_init
    engine_model_config = await self.engine.get_model_config()
  File "/workspace/vllm/engine/async_llm_engine.py", line 607, in get_model_config
    return await self.engine.get_model_config.remote()
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::_AsyncLLMEngine.__init__() (pid=3642, ip=172.17.0.3, actor_id=7c13edc0c104d61a0fff650901000000, repr=<vllm.engine.async_llm_engine._AsyncLLMEngine object at 0x7fb6b4440c10>)
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/usr/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/workspace/vllm/engine/llm_engine.py", line 114, in __init__
    self._init_cache()
  File "/workspace/vllm/engine/llm_engine.py", line 345, in _init_cache
    self._run_workers("warm_up_model")
  File "/workspace/vllm/engine/llm_engine.py", line 983, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/workspace/vllm/worker/worker.py", line 148, in warm_up_model
    self.model_runner.capture_model(self.gpu_cache)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/workspace/vllm/worker/model_runner.py", line 685, in capture_model
    graph_runner.capture(
  File "/workspace/vllm/worker/model_runner.py", line 732, in capture
    hidden_states = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/vllm/model_executor/models/mistral.py", line 303, in forward
    hidden_states = self.model(input_ids, positions, kv_caches,
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/vllm/model_executor/models/mistral.py", line 256, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/vllm/model_executor/models/mistral.py", line 214, in forward
    hidden_states = self.mlp(hidden_states)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/vllm/model_executor/models/mistral.py", line 77, in forward
    gate_up, _ = self.gate_up_proj(x)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/workspace/vllm/model_executor/layers/linear.py", line 211, in forward
    output_parallel = self.linear_method.apply_weights(
  File "/workspace/vllm/model_executor/layers/linear.py", line 72, in apply_weights
    return F.linear(x, weight, bias)
RuntimeError: CUDA error: invalid device function
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

My server has an RTX 6000 Ada Generation (48 GB), which is supported by CUDA 11.8 and 12.0–12.4, so I don't think the GPU itself is the issue.
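
For what it's worth, "invalid device function" usually means a CUDA kernel was compiled for a different compute capability than the running GPU. A quick check (illustrative commands, run against the published image) compares the GPU's compute capability with the architectures the bundled PyTorch was built for:

docker run --rm --gpus all --entrypoint python3 vllm/vllm-openai:v0.3.0 \
    -c "import torch; print(torch.cuda.get_device_capability(0), torch.cuda.get_arch_list())"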

I suspect this is a compatibility issue with Ray, since the Ray version requirement was updated on Jan 29, 2024 in commit 7d64841 following issue #2636.
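
If it helps narrow this down, the Ray version bundled in each image tag can be compared directly (a sketch; both tags are the ones published on Docker Hub):

docker run --rm --entrypoint python3 vllm/vllm-openai:v0.2.7 -c "import ray; print(ray.__version__)"
docker run --rm --entrypoint python3 vllm/vllm-openai:v0.3.0 -c "import ray; print(ray.__version__)"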

I'd really appreciate it if you could take a look at this issue along with @sarahwooders's issue.
Thanks a bunch for your help!


hmellor commented Aug 28, 2024

Solved by #2845

@hmellor hmellor closed this as completed Aug 28, 2024