[Feature][WIP] Prototype of vLLM execution on Intel GPU devices via SYCL. #2378

Closed
wants to merge 34 commits

Conversation

@jikunshang (Contributor) commented Jan 8, 2024

This is a follow-up to PR #1028.
We will refactor and split this into several smaller PRs later. It mainly contains the items below:

  • Introduce a new device type: XPU.
  • model_executor updates (adapting models/layers).
  • Kernel dispatch based on device type (see the sketch after this list).
  • Adapt vLLM core/engine/scheduler/cache manager.
  • Compile/build scripts.
  • SYCL kernel implementations.
  • Testing.
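For illustration, device-type based dispatch means routing each op to the backend that matches the tensor's device. This is a minimal sketch only; the module name vllm._C_xpu and the exact call signature are assumptions for illustration, not the code in this PR.

# Minimal sketch of device-type based kernel dispatch.
# NOTE: vllm._C_xpu is a hypothetical module name; in this PR the SYCL ops
# are built into the extension when VLLM_BUILD_XPU_OPS=1.
import torch

def rms_norm(out: torch.Tensor, x: torch.Tensor,
             weight: torch.Tensor, eps: float) -> None:
    if x.device.type == "xpu":
        from vllm import _C_xpu as ops  # SYCL kernels (hypothetical name)
    else:
        from vllm import _C as ops      # default CUDA kernels
    ops.rms_norm(out, x, weight, eps)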

prepare env

Make sure the hardware and GPU driver are ready.
Install the oneAPI Base Toolkit.

how to build

# Intel torch/IPEX wheels currently target Python 3.10; adjust if you use another version
pip install -r requirements-build-xpu.txt
# source oneapi_2024.0
export VLLM_BUILD_XPU_OPS=1 
pip install --no-build-isolation -v -e .

# install other runtime dependencies
pip install --no-deps xformers
wget https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/oneccl_bind_pt-2.1.100%2Bxpu-cp310-cp310-linux_x86_64.whl 
pip install oneccl_bind_pt-2.1.100+xpu-cp310-cp310-linux_x86_64.whl

how to run unit tests

pytest tests/kernels/test_layernorm.py::test_rms_norm_xpu
pytest tests/kernels/test_attention.py::test_paged_attention_xpu
pytest tests/kernels/test_pos_encoding.py::test_rotary_embedding_xpu
pytest tests/kernels/test_cache.py::test_reshape_and_cache_xpu
pytest tests/kernels/test_cache.py::test_copy_blocks_xpu
pytest tests/kernels/test_activation.py::test_silu_and_mul_xpu

how to run E2E test

# change the model path if needed
python3 examples/offline_inference.py
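For reference, the offline inference example is essentially the following; this is a minimal sketch with the model path replaced by a placeholder (point it at a locally available model):

# Sketch of examples/offline_inference.py with a local model path.
# "/path/to/local/model" is a placeholder; change it to a downloaded model.
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="/path/to/local/model")
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)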

setup.py (outdated review comment)
BUILD_CPU_ONLY = os.getenv('VLLM_BUILD_CPU_ONLY', "0") == "1"
BUILD_XPU_OPS = os.getenv('VLLM_BUILD_XPU_OPS', "0") == "1"
if BUILD_XPU_OPS:
    # Only import the SYCL/DPC++ build helpers when building the XPU ops.
    from xpu_extension.xpu_cpp_extension import DPCPPExtension, DpcppBuildExtension
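For context, a sketch of how these helpers might then be wired into the build; this is illustrative only, it assumes DPCPPExtension mirrors torch's CppExtension signature, and the extension name and source list are placeholders rather than the actual contents of this PR.

# Illustrative sketch, not the actual setup.py in this PR.
import os
from setuptools import setup

BUILD_XPU_OPS = os.getenv("VLLM_BUILD_XPU_OPS", "0") == "1"

ext_modules = []
cmdclass = {}
if BUILD_XPU_OPS:
    from xpu_extension.xpu_cpp_extension import DPCPPExtension, DpcppBuildExtension

    ext_modules.append(
        DPCPPExtension(
            "vllm._C",
            sources=["csrc/xpu/attention_xpu.cpp", "csrc/xpu/pybind.cpp"],  # placeholder list
            extra_compile_args=["-fsycl", "-DVLLM_BUILD_XPU_OPS"],
        )
    )
    cmdclass["build_ext"] = DpcppBuildExtension

setup(name="vllm", ext_modules=ext_modules, cmdclass=cmdclass)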
@jikunshang (Contributor Author)


Thanks for your comments. Already updated; please check the latest change.

@maktukmak commented Mar 1, 2024

When I try to install the package via pip install --no-build-isolation -v -e ., I get the following error:

2024-03-01T21:09:47,860 Building wheels for collected packages: vllm
2024-03-01T21:09:47,861   Created temporary directory: /tmp/pip-wheel-1bgol7v1
2024-03-01T21:09:47,861   Destination directory: /tmp/pip-wheel-1bgol7v1
2024-03-01T21:09:47,861   Running command Building editable for vllm (pyproject.toml)
2024-03-01T21:09:50,812   /home/sdp/.conda/envs/vllm_xpu/lib/python3.10/site-packages/intel_extension_for_pytorch/xpu/cpp_extension.py:1564: UserWarning: This extension has static linked onednn library. Please attaction to                 that, this path of onednn version maybe not match with the built-in version.
2024-03-01T21:09:50,812     warnings.warn(
2024-03-01T21:09:50,827   2024-03-01 21:09:50,827 - root - INFO - running editable_wheel
2024-03-01T21:09:50,832   2024-03-01 21:09:50,832 - root - INFO - creating /tmp/pip-wheel-1bgol7v1/.tmp-5w_l9a85/vllm.egg-info
2024-03-01T21:09:50,834   2024-03-01 21:09:50,834 - root - INFO - writing /tmp/pip-wheel-1bgol7v1/.tmp-5w_l9a85/vllm.egg-info/PKG-INFO
2024-03-01T21:09:50,834   2024-03-01 21:09:50,834 - root - INFO - writing dependency_links to /tmp/pip-wheel-1bgol7v1/.tmp-5w_l9a85/vllm.egg-info/dependency_links.txt
2024-03-01T21:09:50,834   2024-03-01 21:09:50,834 - root - INFO - writing requirements to /tmp/pip-wheel-1bgol7v1/.tmp-5w_l9a85/vllm.egg-info/requires.txt
2024-03-01T21:09:50,835   2024-03-01 21:09:50,834 - root - INFO - writing top-level names to /tmp/pip-wheel-1bgol7v1/.tmp-5w_l9a85/vllm.egg-info/top_level.txt
2024-03-01T21:09:50,835   2024-03-01 21:09:50,835 - root - INFO - writing manifest file '/tmp/pip-wheel-1bgol7v1/.tmp-5w_l9a85/vllm.egg-info/SOURCES.txt'
2024-03-01T21:09:50,862   2024-03-01 21:09:50,862 - root - INFO - reading manifest file '/tmp/pip-wheel-1bgol7v1/.tmp-5w_l9a85/vllm.egg-info/SOURCES.txt'
2024-03-01T21:09:50,862   2024-03-01 21:09:50,862 - root - INFO - reading manifest template 'MANIFEST.in'
2024-03-01T21:09:50,863   2024-03-01 21:09:50,863 - root - INFO - adding license file 'LICENSE'
2024-03-01T21:09:50,864   2024-03-01 21:09:50,864 - root - INFO - writing manifest file '/tmp/pip-wheel-1bgol7v1/.tmp-5w_l9a85/vllm.egg-info/SOURCES.txt'
2024-03-01T21:09:50,864   2024-03-01 21:09:50,864 - root - INFO - creating '/tmp/pip-wheel-1bgol7v1/.tmp-5w_l9a85/vllm-0.3.2+xpu0.0.1.dist-info'
2024-03-01T21:09:50,878   2024-03-01 21:09:50,878 - wheel - INFO - creating /tmp/pip-wheel-1bgol7v1/.tmp-5w_l9a85/vllm-0.3.2+xpu0.0.1.dist-info/WHEEL
2024-03-01T21:09:50,890   2024-03-01 21:09:50,890 - root - INFO - running build_py
2024-03-01T21:09:50,890   2024-03-01 21:09:50,890 - root - INFO - running build_ext
2024-03-01T21:09:50,891   2024-03-01 21:09:50,891 - root - INFO - building 'vllm._C' extension
2024-03-01T21:09:50,891   2024-03-01 21:09:50,891 - root - INFO - creating /tmp/tmpd6uzw_2s.build-temp/csrc
2024-03-01T21:09:50,891   2024-03-01 21:09:50,891 - root - INFO - creating /tmp/tmpd6uzw_2s.build-temp/csrc/xpu
2024-03-01T21:09:50,911   Emitting ninja build file /tmp/tmpd6uzw_2s.build-temp/build.ninja...
2024-03-01T21:09:50,911   Compiling objects...
2024-03-01T21:09:50,911   Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
2024-03-01T21:09:51,153   [1/8] /opt/intel/oneapi/compiler/2024.0/bin/icpx -MMD -MF /tmp/tmpd6uzw_2s.build-temp/csrc/xpu/attention_xpu.o.d -pthread -B /home/sdp/.conda/envs/vllm_xpu/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/sdp/.conda/envs/vllm_xpu/include -fPIC -O2 -isystem /home/sdp/.conda/envs/vllm_xpu/include -fPIC -I/home/sdp/.conda/envs/vllm_xpu/lib/python3.10/site-packages/torch/include -I/home/sdp/.conda/envs/vllm_xpu/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/sdp/.conda/envs/vllm_xpu/lib/python3.10/site-packages/torch/include/TH -I/opt/intel/oneapi/compiler/2024.0/linux/include -I/opt/intel/oneapi/compiler/2024.0/linux/include/sycl -I/opt/intel/oneapi/mkl/2024.0/include -I/opt/intel/oneapi/dnnl/2024.0/include -I/home/sdp/.conda/envs/vllm_xpu/lib/python3.10/site-packages/intel_extension_for_pytorch/include -I/home/sdp/.conda/envs/vllm_xpu/include/python3.10 -c -c /home/sdp/projects/vllm/csrc/xpu/attention_xpu.cpp -o /tmp/tmpd6uzw_2s.build-temp/csrc/xpu/attention_xpu.o -DVLLM_BUILD_XPU_OPS -fsycl -fsycl-targets=spir64 -fsycl -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++17
2024-03-01T21:09:51,154   FAILED: /tmp/tmpd6uzw_2s.build-temp/csrc/xpu/attention_xpu.o
2024-03-01T21:09:51,154   /opt/intel/oneapi/compiler/2024.0/bin/icpx -MMD -MF /tmp/tmpd6uzw_2s.build-temp/csrc/xpu/attention_xpu.o.d -pthread -B /home/sdp/.conda/envs/vllm_xpu/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/sdp/.conda/envs/vllm_xpu/include -fPIC -O2 -isystem /home/sdp/.conda/envs/vllm_xpu/include -fPIC -I/home/sdp/.conda/envs/vllm_xpu/lib/python3.10/site-packages/torch/include -I/home/sdp/.conda/envs/vllm_xpu/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/sdp/.conda/envs/vllm_xpu/lib/python3.10/site-packages/torch/include/TH -I/opt/intel/oneapi/compiler/2024.0/linux/include -I/opt/intel/oneapi/compiler/2024.0/linux/include/sycl -I/opt/intel/oneapi/mkl/2024.0/include -I/opt/intel/oneapi/dnnl/2024.0/include -I/home/sdp/.conda/envs/vllm_xpu/lib/python3.10/site-packages/intel_extension_for_pytorch/include -I/home/sdp/.conda/envs/vllm_xpu/include/python3.10 -c -c /home/sdp/projects/vllm/csrc/xpu/attention_xpu.cpp -o /tmp/tmpd6uzw_2s.build-temp/csrc/xpu/attention_xpu.o -DVLLM_BUILD_XPU_OPS -fsycl -fsycl-targets=spir64 -fsycl -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++17
2024-03-01T21:09:51,154   In file included from /home/sdp/projects/vllm/csrc/xpu/attention_xpu.cpp:15:
2024-03-01T21:09:51,154   /home/sdp/projects/vllm/csrc/xpu/dtype_float16.h:25:10: fatal error: 'attention_generic.dp.hpp' file not found
2024-03-01T21:09:51,154      25 | #include "attention_generic.dp.hpp"
2024-03-01T21:09:51,154         |          ^~~~~~~~~~~~~~~~~~~~~~~~~~
2024-03-01T21:09:51,154   1 error generated.

What might be the issue?

@jikunshang (Contributor Author)

(Quoting the build error report above.)

Oh sorry, I didn't verify the latest code; there was some refactoring from another developer. Can you check out commit 8cdfae2 and try compiling with VLLM_BUILD_XPU_OPS=1 again?

@ilya-lavrenov (Contributor)

Does it support tensor parallelism via multiple GPUs and oneCCL?

@jikunshang (Contributor Author)

I am working on another branch that can run tensor parallel on PVC; Arc does not work yet. I will submit another PR to support it once this one is merged. https://github.com/jikunshang/vllm/tree/tp
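For context, a tensor-parallel run uses the standard vLLM interface shown below; whether the linked branch already accepts these exact arguments on XPU devices is an assumption, and, as noted above, only PVC is expected to work.

# Sketch of a tensor-parallel run (standard vLLM API); XPU support for this
# path lives on the linked branch and is assumed here, not guaranteed.
from vllm import LLM

llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=2)
print(llm.generate(["The future of AI is"])[0].outputs[0].text)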

@ilya-lavrenov (Contributor)

Do you observe a performance boost when a 7B model is executed on 2 GPUs?
Or is this mode mainly intended to fit 70B models across several GPUs?

@jikunshang (Contributor Author)

Actually, performance drops by about 1x on llama-2-7b and llama-2-13b; we are still investigating the root cause.

@maktukmak commented Mar 8, 2024

When I run python examples/offline_inference.py, I get the following error:

INFO 03-08 18:37:47 llm_engine.py:79] Initializing an LLM engine with config: model='facebook/opt-125m', tokenizer='facebook/opt-125m', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Traceback (most recent call last):
  File "/home/sdp/projects/vllm/examples/offline_inference.py", line 14, in <module>
    llm = LLM(model="facebook/opt-125m")
  File "/home/sdp/projects/vllm/vllm/entrypoints/llm.py", line 109, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/home/sdp/projects/vllm/vllm/engine/llm_engine.py", line 372, in from_engine_args
    engine = cls(*engine_configs,
  File "/home/sdp/projects/vllm/vllm/engine/llm_engine.py", line 120, in __init__
    self._init_workers()
  File "/home/sdp/projects/vllm/vllm/engine/llm_engine.py", line 164, in _init_workers
    self._run_workers("init_model")
  File "/home/sdp/projects/vllm/vllm/engine/llm_engine.py", line 1018, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/home/sdp/projects/vllm/vllm/worker/worker.py", line 91, in init_model
    torch.cuda.set_device(self.device)
  File "/home/sdp/.conda/envs/vllm_xpu/lib/python3.10/site-packages/torch/cuda/__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'

My setup has a GPU Max 1100. I think this error occurs because a CUDA dependency still exists at runtime even though the CUDA libraries are not installed. In the CPU PR (#1028), this was solved, i.e., CPU-only installation and runtime were possible. Maybe apply the same approach here too?

@jikunshang (Contributor Author)

(Quoting the offline_inference error report above.)

Emmm, I think that's not necessary. Please try adding device="xpu" and enforce_eager=True; that may fix it.
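Concretely, the suggested change to the example is roughly the following; a minimal sketch assuming this branch's LLM constructor accepts a device argument:

# Sketch of the suggested fix: pass device="xpu" and enforce_eager=True.
from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="facebook/opt-125m", device="xpu", enforce_eager=True)
outputs = llm.generate(["Hello, my name is"], sampling_params)
print(outputs[0].outputs[0].text)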

@maktukmak

@jikunshang Thanks, it worked.

@jikunshang (Contributor Author)

Closing this; you can find the latest PR at #3814 and the RFC at #3725.

@jikunshang closed this Apr 3, 2024
@alexander-potemkin

Thanks for the feature! Is this the way to run it:

docker build -f Dockerfile.xpu -t vllm-xpu-env --shm-size=4g .
docker run -it \
             --rm \
             --network=host \
             --device /dev/dri \
             -v /dev/dri/by-path:/dev/dri/by-path \
             vllm-xpu-env

as per the docs? Or is it something different?

@jikunshang (Contributor Author)

(Quoting the Docker question above.)

SYCL version support is deprecated. Please follow the latest IPEX-based solution. Thanks.

@alexander-potemkin

Thank you and apologies for the delay in getting back!

May I ask why the SYCL version is deprecated? It's not that I have any great experience with it, nor would I advocate for it, but if you could share the background for that decision, it would help me understand things better!

@jikunshang (Contributor Author)

The SYCL version is hard to maintain and its performance is not optimal. The IPEX team has experts to maintain these kernels and provide a stable API, so we chose to use IPEX as the backend.

@alexander-potemkin

Makes sense, thank you!
