[Feature][WIP] Prototype of vLLM execution on Intel GPU devices via SYCL. #2378

Closed
wants to merge 34 commits

Conversation

@jikunshang (Contributor) commented Jan 8, 2024

This is a follow-up to PR #1028.
We will refactor and split this into several smaller PRs later. It mainly contains the items below:

  • Introduce a new device type: XPU.
  • model_executor updates (adapting models/layers).
  • Kernel dispatch based on device type (see the sketch after this list).
  • Adapt vLLM core/engine/scheduler/cache manager.
  • Compile/build scripts.
  • SYCL kernel implementations.
  • Testing.
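For illustration, device-type based dispatch means routing each op to the backend that matches the tensor's device. This is a minimal sketch only; the module name vllm._C_xpu and the exact call signature are assumptions for illustration, not the code in this PR.

# Minimal sketch of device-type based kernel dispatch.
# NOTE: vllm._C_xpu is a hypothetical module name; in this PR the SYCL ops
# are built into the extension when VLLM_BUILD_XPU_OPS=1.
import torch

def rms_norm(out: torch.Tensor, x: torch.Tensor,
             weight: torch.Tensor, eps: float) -> None:
    if x.device.type == "xpu":
        from vllm import _C_xpu as ops  # SYCL kernels (hypothetical name)
    else:
        from vllm import _C as ops      # default CUDA kernels
    ops.rms_norm(out, x, weight, eps)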

prepare env

Make sure the hardware and GPU driver are ready.
Install the oneAPI Base Toolkit.

how to build

# Intel torch/IPEX wheels currently target Python 3.10; adjust if you use another version
pip install -r requirements-build-xpu.txt
# source oneapi_2024.0
export VLLM_BUILD_XPU_OPS=1 
pip install --no-build-isolation -v -e .

# install other runtime dependencies
pip install --no-deps xformers
wget https://intel-extension-for-pytorch.s3.amazonaws.com/ipex_stable/xpu/oneccl_bind_pt-2.1.100%2Bxpu-cp310-cp310-linux_x86_64.whl 
pip install oneccl_bind_pt-2.1.100+xpu-cp310-cp310-linux_x86_64.whl

how to run unit tests

pytest tests/kernels/test_layernorm.py::test_rms_norm_xpu
pytest tests/kernels/test_attention.py::test_paged_attention_xpu
pytest tests/kernels/test_pos_encoding.py::test_rotary_embedding_xpu
pytest tests/kernels/test_cache.py::test_reshape_and_cache_xpu
pytest tests/kernels/test_cache.py::test_copy_blocks_xpu
pytest tests/kernels/test_activation.py::test_silu_and_mul_xpu

how to run E2E test

# change the model path if needed
python3 examples/offline_inference.py
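For reference, the offline inference example is essentially the following; this is a minimal sketch with the model path replaced by a placeholder (point it at a locally available model):

# Sketch of examples/offline_inference.py with a local model path.
# "/path/to/local/model" is a placeholder; change it to a downloaded model.
from vllm import LLM, SamplingParams

prompts = ["Hello, my name is", "The capital of France is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="/path/to/local/model")
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, output.outputs[0].text)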

setup.py (outdated review comment)
BUILD_CPU_ONLY = os.getenv('VLLM_BUILD_CPU_ONLY', "0") == "1"
BUILD_XPU_OPS = os.getenv('VLLM_BUILD_XPU_OPS', "0") == "1"
if BUILD_XPU_OPS:
    # Only import the SYCL/DPC++ build helpers when building the XPU ops.
    from xpu_extension.xpu_cpp_extension import DPCPPExtension, DpcppBuildExtension
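For context, a sketch of how these helpers might then be wired into the build; this is illustrative only, it assumes DPCPPExtension mirrors torch's CppExtension signature, and the extension name and source list are placeholders rather than the actual contents of this PR.

# Illustrative sketch, not the actual setup.py in this PR.
import os
from setuptools import setup

BUILD_XPU_OPS = os.getenv("VLLM_BUILD_XPU_OPS", "0") == "1"

ext_modules = []
cmdclass = {}
if BUILD_XPU_OPS:
    from xpu_extension.xpu_cpp_extension import DPCPPExtension, DpcppBuildExtension

    ext_modules.append(
        DPCPPExtension(
            "vllm._C",
            sources=["csrc/xpu/attention_xpu.cpp", "csrc/xpu/pybind.cpp"],  # placeholder list
            extra_compile_args=["-fsycl", "-DVLLM_BUILD_XPU_OPS"],
        )
    )
    cmdclass["build_ext"] = DpcppBuildExtension

setup(name="vllm", ext_modules=ext_modules, cmdclass=cmdclass)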
@jikunshang (Contributor Author)


Thanks for your comments. Already updated; please check the latest change.

@maktukmak commented Mar 1, 2024

When I try to install the package via pip install --no-build-isolation -v -e ., I get the following error:

2024-03-01T21:09:47,860 Building wheels for collected packages: vllm
2024-03-01T21:09:47,861   Created temporary directory: /tmp/pip-wheel-1bgol7v1
2024-03-01T21:09:47,861   Destination directory: /tmp/pip-wheel-1bgol7v1
2024-03-01T21:09:47,861   Running command Building editable for vllm (pyproject.toml)
2024-03-01T21:09:50,812   /home/sdp/.conda/envs/vllm_xpu/lib/python3.10/site-packages/intel_extension_for_pytorch/xpu/cpp_extension.py:1564: UserWarning: This extension has static linked onednn library. Please attaction to                 that, this path of onednn version maybe not match with the built-in version.
2024-03-01T21:09:50,812     warnings.warn(
2024-03-01T21:09:50,827   2024-03-01 21:09:50,827 - root - INFO - running editable_wheel
2024-03-01T21:09:50,832   2024-03-01 21:09:50,832 - root - INFO - creating /tmp/pip-wheel-1bgol7v1/.tmp-5w_l9a85/vllm.egg-info
2024-03-01T21:09:50,834   2024-03-01 21:09:50,834 - root - INFO - writing /tmp/pip-wheel-1bgol7v1/.tmp-5w_l9a85/vllm.egg-info/PKG-INFO
2024-03-01T21:09:50,834   2024-03-01 21:09:50,834 - root - INFO - writing dependency_links to /tmp/pip-wheel-1bgol7v1/.tmp-5w_l9a85/vllm.egg-info/dependency_links.txt
2024-03-01T21:09:50,834   2024-03-01 21:09:50,834 - root - INFO - writing requirements to /tmp/pip-wheel-1bgol7v1/.tmp-5w_l9a85/vllm.egg-info/requires.txt
2024-03-01T21:09:50,835   2024-03-01 21:09:50,834 - root - INFO - writing top-level names to /tmp/pip-wheel-1bgol7v1/.tmp-5w_l9a85/vllm.egg-info/top_level.txt
2024-03-01T21:09:50,835   2024-03-01 21:09:50,835 - root - INFO - writing manifest file '/tmp/pip-wheel-1bgol7v1/.tmp-5w_l9a85/vllm.egg-info/SOURCES.txt'
2024-03-01T21:09:50,862   2024-03-01 21:09:50,862 - root - INFO - reading manifest file '/tmp/pip-wheel-1bgol7v1/.tmp-5w_l9a85/vllm.egg-info/SOURCES.txt'
2024-03-01T21:09:50,862   2024-03-01 21:09:50,862 - root - INFO - reading manifest template 'MANIFEST.in'
2024-03-01T21:09:50,863   2024-03-01 21:09:50,863 - root - INFO - adding license file 'LICENSE'
2024-03-01T21:09:50,864   2024-03-01 21:09:50,864 - root - INFO - writing manifest file '/tmp/pip-wheel-1bgol7v1/.tmp-5w_l9a85/vllm.egg-info/SOURCES.txt'
2024-03-01T21:09:50,864   2024-03-01 21:09:50,864 - root - INFO - creating '/tmp/pip-wheel-1bgol7v1/.tmp-5w_l9a85/vllm-0.3.2+xpu0.0.1.dist-info'
2024-03-01T21:09:50,878   2024-03-01 21:09:50,878 - wheel - INFO - creating /tmp/pip-wheel-1bgol7v1/.tmp-5w_l9a85/vllm-0.3.2+xpu0.0.1.dist-info/WHEEL
2024-03-01T21:09:50,890   2024-03-01 21:09:50,890 - root - INFO - running build_py
2024-03-01T21:09:50,890   2024-03-01 21:09:50,890 - root - INFO - running build_ext
2024-03-01T21:09:50,891   2024-03-01 21:09:50,891 - root - INFO - building 'vllm._C' extension
2024-03-01T21:09:50,891   2024-03-01 21:09:50,891 - root - INFO - creating /tmp/tmpd6uzw_2s.build-temp/csrc
2024-03-01T21:09:50,891   2024-03-01 21:09:50,891 - root - INFO - creating /tmp/tmpd6uzw_2s.build-temp/csrc/xpu
2024-03-01T21:09:50,911   Emitting ninja build file /tmp/tmpd6uzw_2s.build-temp/build.ninja...
2024-03-01T21:09:50,911   Compiling objects...
2024-03-01T21:09:50,911   Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
2024-03-01T21:09:51,153   [1/8] /opt/intel/oneapi/compiler/2024.0/bin/icpx -MMD -MF /tmp/tmpd6uzw_2s.build-temp/csrc/xpu/attention_xpu.o.d -pthread -B /home/sdp/.conda/envs/vllm_xpu/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/sdp/.conda/envs/vllm_xpu/include -fPIC -O2 -isystem /home/sdp/.conda/envs/vllm_xpu/include -fPIC -I/home/sdp/.conda/envs/vllm_xpu/lib/python3.10/site-packages/torch/include -I/home/sdp/.conda/envs/vllm_xpu/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/sdp/.conda/envs/vllm_xpu/lib/python3.10/site-packages/torch/include/TH -I/opt/intel/oneapi/compiler/2024.0/linux/include -I/opt/intel/oneapi/compiler/2024.0/linux/include/sycl -I/opt/intel/oneapi/mkl/2024.0/include -I/opt/intel/oneapi/dnnl/2024.0/include -I/home/sdp/.conda/envs/vllm_xpu/lib/python3.10/site-packages/intel_extension_for_pytorch/include -I/home/sdp/.conda/envs/vllm_xpu/include/python3.10 -c -c /home/sdp/projects/vllm/csrc/xpu/attention_xpu.cpp -o /tmp/tmpd6uzw_2s.build-temp/csrc/xpu/attention_xpu.o -DVLLM_BUILD_XPU_OPS -fsycl -fsycl-targets=spir64 -fsycl -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++17
2024-03-01T21:09:51,154   FAILED: /tmp/tmpd6uzw_2s.build-temp/csrc/xpu/attention_xpu.o
2024-03-01T21:09:51,154   /opt/intel/oneapi/compiler/2024.0/bin/icpx -MMD -MF /tmp/tmpd6uzw_2s.build-temp/csrc/xpu/attention_xpu.o.d -pthread -B /home/sdp/.conda/envs/vllm_xpu/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -fwrapv -O2 -Wall -fPIC -O2 -isystem /home/sdp/.conda/envs/vllm_xpu/include -fPIC -O2 -isystem /home/sdp/.conda/envs/vllm_xpu/include -fPIC -I/home/sdp/.conda/envs/vllm_xpu/lib/python3.10/site-packages/torch/include -I/home/sdp/.conda/envs/vllm_xpu/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -I/home/sdp/.conda/envs/vllm_xpu/lib/python3.10/site-packages/torch/include/TH -I/opt/intel/oneapi/compiler/2024.0/linux/include -I/opt/intel/oneapi/compiler/2024.0/linux/include/sycl -I/opt/intel/oneapi/mkl/2024.0/include -I/opt/intel/oneapi/dnnl/2024.0/include -I/home/sdp/.conda/envs/vllm_xpu/lib/python3.10/site-packages/intel_extension_for_pytorch/include -I/home/sdp/.conda/envs/vllm_xpu/include/python3.10 -c -c /home/sdp/projects/vllm/csrc/xpu/attention_xpu.cpp -o /tmp/tmpd6uzw_2s.build-temp/csrc/xpu/attention_xpu.o -DVLLM_BUILD_XPU_OPS -fsycl -fsycl-targets=spir64 -fsycl -DTORCH_API_INCLUDE_EXTENSION_H '-DPYBIND11_COMPILER_TYPE="_gcc"' '-DPYBIND11_STDLIB="_libstdcpp"' '-DPYBIND11_BUILD_ABI="_cxxabi1016"' -DTORCH_EXTENSION_NAME=_C -D_GLIBCXX_USE_CXX11_ABI=1 -std=c++17
2024-03-01T21:09:51,154   In file included from /home/sdp/projects/vllm/csrc/xpu/attention_xpu.cpp:15:
2024-03-01T21:09:51,154   /home/sdp/projects/vllm/csrc/xpu/dtype_float16.h:25:10: fatal error: 'attention_generic.dp.hpp' file not found
2024-03-01T21:09:51,154      25 | #include "attention_generic.dp.hpp"
2024-03-01T21:09:51,154         |          ^~~~~~~~~~~~~~~~~~~~~~~~~~
2024-03-01T21:09:51,154   1 error generated.

What might be the issue?

@jikunshang (Contributor Author)

(Quoting the build error report above.)

Oh sorry, I didn't verify the latest code; there was some refactoring from another developer. Can you check out commit 8cdfae2 and try compiling with VLLM_BUILD_XPU_OPS=1 again?

@ilya-lavrenov (Contributor)

Does it support tensor parallelism via multiple GPUs and oneCCL?

@jikunshang (Contributor Author)

I am working on another branch that can run tensor parallel on PVC; Arc does not work yet. I will submit another PR to support it once this one is merged. https://github.com/jikunshang/vllm/tree/tp
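For context, a tensor-parallel run uses the standard vLLM interface shown below; whether the linked branch already accepts these exact arguments on XPU devices is an assumption, and, as noted above, only PVC is expected to work.

# Sketch of a tensor-parallel run (standard vLLM API); XPU support for this
# path lives on the linked branch and is assumed here, not guaranteed.
from vllm import LLM

llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=2)
print(llm.generate(["The future of AI is"])[0].outputs[0].text)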

@ilya-lavrenov (Contributor)

Do you observe a performance boost when a 7B model is executed on 2 GPUs?
Or is this mode mainly intended to fit 70B models across several GPUs?

@jikunshang (Contributor Author)

Actually, performance drops by about 1x on llama-2-7b and llama-2-13b; we are still investigating the root cause.

@maktukmak commented Mar 8, 2024

When I run python examples/offline_inference.py, I get the following error:

INFO 03-08 18:37:47 llm_engine.py:79] Initializing an LLM engine with config: model='facebook/opt-125m', tokenizer='facebook/opt-125m', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Traceback (most recent call last):
  File "/home/sdp/projects/vllm/examples/offline_inference.py", line 14, in <module>
    llm = LLM(model="facebook/opt-125m")
  File "/home/sdp/projects/vllm/vllm/entrypoints/llm.py", line 109, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/home/sdp/projects/vllm/vllm/engine/llm_engine.py", line 372, in from_engine_args
    engine = cls(*engine_configs,
  File "/home/sdp/projects/vllm/vllm/engine/llm_engine.py", line 120, in __init__
    self._init_workers()
  File "/home/sdp/projects/vllm/vllm/engine/llm_engine.py", line 164, in _init_workers
    self._run_workers("init_model")
  File "/home/sdp/projects/vllm/vllm/engine/llm_engine.py", line 1018, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/home/sdp/projects/vllm/vllm/worker/worker.py", line 91, in init_model
    torch.cuda.set_device(self.device)
  File "/home/sdp/.conda/envs/vllm_xpu/lib/python3.10/site-packages/torch/cuda/__init__.py", line 404, in set_device
    torch._C._cuda_setDevice(device)
AttributeError: module 'torch._C' has no attribute '_cuda_setDevice'

My setup has a GPU Max 1100. I think this error occurs because a CUDA dependency still exists at runtime even though the CUDA libraries are not installed. In the CPU PR (#1028), this was solved, i.e., CPU-only installation and runtime were possible. Maybe apply the same approach here too?

@jikunshang (Contributor Author)

(Quoting the offline_inference error report above.)

Emmm, I think that's not necessary. Please try adding device="xpu" and enforce_eager=True; that may fix it.
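Concretely, the suggested change to the example is roughly the following; a minimal sketch assuming this branch's LLM constructor accepts a device argument:

# Sketch of the suggested fix: pass device="xpu" and enforce_eager=True.
from vllm import LLM, SamplingParams

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="facebook/opt-125m", device="xpu", enforce_eager=True)
outputs = llm.generate(["Hello, my name is"], sampling_params)
print(outputs[0].outputs[0].text)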

@maktukmak

@jikunshang Thanks, it worked.

@jikunshang (Contributor Author)

Closing this; you can find the latest PR at #3814 and the RFC at #3725.

@jikunshang closed this Apr 3, 2024
@alexander-potemkin

Thanks for the feature! Is this the way to run it:

docker build -f Dockerfile.xpu -t vllm-xpu-env --shm-size=4g .
docker run -it \
             --rm \
             --network=host \
             --device /dev/dri \
             -v /dev/dri/by-path:/dev/dri/by-path \
             vllm-xpu-env

as per the docs? Or is it something different?

@jikunshang (Contributor Author)

(Quoting the Docker question above.)

SYCL version support is deprecated. Please follow the latest IPEX-based solution. Thanks.

@alexander-potemkin

Thank you and apologies for the delay in getting back!

May I ask why the SYCL version is deprecated? It's not that I have any great experience with it, nor would I advocate for it, but if you could share the background for that decision, it would help me understand things better!

@jikunshang (Contributor Author)

The SYCL version is hard to maintain and its performance is not optimal. The IPEX team has experts to maintain these kernels and provide a stable API, so we chose to use IPEX as the backend.

@alexander-potemkin

Makes sense, thank you!
