[v0.7.3-dev] FAQ / Feedback | Questions / Feedback #267

Open
Yikun opened this issue Mar 7, 2025 · 8 comments

Comments

@Yikun
Collaborator

Yikun commented Mar 7, 2025

Anything you want to discuss about vLLM on Ascend.

Please see the docs: https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/


Please follow https://vllm-ascend.readthedocs.io/en/v0.7.3-dev/ to install.

@Yikun Yikun pinned this issue Mar 7, 2025
@dawnranger

Are there detailed release notes describing the differences between v0.7.1 and v0.7.3?

@Yikun
Collaborator Author

Yikun commented Mar 7, 2025

@dawnranger v0.7.3rc1 will be released next week with a new release note, including DeepSeek-related fixes, accuracy fixes, and Qwen2-VL improvements.

@gameofdimension

#264: failed to start the service with an OOM error.
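
Without knowing the exact setup in #264, a hedged sketch of the usual knobs to try when the engine OOMs at startup; the model path and values below are placeholders, not a confirmed fix:

from vllm import LLM

llm = LLM(
    model="Qwen2.5-72B-Instruct/",   # placeholder path, substitute the model from #264
    gpu_memory_utilization=0.85,     # reserve a smaller fraction of device memory
    max_model_len=8192,              # a shorter context also shrinks the KV cache
    tensor_parallel_size=8,          # spread weights across more NPUs if available
)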

@SHYuanBest

SHYuanBest commented Mar 9, 2025

Traceback (most recent call last):
  File "/work/share/projects/ysh/ConsisID-X/vllm_code/test_llm.py", line 55, in <module>
    llm = LLM(model="/work/share/checkpoint/ysh/Qwen2.5-72B-Instruct/",
  File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 1022, in inner
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 242, in __init__
    self.llm_engine = self.engine_class.from_engine_args(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 486, in from_engine_args
    engine_config = engine_args.create_engine_config(usage_context)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/arg_utils.py", line 1127, in create_engine_config
    model_config = self.create_model_config()
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/arg_utils.py", line 1047, in create_model_config
    return ModelConfig(
  File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 366, in __init__
    self.multimodal_config = self._init_multimodal_config(
  File "/usr/local/lib/python3.10/dist-packages/vllm/config.py", line 427, in _init_multimodal_config
    if ModelRegistry.is_multimodal_model(architectures):
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/registry.py", line 460, in is_multimodal_model
    model_cls, _ = self.inspect_model_cls(architectures)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/registry.py", line 420, in inspect_model_cls
    return self._raise_for_unsupported(architectures)
  File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/registry.py", line 372, in _raise_for_unsupported
    raise ValueError(
ValueError: Model architectures ['Qwen2ForCausalLM'] failed to be inspected. Please check the logs for more details.
[ERROR] 2025-03-08-16:06:16 (PID:200608, Device:-1, RankID:-1) ERR99999 UNKNOWN applicaiton exception
/usr/lib/python3.10/tempfile.py:1008: ResourceWarning: Implicitly cleaning up <TemporaryDirectory '/tmp/tmp9rr8w9lk'>
  _warnings.warn(warn_message, ResourceWarning)

import torch
from decord import VideoReader, cpu
from torchvision import transforms
from torch.utils.data import Dataset, DataLoader

import importlib
if importlib.util.find_spec("torch_npu") is not None:
    import torch_npu
    from torch_npu.contrib import transfer_to_npu
else:
    torch_npu = None


import gc
from tqdm import tqdm
from vllm import LLM, SamplingParams
from vllm.distributed.parallel_state import (destroy_distributed_environment,
                                             destroy_model_parallel)


def clean_up():
    destroy_model_parallel()
    destroy_distributed_environment()
    gc.collect()
    torch.npu.empty_cache()

prompts = ["Hello"]

sampling_params = SamplingParams(
                        repetition_penalty=1.05,
                        temperature=0.7,
                        top_p=0.8,
                        top_k=20,
                        max_tokens=512
                    )
llm = LLM(model="Qwen2.5-72B-Instruct/",
          tensor_parallel_size=8,
          distributed_executor_backend="mp",
          max_model_len=26240)

# for _ in tqdm(range(100)):
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    # print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
    print(f"{generated_text}")

del llm
clean_up()

Running pip uninstall decord fixes the bug, but I want to keep using decord.
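
If the failure really is triggered just by importing decord at module load time (an assumption, not confirmed), one possible workaround is to defer that import until a video actually needs to be decoded, so LLM(...) is constructed before decord is loaded. A rough sketch; the load_video_frames helper below is hypothetical:

def load_video_frames(path: str, num_frames: int = 8):
    # Import decord lazily so the vLLM engine is built before decord is ever
    # loaded; this only helps if the crash comes from the import itself.
    from decord import VideoReader, cpu

    vr = VideoReader(path, ctx=cpu(0))
    step = max(len(vr) // num_frames, 1)
    indices = list(range(0, len(vr), step))[:num_frames]
    return vr.get_batch(indices).asnumpy()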

@SHYuanBest

/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py:29: ResourceWarning: unclosed <socket.socket fd=7, family=AddressFamily.AF_INET, type=SocketKind.SOCK_DGRAM, proto=0, laddr=('10.225.17.23', 38084), raddr=('8.8.8.8', 80)>
get_ip(), get_open_port())
WARNING 03-09 15:43:10 utils.py:2262] Methods add_lora,add_prompt_adapter,cache_config,compilation_config,current_platform,list_loras,list_prompt_adapters,load_config,pin_lora,pin_prompt_adapter,remove_lora,remove_prompt_adapter not implemented in <vllm_ascend.worker.worker.NPUWorker object at 0xfffc916cce50>
[rank0]: Traceback (most recent call last):
[rank0]: File "/work/share/projects/ysh/1_Code/vllm_code/test_mllm.py", line 26, in
[rank0]: llm = LLM(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 1022, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/llm.py", line 242, in init
[rank0]: self.llm_engine = self.engine_class.from_engine_args(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 489, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 273, in init
[rank0]: self.model_executor = executor_class(vllm_config=vllm_config, )
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 52, in init
[rank0]: self._init_executor()
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 47, in _init_executor
[rank0]: self.collective_rpc("load_model")
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/executor/uniproc_executor.py", line 56, in collective_rpc
[rank0]: answer = run_method(self.driver_worker, method, args, kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/utils.py", line 2196, in run_method
[rank0]: return func(*args, **kwargs)
[rank0]: File "/work/share/projects/ysh/0_local_env/dev_vllm/vllm-ascend/vllm_ascend/worker/worker.py", line 179, in load_model
[rank0]: self.model_runner.load_model()
[rank0]: File "/work/share/projects/ysh/0_local_env/dev_vllm/vllm-ascend/vllm_ascend/worker/model_runner.py", line 818, in load_model
[rank0]: self.model = get_model(vllm_config=self.vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/init.py", line 14, in get_model
[rank0]: return loader.load_model(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 406, in load_model
[rank0]: model = _initialize_model(vllm_config=vllm_config)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/model_loader/loader.py", line 125, in _initialize_model
[rank0]: return model_class(vllm_config=vllm_config, prefix=prefix)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2_5_vl.py", line 774, in init
[rank0]: self.visual = Qwen2_5_VisionTransformer(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2_5_vl.py", line 512, in init
[rank0]: self.blocks = nn.ModuleList([
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2_5_vl.py", line 513, in
[rank0]: Qwen2_5_VisionBlock(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2_5_vl.py", line 355, in init
[rank0]: self.attn = Qwen2_5_VisionAttention(embed_dim=dim,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/qwen2_5_vl.py", line 217, in init
[rank0]: self.qkv = ColumnParallelLinear(input_size=embed_dim,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 305, in init
[rank0]: super().init(input_size, output_size, skip_bias_add, params_dtype,
[rank0]: File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 179, in init
[rank0]: self.quant_method = quant_config.get_quant_method(self,
[rank0]: File "/work/share/projects/ysh/0_local_env/dev_vllm/vllm-ascend/vllm_ascend/quantization/quant_config.py", line 88, in get_quant_method
[rank0]: if self.is_layer_skipped_ascend(prefix,
[rank0]: File "/work/share/projects/ysh/0_local_env/dev_vllm/vllm-ascend/vllm_ascend/quantization/quant_config.py", line 122, in is_layer_skipped_ascend
[rank0]: is_skipped = self.quant_description[prefix + '.weight'] == "FLOAT"
[rank0]: KeyError: 'visual.blocks.0.attn.qkv.weight'
[ERROR] 2025-03-09-15:43:16 (PID:320010, Device:0, RankID:-1) ERR99999 UNKNOWN applicaiton exception

import torch
# from decord import VideoReader, cpu
from torchvision import transforms
from torch.utils.data import Dataset, DataLoader

import importlib
if importlib.util.find_spec("torch_npu") is not None:
    import torch_npu
    from torch_npu.contrib import transfer_to_npu
else:
    torch_npu = None

from argparse import Namespace
from typing import List, NamedTuple, Optional

from PIL.Image import Image
from transformers import AutoProcessor, AutoTokenizer

from vllm import LLM, SamplingParams
from vllm.multimodal.utils import fetch_image
from vllm.utils import FlexibleArgumentParser
from qwen_vl_utils import process_vision_info

model_name = "Qwen2.5-VL-72B-Instruct-AWQ"

llm = LLM(
    model=model_name,
    max_model_len=32768 if process_vision_info is None else 4096,
    max_num_seqs=5,  # batch_size
    mm_processor_kwargs={
            "min_pixels": 28 * 28,
            "max_pixels": 1280 * 28 * 28,
            "fps": 1,
        },
    limit_mm_per_prompt={"image": 10, "video": 10},
)
processor = AutoProcessor.from_pretrained(model_name)

messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant."
    },
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_data, _ = process_vision_info(messages)

stop_token_ids = None

sampling_params = SamplingParams(
                        repetition_penalty=1.05,
                        temperature=0.1,
                        top_p=0.001,
                        top_k=1,
                        max_tokens=512,
                        stop_token_ids=stop_token_ids
                    )

inputs = {
        "prompt": prompt,
        "multi_modal_data": {
            "image": image_data
        },
    }

outputs = llm.generate(inputs, sampling_params=sampling_params)

for o in outputs:
    generated_text = o.outputs[0].text
    print(generated_text)

del llm
clean_up()  # reuses the clean_up() helper defined in the previous snippet

@wangxiyuan
Collaborator

@SHYuanBest model_name = "Qwen2.5-VL-72B-Instruct-AWQ": vllm-ascend doesn't support quantization currently.
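
Until quantization is supported, one interim option is to point the same script at the unquantized weights. A minimal sketch, assuming the BF16 checkpoint Qwen2.5-VL-72B-Instruct is available locally and memory allows it:

from vllm import LLM

llm = LLM(
    model="Qwen2.5-VL-72B-Instruct",  # unquantized checkpoint instead of the -AWQ one
    max_model_len=4096,
    max_num_seqs=5,
    mm_processor_kwargs={"min_pixels": 28 * 28, "max_pixels": 1280 * 28 * 28, "fps": 1},
    limit_mm_per_prompt={"image": 10, "video": 10},
)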

@wangxiyuan
Collaborator

@SHYuanBest For model="Qwen2.5-72B-Instruct/": this error is raised by vLLM before the model is loaded, so it looks like a vLLM bug. The log is not enough; can you print more info?
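
One way to capture more detail from the failing run is to raise vLLM's log level before the engine is constructed. A sketch, assuming the VLLM_LOGGING_LEVEL environment variable is honored by this build (it must be set before vllm is imported):

import os

# Set before the first `import vllm`, otherwise logging is already configured.
os.environ["VLLM_LOGGING_LEVEL"] = "DEBUG"

from vllm import LLM

llm = LLM(model="Qwen2.5-72B-Instruct/",
          tensor_parallel_size=8,
          distributed_executor_backend="mp",
          max_model_len=26240)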

@man-in-sky

[model]:deepseek-r1-bf16
[machine]:32p
[version]:0.7.3-dev
[execute]:offline
[script]:

      llm = LLM(model="./deepseek-r1_bf16/",
                tensor_parallel_size=16,
                pipeline_parallel_size=2, 
                distributed_executor_backend="ray",
                trust_remote_code=True,
                max_model_len=48)

[ERROR]:

INFO 03-10 02:08:15 executor_base.py:111] # npu blocks: 18092, # CPU blocks: 7516
INFO 03-10 02:08:15 executor_base.py:116] Maximum concurrency for 48 tokens per request: 6030.67x
INFO 03-10 02:08:19 llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 19.77 seconds
Processed prompts: 0%| | 0/2 [00:00<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s][rank0]: Traceback (most recent call last):
[rank0]: File "/home/t00676981/big_model/vllm_ascend/_offine_inference.py", line 27, in
[rank0]: outputs = llm.generate(prompts, sampling_params)
[rank0]: File "/usr/local/python3.10/lib/python3.10/site-packages/vllm/utils.py", line 1057, in inner
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/usr/local/python3.10/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 469, in generate
[rank0]: outputs = self._run_engine(use_tqdm=use_tqdm)
[rank0]: File "/usr/local/python3.10/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 1397, in _run_engine
[rank0]: step_outputs = self.llm_engine.step()
[rank0]: File "/usr/local/python3.10/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 1314, in step
[rank0]: raise NotImplementedError(
[rank0]: NotImplementedError: Pipeline parallelism is only supported through AsyncLLMEngine as performance will be severely degraded otherwise.
(NPURayWorkerWrapper pid=150823) [rank14]:[W310 02:08:14.691525996 MoeInitRoutingKernelNpuOpApi.cpp:28] Warning: The oprator of MoeInitRouting will be removed from Pytorch and switch to AscendSpeed after 630. (function operator()) [repeated 30x across cluster]
[ERROR] 2025-03-10-02:08:23 (PID:150515, Device:0, RankID:-1) ERR99999 UNKNOWN applicaiton exception

Processed prompts: 0%| | 0/2 [00:03<?, ?it/s, est. speed input: 0.00 toks/s, output: 0.00 toks/s]
Exception ignored in: <function LLMEngine.__del__ at 0x7f6a6089bd90>
Traceback (most recent call last):
File "/usr/local/python3.10/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 508, in __del__
File "/usr/local/python3.10/lib/python3.10/site-packages/vllm/executor/ray_distributed_executor.py", line 104, in shutdown
AttributeError: 'NoneType' object has no attribute 'info'
Exception ignored in: <function RayDistributedExecutor.__del__ at 0x7f69b9e38a60>
Traceback (most recent call last):
File "/usr/local/python3.10/lib/python3.10/site-packages/vllm/executor/ray_distributed_executor.py", line 577, in __del__
File "/usr/local/python3.10/lib/python3.10/site-packages/vllm/executor/ray_distributed_executor.py", line 104, in shutdown
AttributeError: 'NoneType' object has no attribute 'info'

Is pipeline parallelism (PP) only supported for online serving, i.e. through AsyncLLMEngine?
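
The NotImplementedError above says pipeline parallelism is only supported through AsyncLLMEngine, the engine used for online serving. For reference, a minimal sketch of driving the same checkpoint through that async API offline, assuming the vLLM 0.7.x AsyncEngineArgs/AsyncLLMEngine interface and the same parallel settings as the script above (not a verified vllm-ascend recipe):

import asyncio

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine_args = AsyncEngineArgs(
    model="./deepseek-r1_bf16/",
    tensor_parallel_size=16,
    pipeline_parallel_size=2,
    distributed_executor_backend="ray",
    trust_remote_code=True,
    max_model_len=48,
)
engine = AsyncLLMEngine.from_engine_args(engine_args)

async def run(prompt: str) -> str:
    params = SamplingParams(max_tokens=32)
    final = None
    # generate() is an async generator that yields incremental RequestOutputs
    async for output in engine.generate(prompt, params, request_id="req-0"):
        final = output
    return final.outputs[0].text

print(asyncio.run(run("Hello")))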
