[Usage]: Trying to add codeshell 7b model, but got an error #11451

Closed
G1017 opened this issue Dec 24, 2024 · 21 comments
Labels
usage How to use vllm

Comments

@G1017

G1017 commented Dec 24, 2024

Your current environment

The output of `python collect_env.py`

Model Input Dumps

No response

🐛 Describe the bug

codeshell.py:

from typing import List, Optional, Tuple, Union, Iterable, Set

import torch
from torch import nn
from transformers.configuration_utils import PretrainedConfig
from transformers.utils import logging

from vllm.attention import Attention, AttentionMetadata
from vllm.config import CacheConfig
from vllm.distributed.parallel_state import (
    get_pp_group, get_tensor_model_parallel_world_size)
from vllm.model_executor.layers.activation import get_act_fn
from vllm.model_executor.layers.linear import (ColumnParallelLinear,
                                               QKVParallelLinear,
                                               RowParallelLinear)
from vllm.model_executor.layers.logits_processor import LogitsProcessor
from vllm.model_executor.layers.quantization.base_config import (
    QuantizationConfig)
from vllm.model_executor.layers.sampler import Sampler
from vllm.model_executor.layers.vocab_parallel_embedding import (
    VocabParallelEmbedding)
from vllm.model_executor.model_loader.weight_utils import default_weight_loader
from vllm.model_executor.sampling_metadata import SamplingMetadata
from vllm.sequence import IntermediateTensors, SamplerOutput

from .utils import is_pp_missing_parameter, make_layers

### Build the config
logger = logging.get_logger(__name__)


class CodeShellConfig(PretrainedConfig):
    model_type = "codeshell"
    keys_to_ignore_at_inference = ["past_key_values"]
    attribute_map = {
        "hidden_size": "n_embd",
        "max_position_embeddings": "n_positions",
        "num_attention_heads": "n_head",
        "num_hidden_layers": "n_layer",
    }

    def __init__(
            self,
            vocab_size=70144,
            n_positions=8192,
            n_embd=4096,
            n_layer=42,
            n_head=32,
            n_inner=None,
            activation_function="gelu_pytorch_tanh",
            resid_pdrop=0.1,
            embd_pdrop=0.1,
            attn_pdrop=0.1,
            layer_norm_epsilon=1e-5,
            initializer_range=0.02,
            scale_attn_weights=True,
            use_cache=True,
            bos_token_id=70000,
            eos_token_id=70000,
            attention_softmax_in_fp32=True,
            scale_attention_softmax_in_fp32=True,
            group_query_attention=True,
            num_query_groups=1,
            position_embedding_type="learned_absolute",
            rope_scaling=None,
            **kwargs,
    ):
        self.vocab_size = vocab_size
        self.n_positions = n_positions
        self.n_embd = n_embd
        self.n_layer = n_layer
        self.n_head = n_head
        self.n_inner = n_inner
        self.activation_function = activation_function
        self.resid_pdrop = resid_pdrop
        self.embd_pdrop = embd_pdrop
        self.attn_pdrop = attn_pdrop
        self.layer_norm_epsilon = layer_norm_epsilon
        self.initializer_range = initializer_range
        self.scale_attn_weights = scale_attn_weights
        self.use_cache = use_cache
        self.attention_softmax_in_fp32 = attention_softmax_in_fp32
        self.scale_attention_softmax_in_fp32 = scale_attention_softmax_in_fp32
        self.group_query_attention = group_query_attention
        self.num_query_groups = num_query_groups
        self.position_embedding_type = position_embedding_type
        self.rope_scaling = rope_scaling
        assert self.position_embedding_type in [
            "learned_absolute", "rope"
        ], "position_embedding_type must be one of ['learned_absolute', 'rope']"

        self.bos_token_id = bos_token_id
        self.eos_token_id = eos_token_id

        super().__init__(bos_token_id=bos_token_id, eos_token_id=eos_token_id, **kwargs)


## Implements Rotary Positional Embedding
class CodeShellRotaryEmbedding(torch.nn.Module):
    def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
        super().__init__()

        self.dim = dim
        self.max_position_embeddings = max_position_embeddings
        self.base = base
        inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
        self.register_buffer("inv_freq", inv_freq)

        # Build here to make `torch.jit.trace` work.
        self._set_cos_sin_cache(
            seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
        )

    def _set_cos_sin_cache(self, seq_len, device, dtype):
        self.max_seq_len_cached = seq_len
        t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)

        freqs = torch.einsum("i,j->ij", t, self.inv_freq)
        # Different from paper, but it uses a different permutation in order to obtain the same calculation
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos()[None, None, :, :].to(dtype), persistent=False)
        self.register_buffer("sin_cached", emb.sin()[None, None, :, :].to(dtype), persistent=False)

    def forward(self, x, seq_len=None):
        # x: [bs, num_attention_heads, seq_len, head_size]
        if seq_len > self.max_seq_len_cached:
            self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)

        return (
            self.cos_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
            self.sin_cached[:, :, :seq_len, ...].to(dtype=x.dtype),
        )


def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., : x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2:]
    return torch.cat((-x2, x1), dim=-1)


def apply_rotary_pos_emb(q, k, cos, sin, position_ids):
    # The first two dimensions of cos and sin are always 1, so we can `squeeze` them.
    print("shape q k cos sin:", q.shape, k.shape, cos.shape, sin.shape)
    cos = cos.squeeze(1).squeeze(0)  # [seq_len, dim]
    sin = sin.squeeze(1).squeeze(0)  # [seq_len, dim]
    cos = cos[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    sin = sin[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    print("q shape:", q.shape)
    print("cos shape:", cos.shape)
    print("sin shape:", sin.shape)

    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed


def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """
    This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
    num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
    """
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)


class CodeShellAttention(nn.Module):
    def __init__(
            self,
            config: CodeShellConfig,
            cache_config: Optional[CacheConfig] = None,
            quant_config: Optional[QuantizationConfig] = None,
            prefix: str = "", ):
        super().__init__()

        self.mask_value = None
        ####
        # self.num_key_value_groups = config.num_attention_heads // config.num_query_groups
        self.max_position_embeddings = config.max_position_embeddings
        self.rope_scaling = config.rope_scaling
        self.position_embedding_type = config.position_embedding_type
        self.num_query_groups = config.num_query_groups
        self.group_query_attention = config.group_query_attention
        self.hidden_size = config.hidden_size
        total_num_heads = config.num_attention_heads
        tensor_model_parallel_world_size = (
            get_tensor_model_parallel_world_size())
        assert total_num_heads % tensor_model_parallel_world_size == 0
        self.num_heads = total_num_heads // tensor_model_parallel_world_size
        self.head_dim = self.hidden_size // self.num_heads
        self.kv_heads = config.num_query_groups if self.group_query_attention else total_num_heads
        self.kv_dim = self.kv_heads * self.head_dim
        self.scale = self.head_dim ** -0.5
        self.c_attn = QKVParallelLinear(
            self.hidden_size,
            self.head_dim,
            total_num_heads,
            self.kv_heads,
            bias=True,
            quant_config=quant_config,
            prefix=f"{prefix}.c_attn",
        )
        self.c_proj = RowParallelLinear(
            self.hidden_size,
            self.hidden_size,
            bias=True,
            quant_config=quant_config,
            prefix=f"{prefix}.c_proj",
        )

        self.attn = Attention(self.num_heads,
                              self.head_dim,
                              scale=self.scale,
                              num_kv_heads=self.kv_heads,
                              cache_config=cache_config,
                              quant_config=quant_config)
        from vllm.model_executor.layers.rotary_embedding import get_rope
        max_positions = getattr(config, "seq_length", 8192)
        rope_ratio = getattr(config, "rope_ratio", 1.0)
        self.rotary_emb1 = get_rope(
            self.head_dim,
            rotary_dim=self.head_dim // 2,
            max_position=max_positions,
            base=10000 * rope_ratio,
            is_neox_style=False,
        )

        if self.position_embedding_type == "rope":
            self._init_rope()

    def _init_rope(self):
        if self.rope_scaling is None:
            self.rotary_emb = CodeShellRotaryEmbedding(
                self.head_dim,
                max_position_embeddings=self.max_position_embeddings)

    def _get_mask_value(self, device, dtype):
        # torch.where expects a tensor. We use a cache to avoid recreating it every time.
        if self.mask_value is None or self.mask_value.dtype != dtype or self.mask_value.device != device:
            self.mask_value = torch.full([], torch.finfo(dtype).min, dtype=dtype, device=device)
        return self.mask_value

    def forward(
            self,
            hidden_states: torch.Tensor,
            position_ids: torch.Tensor,
            kv_cache: torch.Tensor,
            attn_metadata: AttentionMetadata,
    ) -> torch.Tensor:

        qkv, _ = self.c_attn(hidden_states)
        print(qkv.shape, hidden_states.shape)
        q, k, v = qkv.split([self.hidden_size, self.kv_dim, self.kv_dim], dim=-1)
        # q, k, v = qkv.chunk(chunks=3, dim=-1)
        # print("________q, k, v___________",q.shape, k.shape, v.shape)

        # query_states, key_states, value_states = self.c_attn(hidden_states).split(
        #     (self.hidden_size, self.kv_dim, self.kv_dim), dim=2)
        q, k = self.rotary_emb1(position_ids, q, k)
        if kv_cache is not None:
            print("____q,k____", q.shape, k.shape, kv_cache.shape)
        # kv_seq_len = k.shape[-2]
        # kv_seq_len = 1
        # cos, sin = self.rotary_emb(v, seq_len=kv_seq_len)

        # q, k = apply_rotary_pos_emb(q, k, cos, sin, position_ids)

        # k = repeat_kv(k, self.num_heads // self.kv_heads)
        # v = repeat_kv(v, self.num_heads // self.kv_heads)

        # attn_weights = torch.matmul(q, k.transpose(2, 3)) / math.sqrt(self.head_dim)
        # attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(q.dtype)
        # attn_weights = self.attn_dropout(attn_weights)
        # attn_output = torch.matmul(attn_weights, v)
        # attn_output = attn_output.transpose(1, 2).contiguous()
        # attn_output = attn_output.reshape(bsz, q_len, self.embed_dim)
        # attn_output = self.c_proj(attn_output)
        # attn_output = self.resid_dropout(attn_output)
        # outputs = (attn_output, layer_past)
        attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
        attn_output, _ = self.c_proj(attn_output)
        return attn_output


class CodeShellMLP(nn.Module):
    def __init__(
            self,
            intermediate_size: int,
            config: CodeShellConfig,
            quant_config: Optional[QuantizationConfig] = None,
            prefix: str = "",
    ):
        super().__init__()
        hidden_size = config.hidden_size
        self.c_fc = ColumnParallelLinear(
            hidden_size,
            intermediate_size,
            bias=True,
            quant_config=quant_config,
            prefix=f"{prefix}.c_fc",
        )
        self.c_proj = RowParallelLinear(
            intermediate_size,
            hidden_size,
            bias=True,
            quant_config=quant_config,
            prefix=f"{prefix}.c_proj",
        )
        self.act = get_act_fn(config.activation_function, quant_config,
                              intermediate_size)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        hidden_states, _ = self.c_fc(hidden_states)
        hidden_states = self.act(hidden_states)
        hidden_states, _ = self.c_proj(hidden_states)
        return hidden_states


class CodeShellBlock(nn.Module):
    def __init__(
            self,
            config: CodeShellConfig,
            cache_config: Optional[CacheConfig] = None,
            quant_config: Optional[QuantizationConfig] = None,
            prefix: str = "",
    ):
        super().__init__()
        hidden_size = config.hidden_size
        inner_dim = (config.n_inner if config.n_inner is not None
                     else 4 * hidden_size)

        self.ln_1 = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
        self.attn = CodeShellAttention(config,
                                       cache_config,
                                       quant_config,
                                       prefix=f"{prefix}.attn")
        self.ln_2 = nn.LayerNorm(hidden_size, eps=config.layer_norm_epsilon)
        self.mlp = CodeShellMLP(inner_dim,
                                config,
                                quant_config,
                                prefix=f"{prefix}.mlp")

    def forward(
            self,
            hidden_states: torch.Tensor,
            kv_cache: torch.Tensor,
            attn_metadata: AttentionMetadata,
            position_ids: torch.Tensor,
    ) -> torch.Tensor:
        residual = hidden_states
        hidden_states = self.ln_1(hidden_states)
        attn_output = self.attn(
            hidden_states=hidden_states,
            kv_cache=kv_cache,
            position_ids=position_ids,
            attn_metadata=attn_metadata,
        )
        # residual connection
        hidden_states = attn_output + residual

        residual = hidden_states
        hidden_states = self.ln_2(hidden_states)
        feed_forward_hidden_states = self.mlp(hidden_states)
        # residual connection
        hidden_states = residual + feed_forward_hidden_states
        return hidden_states


class CodeShellModel(nn.Module):
    def __init__(
            self,
            config: CodeShellConfig,
            cache_config: Optional[CacheConfig] = None,
            quant_config: Optional[QuantizationConfig] = None,
            prefix: str = "",
    ):
        super().__init__()
        self.config = config
        assert not config.add_cross_attention
        # self.group_query_attention = config.group_query_attention
        # self.num_query_groups = config.num_query_groups
        # self.position_embedding_type = config.position_embedding_type
        self.embed_dim = config.hidden_size
        self.wte = VocabParallelEmbedding(config.vocab_size, self.embed_dim)
        self.wpe = nn.Embedding(config.max_position_embeddings, self.embed_dim)
        self.start_layer, self.end_layer, self.h = make_layers(
            config.num_hidden_layers,
            lambda prefix: CodeShellBlock(
                config, cache_config, quant_config, prefix=prefix),
            prefix=f"{prefix}.h")
        self.ln_f = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_epsilon)

    def forward(
            self,
            input_ids: torch.Tensor,
            position_ids: torch.Tensor,
            kv_caches: List[torch.Tensor],
            attn_metadata: AttentionMetadata,
            intermediate_tensors: Optional[IntermediateTensors],
    ) -> Union[torch.Tensor, IntermediateTensors]:
        if get_pp_group().is_first_rank:
            inputs_embeds = self.wte(input_ids)
            position_embeds = self.wpe(position_ids)
            hidden_states = inputs_embeds + position_embeds
        else:
            assert intermediate_tensors is not None
            hidden_states = intermediate_tensors["hidden_states"]

        for i in range(self.start_layer, self.end_layer):
            layer = self.h[i]
            hidden_states = layer(hidden_states,
                                  kv_caches[i - self.start_layer],
                                  attn_metadata,
                                  position_ids=position_ids, )

        if not get_pp_group().is_last_rank:
            return IntermediateTensors({"hidden_states": hidden_states})

        hidden_states = self.ln_f(hidden_states)
        return hidden_states


class CodeShellForCausalLM(nn.Module):

    def __init__(
            self,
            config: CodeShellConfig,
            cache_config: Optional[CacheConfig] = None,
            quant_config: Optional[QuantizationConfig] = None,
    ):
        super().__init__()
        self.config = config
        self.quant_config = quant_config
        self.transformer = CodeShellModel(config,
                                          cache_config,
                                          quant_config,
                                          prefix="transformer")
        self.lm_head = self.transformer.wte
        self.logits_processor = LogitsProcessor(config.vocab_size)
        self.sampler = Sampler()

    def forward(
            self,
            input_ids: torch.Tensor,
            positions: torch.Tensor,
            kv_caches: List[torch.Tensor],
            attn_metadata: AttentionMetadata,
            intermediate_tensors: Optional[IntermediateTensors] = None,
    ) -> torch.Tensor:
        hidden_states = self.transformer(input_ids, positions, kv_caches,
                                         attn_metadata, intermediate_tensors)
        print("hidden_states", hidden_states.shape)
        return hidden_states

    def compute_logits(self, hidden_states: torch.Tensor,
                       sampling_metadata: SamplingMetadata) -> torch.Tensor:
        logits = self.logits_processor(self.lm_head, hidden_states,
                                       sampling_metadata)
        return logits

    def sample(
            self,
            logits: torch.Tensor,
            sampling_metadata: SamplingMetadata,
    ) -> Optional[SamplerOutput]:
        next_tokens = self.sampler(logits, sampling_metadata)
        return next_tokens

    def make_empty_intermediate_tensors(
            self, batch_size: int, dtype: torch.dtype,
            device: torch.device) -> IntermediateTensors:
        return IntermediateTensors({
            "hidden_states":
                torch.zeros((batch_size, self.config.hidden_size),
                            dtype=dtype,
                            device=device),
        })

    def load_weights(self, weights: Iterable[Tuple[str, torch.Tensor]]):
        params_dict = dict(self.named_parameters(remove_duplicate=False))
        for name, loaded_weight in weights:
            if "lm_head.weight" in name:
                # linear layer.
                continue
            if ".rotary_emb.inv_freq" in name:
                continue

            if ".attn.bias" in name or ".attn.masked_bias" in name:
                # Skip attention mask.
                # NOTE: "c_attn.bias" should not be skipped.
                continue
            if not name.startswith("transformer."):
                name = "transformer." + name

            if is_pp_missing_parameter(name, self):
                continue

            param = params_dict[name]
            # Because of this, we need to transpose the weights.
            # Note(zhuohan): the logic below might break quantized models.
            for conv1d_weight_name in ["c_attn", "c_proj", "c_fc"]:
                if conv1d_weight_name not in name:
                    continue
                if not name.endswith(".weight"):
                    continue
                # loaded_weight = loaded_weight.t()
            weight_loader = getattr(param, "weight_loader",
                                    default_weight_loader)
            weight_loader(param, loaded_weight)

But when I run it, I get an error. The commands I used:

Command 1: python3 -m vllm.entrypoints.openai.api_server --model /root/llms/CodeShell-7B-Chat --host 127.0.0.1 --port 12345 --trust-remote-code

Command 2:
curl 127.0.0.1:12345/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"/root/llms/CodeShell-7B-chat",
        "prompt":"hello. ",
        "temperature":0.7,
        "max_tokens":128}'

Error:

____q,k____ torch.Size([3, 4096]) torch.Size([3, 1024]) torch.Size([2, 1254, 16, 32, 128])
ERROR 12-24 05:55:33 async_llm_engine.py:57] Engine background task failed
ERROR 12-24 05:55:33 async_llm_engine.py:57] Traceback (most recent call last):
ERROR 12-24 05:55:33 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 47, in _log_task_completion
ERROR 12-24 05:55:33 async_llm_engine.py:57]     return_value = task.result()
ERROR 12-24 05:55:33 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 642, in run_engine_loop
ERROR 12-24 05:55:33 async_llm_engine.py:57]     result = task.result()
ERROR 12-24 05:55:33 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 585, in engine_step
ERROR 12-24 05:55:33 async_llm_engine.py:57]     request_outputs = await self.engine.step_async(virtual_engine)
ERROR 12-24 05:55:33 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 254, in step_async
ERROR 12-24 05:55:33 async_llm_engine.py:57]     output = await self.model_executor.execute_model_async(
ERROR 12-24 05:55:33 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 159, in execute_model_async
ERROR 12-24 05:55:33 async_llm_engine.py:57]     output = await make_async(self.driver_worker.execute_model
ERROR 12-24 05:55:33 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/concurrent/futures/thread.py", line 58, in run
ERROR 12-24 05:55:33 async_llm_engine.py:57]     result = self.fn(*self.args, **self.kwargs)
ERROR 12-24 05:55:33 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 273, in execute_model
ERROR 12-24 05:55:33 async_llm_engine.py:57]     output = self.model_runner.execute_model(
ERROR 12-24 05:55:33 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
ERROR 12-24 05:55:33 async_llm_engine.py:57]     return func(*args, **kwargs)
ERROR 12-24 05:55:33 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 1363, in execute_model
ERROR 12-24 05:55:33 async_llm_engine.py:57]     hidden_or_intermediate_states = model_executable(
ERROR 12-24 05:55:33 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
ERROR 12-24 05:55:33 async_llm_engine.py:57]     return self._call_impl(*args, **kwargs)
ERROR 12-24 05:55:33 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
ERROR 12-24 05:55:33 async_llm_engine.py:57]     return forward_call(*args, **kwargs)
ERROR 12-24 05:55:33 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/models/codeshell.py", line 460, in forward
ERROR 12-24 05:55:33 async_llm_engine.py:57]     hidden_states = self.transformer(input_ids, positions, kv_caches,
ERROR 12-24 05:55:33 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
ERROR 12-24 05:55:33 async_llm_engine.py:57]     return self._call_impl(*args, **kwargs)
ERROR 12-24 05:55:33 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
ERROR 12-24 05:55:33 async_llm_engine.py:57]     return forward_call(*args, **kwargs)
ERROR 12-24 05:55:33 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/models/codeshell.py", line 422, in forward
ERROR 12-24 05:55:33 async_llm_engine.py:57]     hidden_states = layer(hidden_states,
ERROR 12-24 05:55:33 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
ERROR 12-24 05:55:33 async_llm_engine.py:57]     return self._call_impl(*args, **kwargs)
ERROR 12-24 05:55:33 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
ERROR 12-24 05:55:33 async_llm_engine.py:57]     return forward_call(*args, **kwargs)
ERROR 12-24 05:55:33 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/models/codeshell.py", line 364, in forward
ERROR 12-24 05:55:33 async_llm_engine.py:57]     attn_output = self.attn(
ERROR 12-24 05:55:33 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
ERROR 12-24 05:55:33 async_llm_engine.py:57]     return self._call_impl(*args, **kwargs)
ERROR 12-24 05:55:33 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
ERROR 12-24 05:55:33 async_llm_engine.py:57]     return forward_call(*args, **kwargs)
ERROR 12-24 05:55:33 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/site-packages/vllm/model_executor/models/codeshell.py", line 293, in forward
ERROR 12-24 05:55:33 async_llm_engine.py:57]     attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
ERROR 12-24 05:55:33 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
ERROR 12-24 05:55:33 async_llm_engine.py:57]     return self._call_impl(*args, **kwargs)
ERROR 12-24 05:55:33 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
ERROR 12-24 05:55:33 async_llm_engine.py:57]     return forward_call(*args, **kwargs)
ERROR 12-24 05:55:33 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/site-packages/vllm/attention/layer.py", line 98, in forward
ERROR 12-24 05:55:33 async_llm_engine.py:57]     return self.impl.forward(query,
ERROR 12-24 05:55:33 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/site-packages/vllm/attention/backends/flash_attn.py", line 494, in forward
ERROR 12-24 05:55:33 async_llm_engine.py:57]     ops.reshape_and_cache_flash(
ERROR 12-24 05:55:33 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/site-packages/vllm/_custom_ops.py", line 572, in reshape_and_cache_flash
ERROR 12-24 05:55:33 async_llm_engine.py:57]     ops.reshape_and_cache_flash(key, value,
ERROR 12-24 05:55:33 async_llm_engine.py:57]   File "/usr/local/lib/python3.10/site-packages/ixformer/inference/functions/vllm.py", line 722, in reshape_and_cache_flash
ERROR 12-24 05:55:33 async_llm_engine.py:57]     ops.infer.vllm_reshape_and_cache(
ERROR 12-24 05:55:33 async_llm_engine.py:57] RuntimeError: Expected key_cache.numel() == num_blocks * num_heads * block_size * head_size to be true, but got false.  (Could this error message be improved?  If so, please report an enhancement request to PyTorch.)
ERROR 12-24 05:55:33 async_llm_engine.py:57] Exception raised from vllm_reshape_and_cache at /home/corex/sw_home/apps/ixformer/src/ixformer/infer/vllm.cpp:523 (most recent call first):
ERROR 12-24 05:55:33 async_llm_engine.py:57] frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f781e9a3687 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
ERROR 12-24 05:55:33 async_llm_engine.py:57] frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, char const*) + 0x68 (0x7f781e95ea66 in /usr/local/lib/python3.10/site-packages/torch/lib/libc10.so)
ERROR 12-24 05:55:33 async_llm_engine.py:57] frame #2: ixformer::infer::vllm_reshape_and_cache(at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, at::Tensor&, long, long) + 0xdb0 (0x7f76f82ab290 in /usr/local/lib/python3.10/site-packages/ixformer/libixformer.so)
ERROR 12-24 05:55:33 async_llm_engine.py:57] frame #3: <unknown function> + 0x40a299 (0x7f76f82a4299 in /usr/local/lib/python3.10/site-packages/ixformer/libixformer.so)
ERROR 12-24 05:55:33 async_llm_engine.py:57] frame #4: <unknown function> + 0x18fa5d (0x7f76f8029a5d in /usr/local/lib/python3.10/site-packages/ixformer/libixformer.so)
ERROR 12-24 05:55:33 async_llm_engine.py:57] <omitting python frames>
ERROR 12-24 05:55:33 async_llm_engine.py:57] 
Exception in callback _log_task_completion(error_callback=<bound method...7f76083e4af0>>)(<Task finishe...n frames>\n')>) at /usr/local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py:37
handle: <Handle _log_task_completion(error_callback=<bound method...7f76083e4af0>>)(<Task finishe...n frames>\n')>) at /usr/local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py:37>
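
For reference, the numbers in the failing check seem to line up with the shapes printed just above, assuming the KV cache layout is [2, num_blocks, block_size, num_kv_heads, head_size]. The snippet below is only arithmetic on the logged shapes, not vLLM code:

```python
# Shapes printed above (assumed layout, for illustration only):
#   kv_cache: [2, 1254, 16, 32, 128]   -> the key cache is kv_cache[0]
#   k:        [3, 1024]                -> 1024 = num_kv_heads * head_size
num_blocks, block_size, head_size = 1254, 16, 128

key_cache_numel = 1254 * 16 * 32 * 128                                     # elements in kv_cache[0]
heads_in_cache = key_cache_numel // (num_blocks * block_size * head_size)  # -> 32
heads_from_attn = 1024 // head_size                                        # -> 8, what c_attn produces for k

# The kernel checks key_cache.numel() == num_blocks * num_heads * block_size * head_size,
# so 32 heads in the allocated cache vs. 8 heads from the attention layer makes it fail.
print(heads_in_cache, heads_from_attn)
```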

@G1017 G1017 added the bug Something isn't working label Dec 24, 2024
@DarkLight1337
Member

Is this a custom model? Please format your code properly using code blocks as it's difficult to read.

@DarkLight1337
Member

What is the problem that you're running into? Please provide more context, not just an error message.

@G1017
Author

G1017 commented Dec 24, 2024

I am trying to add a new model, CodeShell-7B, to vLLM. I wrote the codeshell.py above, but I get this error when running it and I don't know how to fix it.

@DarkLight1337 DarkLight1337 added usage How to use vllm and removed bug Something isn't working labels Dec 24, 2024
@DarkLight1337 DarkLight1337 changed the title [Bug]: add codeshell 7b model :RuntimeError: Expected key_cache.numel() == num_blocks * num_heads * block_size * head_size to be true, but got false. [Usage]: Trying to add codeshell 7b model, but got an error Dec 24, 2024
@DarkLight1337
Member

Do you get this error before vLLM finishes starting up, or only during inference?

@G1017
Author

G1017 commented Dec 24, 2024

Do you get this error before vLLM finishes starting up, or only during inference?

Only during inference; the model loads fine:

root@:~# CUDA_VISIBLE_DEVICES=1 python3 -m vllm.entrypoints.openai.api_server --model /root/llms/CodeShell-7B-Chat --host 127.0.0.1 --port 12345 --trust-remote-code
INFO 12-24 05:52:32 importing.py:11] Triton not installed; certain GPU-related functions will be not be available.
INFO 12-24 05:52:39 api_server.py:339] vLLM API server version 0.5.4
INFO 12-24 05:52:39 api_server.py:340] args: Namespace(host='127.0.0.1', port=12345, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, model='/root/llms/CodeShell-7B-Chat', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', scheduler_delay_factor=0.0, enable_chunked_prefill=False, speculative_model=None, num_speculative_tokens=None, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)
WARNING 12-24 05:52:39 config.py:1454] Casting torch.bfloat16 to torch.float16.
INFO 12-24 05:52:39 llm_engine.py:174] Initializing an LLM engine (v0.5.4) with config: model='/root/llms/CodeShell-7B-Chat', speculative_config=None, tokenizer='/root/llms/CodeShell-7B-Chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=/root/llms/CodeShell-7B-Chat, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 12-24 05:52:39 model_runner.py:720] Starting to load model /root/llms/CodeShell-7B-Chat...
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:02<00:02,  2.48s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:06<00:00,  3.58s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:06<00:00,  3.42s/it]

INFO 12-24 05:52:47 model_runner.py:732] Loading model weights took 14.5477 GB
torch.Size([8192, 6144]) torch.Size([8192, 4096])
(the line above is printed 42 times, once per layer, by the debug print in CodeShellAttention.forward)
hidden_states torch.Size([8192, 4096])
INFO 12-24 05:52:49 gpu_executor.py:102] # GPU blocks: 1254, # CPU blocks: 390
WARNING 12-24 05:52:52 serving_embedding.py:171] embedding_mode is False. Embedding API will not work.
INFO 12-24 05:52:52 launcher.py:14] Available routes are:
INFO 12-24 05:52:52 launcher.py:22] Route: /openapi.json, Methods: HEAD, GET
INFO 12-24 05:52:52 launcher.py:22] Route: /docs, Methods: HEAD, GET
INFO 12-24 05:52:52 launcher.py:22] Route: /docs/oauth2-redirect, Methods: HEAD, GET
INFO 12-24 05:52:52 launcher.py:22] Route: /redoc, Methods: HEAD, GET
INFO 12-24 05:52:52 launcher.py:22] Route: /health, Methods: GET
INFO 12-24 05:52:52 launcher.py:22] Route: /tokenize, Methods: POST
INFO 12-24 05:52:52 launcher.py:22] Route: /detokenize, Methods: POST
INFO 12-24 05:52:52 launcher.py:22] Route: /v1/models, Methods: GET
INFO 12-24 05:52:52 launcher.py:22] Route: /version, Methods: GET
INFO 12-24 05:52:52 launcher.py:22] Route: /v1/chat/completions, Methods: POST
INFO 12-24 05:52:52 launcher.py:22] Route: /v1/completions, Methods: POST
INFO 12-24 05:52:52 launcher.py:22] Route: /v1/embeddings, Methods: POST
INFO:     Started server process [657772]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://127.0.0.1:12345 (Press CTRL+C to quit)
INFO 12-24 05:53:02 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.

@G1017
Author

G1017 commented Dec 24, 2024

An error occurs only when executing the following command:

curl 127.0.0.1:12345/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"/root/llms/CodeShell-7B-chat",
        "prompt":"hello. ",
        "temperature":0.7,
        "max_tokens":128}'

@DarkLight1337
Member

I see that you're using an older version of vLLM. Can you try upgrading vLLM to make sure your dependencies are up-to-date?

@G1017
Author

G1017 commented Dec 24, 2024

I see that you're using an older version of vLLM. Can you try upgrading vLLM to make sure your dependencies are up-to-date?

No, my environment does not support a newer vLLM. Could you please check whether there is any problem with my code?

@DarkLight1337
Member

Can you run collect_env.py to show your environment information?

@G1017
Author

G1017 commented Dec 24, 2024

Collecting environment information...
PyTorch version: 2.1.1
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 10.5.0-1ubuntu1~20.04) 10.5.0
CMake version: version 3.25.2-corex.4.1.2
Libc version: glibc-2.31

Python version: 3.10.12 (main, Aug 16 2024, 18:39:09) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-148-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 10.2.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: Could not collect
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 57 bits virtual
CPU(s):                          112
On-line CPU(s) list:             0-111
Thread(s) per core:              2
Core(s) per socket:              28
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           106
Model name:                      Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz
Stepping:                        6
CPU MHz:                         800.000
CPU max MHz:                     3100.0000
CPU min MHz:                     800.0000
BogoMIPS:                        4000.00
Virtualization:                  VT-x
L1d cache:                       2.6 MiB
L1i cache:                       1.8 MiB
L2 cache:                        70 MiB
L3 cache:                        84 MiB
NUMA node0 CPU(s):               0-27,56-83
NUMA node1 CPU(s):               28-55,84-111
Vulnerability Itlb multihit:     Not affected
Vulnerability L1tf:              Not affected
Vulnerability Mds:               Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Mmio stale data:   Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq rdpid md_clear pconfig flush_l1d arch_capabilities

Versions of relevant libraries:
[pip3] mypy-protobuf==3.6.0
[pip3] numpy==1.26.4
[pip3] torch==2.1.1+corex.4.1.2
[pip3] torch_cluster==1.6.0+corex.4.1.2
[pip3] torch_quiver==0.1.0+corex.4.1.2
[pip3] torch_scatter==2.1.0+corex.4.1.2
[pip3] torch_sparse==0.6.16+corex.4.1.2
[pip3] torchaudio==2.1.0+corex.4.1.2
[pip3] torchdebug==0.1.0+corex.4.1.2
[pip3] torchvision==0.16.0+corex.4.1.2
[pip3] triton==2.1.0+corex.4.1.2
[conda] Could not collect

@DarkLight1337
Member

For vLLM 0.5.4, you should have PyTorch 2.4.0. See https://github.com/vllm-project/vllm/blob/v0.5.4/requirements-cuda.txt
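
If it helps, a quick way to confirm what is actually installed (plain Python, nothing vLLM-internal assumed):

```python
# Print the installed versions to compare against requirements-cuda.txt
# for the vLLM release you are running.
import torch
import vllm

print("vllm :", vllm.__version__)
print("torch:", torch.__version__)  # vLLM 0.5.4 expects torch 2.4.0 on CUDA builds
```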

@G1017
Author

G1017 commented Dec 24, 2024

But other models already integrated into vLLM, such as Qwen, run normally.

@DarkLight1337
Member

DarkLight1337 commented Dec 24, 2024

I want to rule out incompatibilities in the dependencies first, before looking further into the code. Otherwise, it would be difficult for me to recreate your environment - how could I help you then?

@G1017
Author

G1017 commented Dec 24, 2024

Since my device is not an NVIDIA GPU, the torch build in my environment only supports this version of vLLM. Please help me check the code. The reference transformers implementation of CodeShell-7B is at https://github.com/WisdomShell/codeshell/blob/09d1adc88ccada1a92924c69ece0cf0e73899b1b/model/modeling_codeshell.py

@G1017
Author

G1017 commented Dec 24, 2024

Model: https://huggingface.co/WisdomShell/CodeShell-7B-Chat#model-details. I look forward to your help in solving the above problems. Thank you very much.

@DarkLight1337
Member

DarkLight1337 commented Dec 24, 2024

When trying to run this model in the latest version of vLLM, I found that there is no .wpe in this model, so you should remove it. There is also a redundant rope implementation in the attention layer.
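
Roughly, the direction I mean looks like the sketch below. It is untested, reuses the names already defined in your codeshell.py, and is not a confirmed fix:

```python
# In CodeShellAttention.__init__: keep only the vLLM rope and drop
# CodeShellRotaryEmbedding, _init_rope and the duplicate rotary_emb1.
self.rotary_emb = get_rope(
    self.head_dim,
    rotary_dim=self.head_dim // 2,
    max_position=max_positions,
    base=10000 * rope_ratio,
    is_neox_style=False,
)

# In CodeShellAttention.forward: apply that single rope and nothing else.
qkv, _ = self.c_attn(hidden_states)
q, k, v = qkv.split([self.hidden_size, self.kv_dim, self.kv_dim], dim=-1)
q, k = self.rotary_emb(position_ids, q, k)
attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
attn_output, _ = self.c_proj(attn_output)

# In CodeShellModel: remove self.wpe and the position_embeds term,
# since the checkpoint has no wpe weight.
if get_pp_group().is_first_rank:
    hidden_states = self.wte(input_ids)
```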

@DarkLight1337
Member

DarkLight1337 commented Dec 24, 2024

OK, after reusing some of the code from the existing models, I'm able to run the model on my branch using the latest version of vLLM (https://github.com/DarkLight1337/vllm/tree/codeshell) without crashing.

However the output is still garbled, probably need to step through the results with a debugger and compare with HF to see where it goes wrong.
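
One way to narrow it down is to run the same prompt greedily through both HF transformers and vLLM and compare. A minimal sketch, assuming the model path from this issue and that the codeshell model is registered in your vLLM build:

```python
# Compare greedy generations from HF transformers and vLLM for the same prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer
from vllm import LLM, SamplingParams

model_path = "/root/llms/CodeShell-7B-Chat"  # path used in this issue
prompt = "hello. "

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
hf_model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype="auto", device_map="auto")
inputs = tokenizer(prompt, return_tensors="pt").to(hf_model.device)
hf_out = hf_model.generate(**inputs, max_new_tokens=32, do_sample=False)
print("HF  :", tokenizer.decode(hf_out[0], skip_special_tokens=True))

llm = LLM(model=model_path, trust_remote_code=True)
vllm_out = llm.generate([prompt], SamplingParams(temperature=0.0, max_tokens=32))
print("vLLM:", vllm_out[0].outputs[0].text)
```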

@G1017
Author

G1017 commented Dec 25, 2024

OK, after reusing some of the code from the existing models, I'm able to run the model on my branch using the latest version of vLLM (https://github.com/DarkLight1337/vllm/tree/codeshell) without crashing.

However the output is still garbled, probably need to step through the results with a debugger and compare with HF to see where it goes wrong.

Thank you very much for your help. I am also seeing the garbled output now. I will try to fix it on my side first.

@G1017
Author

G1017 commented Dec 30, 2024

OK, after reusing some of the code from the existing models, I'm able to run the model on my branch using the latest version of vLLM (https://github.com/DarkLight1337/vllm/tree/codeshell) without crashing.

However the output is still garbled, probably need to step through the results with a debugger and compare with HF to see where it goes wrong.

Do you have any new progress on the garbled output?

@DarkLight1337
Member

No, I thought you were looking into this. I am working on other PRs.

@DarkLight1337
Member

Refer to #11681
