MODEL REQUESTS #69
Comments
A gemma-2-27b-it in 8 bits for both |
Thanks - looking for fp8 for H100 and int8 for A100? |
Exactly! |
Can you share more about the issue you were seeing? |
I'm getting empty generations and unserializable logits, indicating NaNs in model outputs.
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    targets: ["Linear"]
""" |
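For anyone trying to confirm the same symptom, here is a minimal sketch of how NaNs show up as "unserializable" values. It uses only the standard library; the helper names `find_non_finite` and `dumps_strict` and the sample `logits` row are hypothetical, not part of llm-compressor or vLLM.

```python
import json
import math


def find_non_finite(values):
    """Return (index, value) pairs for every NaN or infinite entry."""
    return [(i, v) for i, v in enumerate(values) if not math.isfinite(v)]


def dumps_strict(values):
    """Serialize to JSON, raising ValueError on NaN/Infinity."""
    return json.dumps(values, allow_nan=False)


# Hypothetical logits row; real logits would come from the model output.
logits = [0.12, float("nan"), -3.4, float("inf")]

print("non-finite entries:", find_non_finite(logits))
try:
    dumps_strict(logits)
except ValueError as err:
    print("not serializable:", err)
```

Checking the positions of the non-finite entries first can help tell a fully broken output (every value NaN) apart from a few overflowing activations.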
Could be a FlashInfer issue. I'll work on an example for you |
Hi @robertgshaw2-neuralmagic , could we get an update to https://huggingface.co/neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8 ? The main model https://huggingface.co/mistralai/Mixtral-8x22B-Instruct-v0.1 had its tokenizer updated recently and it would be great to incorporate these into the quantized model. |
Hi ! |
Absolutely @Lin-K76 - could you update this when you have a chance this week |
We can take a look at this, adding support for Vision models is on our roadmap but we need to try it out a bit more. |
@BlackSamorez - I made a couple examples with
Note: Here's install instructions on the vllm side:
export VLLM_VERSION=0.5.4
pip install https://vllm-wheels.s3.us-west-2.amazonaws.com/nightly/vllm-${VLLM_VERSION}-cp38-abi3-manylinux1_x86_64.whl
pip install lm_eval==0.4.3
pip install https://github.com/flashinfer-ai/flashinfer/releases/download/v0.1.2/flashinfer-0.1.2+cu121torch2.4-cp310-cp310-linux_x86_64.whl
Eval
MODEL=google/gemma-2-27b-it
VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=$MODEL,add_bos_token=true --tasks gsm8k --num_fewshot 5 --limit 250 --batch_size "auto"
vllm (pretrained=google/gemma-2-27b-it,add_bos_token=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.864|± |0.0217|
| | |strict-match | 5|exact_match|↑ |0.848|± |0.0228|
Eval
MODEL=gemma-2-27b-it-FP8-Dynamic
VLLM_ATTENTION_BACKEND=FLASHINFER lm_eval --model vllm --model_args pretrained=$MODEL,add_bos_token=true --tasks gsm8k --num_fewshot 5 --limit 250 --batch_size "auto"
vllm (pretrained=gemma-2-27b-it-FP8-Dynamic,add_bos_token=true), gen_kwargs: (None), limit: 250.0, num_fewshot: 5, batch_size: auto
|Tasks|Version| Filter |n-shot| Metric | |Value| |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k| 3|flexible-extract| 5|exact_match|↑ |0.856|± |0.0222|
| | |strict-match | 5|exact_match|↑ |0.852|± |0.0225|
The We will push a model up to the hub later this week once we have a chance to QA it. |
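To read the two gsm8k tables side by side, a small sketch like the following can compute the recovery ratio and check whether the gap is inside the reported standard errors. The numbers are copied from the tables above; the helper names `recovery` and `within_noise`, and the choice to combine the two standard errors in quadrature, are assumptions for illustration.

```python
import math


def recovery(quantized: float, baseline: float) -> float:
    """Quantized score as a fraction of the baseline score."""
    return quantized / baseline


def within_noise(a: float, a_err: float, b: float, b_err: float) -> bool:
    """True if |a - b| is smaller than the combined standard error."""
    return abs(a - b) < math.sqrt(a_err**2 + b_err**2)


# (value, stderr) pairs taken from the two lm_eval tables above.
baseline = {"flexible-extract": (0.864, 0.0217), "strict-match": (0.848, 0.0228)}
fp8 = {"flexible-extract": (0.856, 0.0222), "strict-match": (0.852, 0.0225)}

for name in baseline:
    (b, b_err), (q, q_err) = baseline[name], fp8[name]
    print(
        f"{name}: recovery={recovery(q, b):.3f} "
        f"within_noise={within_noise(q, q_err, b, b_err)}"
    )
```

With only 250 samples per run (`--limit 250`), the standard errors are large relative to the differences, so this comparison is a sanity check rather than a precise accuracy measurement.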
Hi, the new model is now live at https://huggingface.co/neuralmagic/Mixtral-8x22B-Instruct-v0.1-FP8. |
Thanks @Lin-K76 ! |
Qwen2 series in Oneshot with 2:4 sparse or GPTQ alone is fine, but not both. Do I need to change my calibration dataset or GPTQ config? |
Thanks @yzlnew, I will take a look. My suggestion though would be to use the W8A8 (int8 on ampere / fp8 on hopper) for production use cases as this will give you the best recovery and performance right now. We are still working on making sparsity better. I will work on a demo for you later this week though :) |
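As a rough illustration of the "int8 on Ampere / fp8 on Hopper" rule of thumb, the sketch below picks a W8A8 format from a GPU's CUDA compute capability. The function name `pick_w8a8_format` is hypothetical, and the 8.9 threshold for FP8 hardware support is an assumption stated here rather than something taken from this thread (A100 is compute capability 8.0, H100 is 9.0).

```python
def pick_w8a8_format(major: int, minor: int) -> str:
    """Pick a W8A8 weight/activation format from CUDA compute capability.

    Assumption: FP8 is used from compute capability 8.9 upward,
    and INT8 is used on anything older.
    """
    if (major, minor) >= (8, 9):
        return "fp8"
    return "int8"


# A100 (8.0) and H100 (9.0) as example inputs.
for name, capability in {"A100": (8, 0), "H100": (9, 0)}.items():
    print(name, pick_w8a8_format(*capability))
```

On a machine with PyTorch and a GPU, the capability tuple could come from `torch.cuda.get_device_capability()`; it is passed in explicitly here so the sketch runs anywhere.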
Hermes 3 70B in int4 would be great! |
neuralmagic/DeepSeek-Coder-V2-Instruct-FP8 is great! How about DeepSeek-Coder-V2-Instruct in W8A8 (INT8)? I think DeepSeek-Coder-V2-Instruct-W8A8 could be great! Or are there any instructions to help me quantize DeepSeek-Coder-V2-Instruct to W8A8 (INT8)? |
Hello! Currently in vllm, we only support FP8 inference for MoE models. We are about to add support for W4A16 (PR is landing ideally today/tomorrow) and will follow up with W8A16. We currently do not have an active plan for W8A8, but can consider this on our roadmap. |
Hi, can I please ask for a gemma-2-27b-int8? It's a good fit for 48GB cards and I'd love to run it with vLLM. Many quantization methods seem broken for this model unfortunately... would really appreciate it! |
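A back-of-the-envelope sketch of why 8-bit weights matter for a 48GB card: it estimates weight memory only, assuming roughly 27e9 parameters for gemma-2-27b (an approximation, not an exact count) and ignoring KV cache, activations, and any layers left unquantized. The helper name `weight_gib` is hypothetical.

```python
def weight_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    return num_params * bits_per_weight / 8 / 2**30


PARAMS = 27e9  # assumption: approximate parameter count for a 27B model

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_gib(PARAMS, bits):.1f} GiB")
```

Under these assumptions the 16-bit weights alone exceed 48 GiB, while the 8-bit weights leave room for the KV cache.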
DeepSeek-Coder-V2-Instruct in W4A16 would be great! Looking forward to your model release. |
I tried to quantize deepseek-coder-v2 to w4a16, but the following error occurred. |
What is your transformers version? Also - note that quantization support for MoEs is still under construction in vllm. |
Do you mean this PR #7766 ? for W4A16 ? @robertgshaw2-neuralmagic |
I see, I forgot to set trust_remote_code=True. |
yes |
Release v0.5.6 will support it. Need this PR: vllm-project/vllm#7766 |
Is this PR still in progress? Do you have an estimated timeline? |
@robertgshaw2-neuralmagic I use this framework with 512 data points to calibrate the quantized deepseek-v2.5 model. The output result is "!!". Are there any tricks for quantizing this model?
from llmcompressor.transformers import SparseAutoModelForCausalLM
from transformers import AutoTokenizer
import argparse
from typing import Dict, Union
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
import psutil
import torch
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoModelForCausalLM
import flash_attn
from datasets import load_dataset

print(flash_attn.__version__)


def custom_offload_device_map(
    model_stub: str,
    max_memory_per_gpu: Union[str, int],
    max_memory_gpu0: Union[str, int],
    num_gpus: int = 1,
    offload_buffers: bool = False,
    **model_kwargs,
) -> Dict[Union[int, str], Union[int, str]]:
    """
    Calculates the optimal gpu mappings for model_stub stored as torch_dtype, where
    each GPU is restricted to allocating a specific amount of memory.
    :param model_stub: local path or HF stub to calculate mapping for
    :param max_memory_per_gpu: Max memory to allocate on each GPU, as either a string
        such as "10GB" or an integer number of bytes
    :param max_memory_gpu0: Max memory to allocate on GPU 0, same format as above
    :param num_gpus: number of gpus to utilize
    :param offload_buffers: forwarded to infer_auto_device_map
    :param model_kwargs: keyword arguments to pass to model initializer
    :return: memory mapping for layers of model_stub to be passed to from_pretrained()
    """
    max_cpu_memory = psutil.virtual_memory().available
    memory_limits = {device: max_memory_per_gpu for device in range(1, num_gpus)}
    memory_limits[0] = max_memory_gpu0
    memory_limits["cpu"] = max_cpu_memory
    # Build the model on the meta device so no weights are actually loaded.
    with init_empty_weights():
        dummy_model = AutoModelForCausalLM.from_pretrained(model_stub, **model_kwargs)
        device_map = infer_auto_device_map(
            dummy_model,
            max_memory=memory_limits,
            no_split_module_classes=dummy_model._no_split_modules,
            offload_buffers=offload_buffers
        )
        del dummy_model
    return device_map


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-id", type=str, default="/opt/tiger/deepseek_http/models--deepseek-ai--DeepSeek-V2.5")
    parser.add_argument("--dataset-dir", type=str,
                        default="/opt/tiger/deepseek_http/datasets--HuggingFaceH4--ultrachat_200k")
    parser.add_argument("--max-memory-per-gpu", type=str, default="52GB")
    parser.add_argument("--max-memory-gpu0", type=str, default="52GB")
    parser.add_argument("--device-map", type=str, default='auto')
    parser.add_argument("--num-samples", type=int, default=512)
    parser.add_argument("--offload-buffers", action='store_true')
    parser.add_argument("--max-model-len", type=int, default=8192)
    parser.add_argument("--sequential-update", action='store_true')
    parser.add_argument("--dataset-split", type=str, default='train_sft')
    args = parser.parse_args()

    # Select calibration dataset.
    DATASET_ID = args.dataset_dir
    DATASET_SPLIT = args.dataset_split
    MAX_SEQUENCE_LENGTH = args.max_model_len
    NUM_CALIBRATION_SAMPLES = args.num_samples

    # Load dataset and preprocess.
    ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
    ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
    tokenizer = AutoTokenizer.from_pretrained(args.model_id)

    def preprocess(example):
        # Accept either chat-style "messages" or plain input/output pairs.
        if 'messages' in example:
            messages = example['messages']
        elif 'input' in example and 'output' in example:
            messages = [
                {
                    "role": "user",
                    "content": example['input']
                },
                {
                    "role": "assistant",
                    "content": example['output']
                }
            ]
        else:
            raise ValueError("invalid example")
        return {
            "text": tokenizer.apply_chat_template(
                messages,
                tokenize=False,
            )
        }

    ds = ds.map(preprocess)

    # Tokenize inputs.
    def tokenize(sample):
        return tokenizer(
            sample["text"],
            padding=False,
            max_length=MAX_SEQUENCE_LENGTH,
            truncation=True,
            add_special_tokens=False,
        )

    ds = ds.map(tokenize, remove_columns=ds.column_names)

    # define a llmcompressor recipe for W4A16 quantization
    recipe = GPTQModifier(
        targets="Linear", scheme="W4A16", ignore=["lm_head"], sequential_update=args.sequential_update
    )

    if args.device_map == "cpu":
        model = SparseAutoModelForCausalLM.from_pretrained(
            args.model_id, device_map="cpu", torch_dtype=torch.bfloat16, trust_remote_code=True
        )
    else:
        device_map = custom_offload_device_map(
            model_stub=args.model_id,
            max_memory_per_gpu=args.max_memory_per_gpu,
            max_memory_gpu0=args.max_memory_gpu0,
            num_gpus=8,
            offload_buffers=args.offload_buffers,
            trust_remote_code=True
        )
        model = SparseAutoModelForCausalLM.from_pretrained(
            args.model_id, device_map=device_map, torch_dtype=torch.bfloat16, trust_remote_code=True
        )

    SAVE_DIR = args.model_id + '-W4A16'
    oneshot(
        model=model, dataset=ds,
        recipe=recipe,
        max_seq_length=MAX_SEQUENCE_LENGTH,
        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    )

    # Save to disk compressed.
    model.save_pretrained(SAVE_DIR, save_compressed=True,
                          skip_compression_stats=True)
    tokenizer.save_pretrained(SAVE_DIR) |
Thanks @fengyang95 - @dsikka is looking into this |
Hey @fengyang95 - investigating this issue. Will update once fixed. |
Hi @fengyang95 - can you share the code you're using which generates
We have also added this example which you can follow:
You'll need to use the latest main to pull in a fix that was needed for deepseek_v2 |
python3 -m vllm.entrypoints.openai.api_server --model DeepSeek-V2.5-W4A16 --served-model-name dsv2 --trust-remote-code --tensor-parallel-size 8 --max-model-len 16384 --port $PORT0 --gpu-memory-utilization 0.9 --quantization compressed-tensors --enforce-eager |
Thank you, I'll try it right away. |
Hi @dsikka , I followed your suggestion to ignore the gate parameter and updated the code. However, the quantized model still outputs "!!!". Have you tested this on DeepSeek-v2.5? |
Hi @fengyang95 there was a bug in vLLM which has now been fixed on main. Do you mind trying it again? |
I'll try it asap |
I am getting the following error while trying to run https://huggingface.co/nm-testing/DeepSeek-V2.5-W4A16
Process SpawnProcess-1:
Traceback (most recent call last):
File "/vllm/vllm/worker/model_runner_base.py", line 112, in _wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/worker/model_runner.py", line 1547, in execute_model
hidden_or_intermediate_states = model_executable(
^^^^^^^^^^^^^^^^^
File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/model_executor/models/deepseek_v2.py", line 504, in forward
hidden_states = self.model(input_ids, positions, kv_caches,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/model_executor/models/deepseek_v2.py", line 461, in forward
hidden_states, residual = layer(positions, hidden_states,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/model_executor/models/deepseek_v2.py", line 401, in forward
hidden_states = self.mlp(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/model_executor/models/deepseek_v2.py", line 148, in forward
final_hidden_states = self.experts(
^^^^^^^^^^^^^
File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 469, in forward
final_hidden_states = self.quant_method.apply(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py", line 285, in apply
return fused_marlin_moe(
^^^^^^^^^^^^^^^^^
File "/vllm/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 150, in fused_marlin_moe
assert hidden_states.dtype == torch.float16
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/vllm/vllm/entrypoints/openai/rpc/server.py", line 242, in run_rpc_server
server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/entrypoints/openai/rpc/server.py", line 34, in __init__
self.engine = AsyncLLMEngine.from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/engine/async_llm_engine.py", line 576, in from_engine_args
engine = cls(
^^^^
File "/vllm/vllm/engine/async_llm_engine.py", line 471, in __init__
self.engine = self._engine_class(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/engine/async_llm_engine.py", line 260, in __init__
super().__init__(*args, **kwargs)
File "/vllm/vllm/engine/llm_engine.py", line 331, in __init__
self._initialize_kv_caches()
File "/vllm/vllm/engine/llm_engine.py", line 465, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/executor/distributed_gpu_executor.py", line 39, in determine_num_available_blocks
num_blocks = self._run_workers("determine_num_available_blocks", )
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/executor/multiproc_gpu_executor.py", line 185, in _run_workers
driver_worker_output = driver_worker_method(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/worker/worker.py", line 223, in determine_num_available_blocks
self.model_runner.profile_run()
File "/vllm/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/worker/model_runner.py", line 1219, in profile_run
self.execute_model(model_input, kv_caches, intermediate_tensors)
File "/vllm/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/worker/model_runner_base.py", line 126, in _wrapper
raise type(err)(
AssertionError: Error in model execution (input dumped to /tmp/err_execute_model_input_20240917-022954.pkl):
ERROR 09-17 02:30:01 api_server.py:203] RPCServer process died before responding to readiness probe
/usr/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
The command I ran: |
Hi @TheAhmadOsman - the current kernel only supports float16. Could you pass that in for |
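One way to spot this mismatch before launching the server is to look at the `torch_dtype` field that Hugging Face checkpoints usually record in `config.json`. The sketch below is an illustration only: the helper names and the sample config excerpt are hypothetical, and the set of supported dtypes is taken from the `assert hidden_states.dtype == torch.float16` line in the traceback above. vLLM does expose a `--dtype` option, though whether that is the setting meant in the comment above is not stated.

```python
import json

# Taken from the assertion in the traceback: the kernel expects float16.
SUPPORTED_KERNEL_DTYPES = {"float16"}


def checkpoint_dtype(config_text: str) -> str:
    """Read the dtype recorded in a config.json string, if any."""
    return json.loads(config_text).get("torch_dtype", "unknown")


def needs_dtype_override(config_text: str) -> bool:
    """True when the checkpoint's default dtype is not supported by the kernel."""
    return checkpoint_dtype(config_text) not in SUPPORTED_KERNEL_DTYPES


# Hypothetical excerpt of a config.json; a real file has many more fields.
example = '{"model_type": "deepseek_v2", "torch_dtype": "bfloat16"}'
print(checkpoint_dtype(example), needs_dtype_override(example))
```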
@dsikka running
Process SpawnProcess-1:
Traceback (most recent call last):
File "/vllm/vllm/worker/model_runner_base.py", line 116, in _wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/worker/model_runner.py", line 1590, in execute_model
hidden_or_intermediate_states = model_executable(
^^^^^^^^^^^^^^^^^
File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/model_executor/models/deepseek_v2.py", line 504, in forward
hidden_states = self.model(input_ids, positions, kv_caches,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/model_executor/models/deepseek_v2.py", line 461, in forward
hidden_states, residual = layer(positions, hidden_states,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/model_executor/models/deepseek_v2.py", line 401, in forward
hidden_states = self.mlp(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/model_executor/models/deepseek_v2.py", line 148, in forward
final_hidden_states = self.experts(
^^^^^^^^^^^^^
File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/model_executor/layers/fused_moe/layer.py", line 469, in forward
final_hidden_states = self.quant_method.apply(
^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/model_executor/layers/quantization/compressed_tensors/compressed_tensors_moe.py", line 285, in apply
return fused_marlin_moe(
^^^^^^^^^^^^^^^^^
File "/vllm/vllm/model_executor/layers/fused_moe/fused_marlin_moe.py", line 171, in fused_marlin_moe
sorted_token_ids, _, _ = moe_align_block_size(topk_ids, block_size_m, E)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/model_executor/layers/fused_moe/fused_moe.py", line 228, in moe_align_block_size
ops.moe_align_block_size(topk_ids, num_experts, block_size, sorted_ids,
File "/vllm/vllm/_custom_ops.py", line 32, in wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/_custom_ops.py", line 800, in moe_align_block_size
torch.ops._C.moe_align_block_size(topk_ids, num_experts, block_size,
File "/vllm/venv/lib/python3.11/site-packages/torch/_ops.py", line 1061, in __call__
return self_._op(*args, **(kwargs or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/usr/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.11/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/vllm/vllm/entrypoints/openai/rpc/server.py", line 242, in run_rpc_server
server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/entrypoints/openai/rpc/server.py", line 34, in __init__
self.engine = AsyncLLMEngine.from_engine_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/engine/async_llm_engine.py", line 576, in from_engine_args
engine = cls(
^^^^
File "/vllm/vllm/engine/async_llm_engine.py", line 471, in __init__
self.engine = self._engine_class(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/engine/async_llm_engine.py", line 260, in __init__
super().__init__(*args, **kwargs)
File "/vllm/vllm/engine/llm_engine.py", line 331, in __init__
self._initialize_kv_caches()
File "/vllm/vllm/engine/llm_engine.py", line 465, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/executor/distributed_gpu_executor.py", line 39, in determine_num_available_blocks
num_blocks = self._run_workers("determine_num_available_blocks", )
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/executor/multiproc_gpu_executor.py", line 185, in _run_workers
driver_worker_output = driver_worker_method(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/vllm/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/worker/worker.py", line 223, in determine_num_available_blocks
self.model_runner.profile_run()
File "/vllm/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/worker/model_runner.py", line 1236, in profile_run
self.execute_model(model_input, kv_caches, intermediate_tensors)
File "/vllm/venv/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/vllm/vllm/worker/model_runner_base.py", line 144, in _wrapper
raise type(err)(
RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20240917-230234.pkl): CUDA error: invalid argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. |
@dsikka I just noticed that |
@dsikka any thoughts? I'd appreciate any pointers |
Hi @TheAhmadOsman - the following command worked for me. Do you mind trying it?
There was a bug introduced in vllm recently so you'll have to wait until the following bug fix lands: |
I see a lot of Deepseek 2.5 discussion here. I'm very interested in an FP8 version so we can deploy on some H100's optimally. Appreciate all the work this team has done! |
Hi ! Thanks in advance |
For byroneverson/internlm2_5-20b-chat-abliterated, can you quantize it to W8A8?
Still failed, sad. |
Would the fp8 models published by Neural Magic work with tpu_int8 quantization in vLLM? This is the error I get:
Should I try to publish int8 models, and would those possibly work with compressed-tensors? This is the error I get when letting it choose the quantization method:
|
@Syst3m1cAn0maly we have a Phi 3.5 vision model that was quantized to FP8, you can try it here https://huggingface.co/nm-testing/Phi-3.5-vision-instruct-FP8-dynamic @samos123 for INT8 backends you must use INT8 models, maybe you can try some of the w8a8 models here https://huggingface.co/collections/neuralmagic/int8-llms-for-vllm-668ec32c049dca0369816415 |
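To check whether a given checkpoint is FP8 or INT8 before pairing it with a backend, a sketch like this can inspect the `quantization_config` block of its `config.json`. The layout assumed here (a `config_groups` mapping whose groups carry `weights.num_bits` and `weights.type`) mirrors the recipe posted earlier in this thread; treat it as an assumption and verify it against the actual file. The function name `weight_formats` and the sample config are hypothetical.

```python
def weight_formats(config: dict) -> set:
    """Collect weight formats such as 'fp8' or 'int8' from a parsed config.json."""
    groups = config.get("quantization_config", {}).get("config_groups", {})
    formats = set()
    for group in groups.values():
        weights = group.get("weights") or {}
        kind = "fp" if weights.get("type") == "float" else "int"
        formats.add(f"{kind}{weights.get('num_bits')}")
    return formats


# Hypothetical, trimmed-down config for illustration.
sample = {
    "quantization_config": {
        "quant_method": "compressed-tensors",
        "config_groups": {
            "group_0": {
                "weights": {"num_bits": 8, "type": "float"},
                "targets": ["Linear"],
            }
        },
    }
}
print(weight_formats(sample))
```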
@mgoin I tried the w8a8 model as well but there is no TPU support for compressed tensors just yet, so it didn't work. I got an error saying exactly that. The PR by @robertgshaw2-neuralmagic should fix this though: vllm-project/vllm#9301 |
TPUs do not support fp8 quantization for acceleration. So we are focusing on:
Ideally will have something ready for next release. TBD on perf |
Hey @rahul-tuli - could you provide some guidance on the SmoothQuant mappings here? |
Can't quantize CohereForAI/aya-expanse-8b to W8A8. I used the latest llmcompressor_dev-0.2.0.dev0 code, installed via python setup.py:
|
I tried to manually locate post_attention_layernorm as described in llmcompressor/modifiers/smoothquant/README.md, but I couldn't. There is no post_attention_layernorm or anything with similar name in model.safetensors.index.json, and I can't find any comments related to "post" in transformers\models\cohere\modeling_cohere.py. |
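When hunting for the right layer names for SmoothQuant mappings, it can help to list every norm-like module recorded in `model.safetensors.index.json` rather than searching for one expected name. The sketch below reads the `weight_map` of such an index and collapses layer indices into a placeholder. The function name `norm_modules` and the sample index are hypothetical; the sample is not the real aya-expanse-8b index.

```python
import json
import re


def norm_modules(index_text: str) -> list:
    """List distinct module names containing 'norm', with layer indices collapsed."""
    weight_map = json.loads(index_text)["weight_map"]
    names = set()
    for param in weight_map:
        module = param.rsplit(".", 1)[0]  # drop the trailing ".weight" / ".bias"
        if "norm" in module.lower():
            names.add(re.sub(r"\.\d+\.", ".{i}.", module))
    return sorted(names)


# Hypothetical index excerpt for illustration only.
sample_index = json.dumps({
    "weight_map": {
        "model.layers.0.input_layernorm.weight": "model-00001.safetensors",
        "model.layers.0.self_attn.q_proj.weight": "model-00001.safetensors",
        "model.layers.1.input_layernorm.weight": "model-00002.safetensors",
        "model.norm.weight": "model-00002.safetensors",
    }
})
print(norm_modules(sample_index))
```

If the list contains no per-layer norm after attention, the default mappings that refer to `post_attention_layernorm` will not match that architecture and would need to be adapted.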
Please comment here any model requests for:
llm-compressor