
API causes slowdown in batch request handling #1707

Closed
jpeig opened this issue Nov 17, 2023 · 42 comments
Assignees: simon-mo
Labels: bug (Something isn't working), unstale (Received activity after being labelled stale)

Comments

@jpeig

jpeig commented Nov 17, 2023

Using the API server and submitting multiple prompts in a single request to take advantage of the speed benefit returns the following error:

"multiple prompts in a batch is not currently supported"

What's the point of vLLM without being able to send batches to the API?

Of course, I can send multiple separate requests, but those are handled sequentially and do not benefit from speed improvements.

Correct me if I'm wrong...

@jpeig
Author

jpeig commented Nov 17, 2023

#1636

@simon-mo
Collaborator

Of course, I can send multiple separate requests, but those are handled sequentially and do not benefit from speed improvements.

This is not correct. vLLM automatically batches in-flight requests. It is built for the use case of high request concurrency. This means that when you send multiple individual requests, the underlying engine running in the server performs batching.
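For illustration (not part of the original comment), a minimal sketch of this usage pattern, assuming a vLLM OpenAI-compatible server on localhost:8000 and the pre-1.0 `openai` Python client that this thread uses; each prompt goes out as its own request, and the server batches whatever is in flight:

import asyncio
import openai

openai.api_base = "http://localhost:8000/v1"   # assumed server address
openai.api_key = "EMPTY"                       # the vLLM server does not check the key

MODEL = openai.Model.list()["data"][0]["id"]   # first (only) served model

async def complete(prompt: str):
    return await openai.Completion.acreate(
        model=MODEL,
        prompt=prompt,
        max_tokens=256,
    )

async def main(prompts):
    # Each prompt is a separate request; the server's engine batches
    # whatever requests are in flight at the same time.
    return await asyncio.gather(*(complete(p) for p in prompts))

results = asyncio.run(main(["Hello,", "Once upon a time,", "The capital of France is"]))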

@simon-mo
Collaborator

Further illustrated here, hope the explanation is helpful: #1636 (comment)

@jpeig
Author

jpeig commented Nov 18, 2023

It is, thank you for the detailed answer.

@jpeig jpeig closed this as completed Nov 18, 2023
@simon-mo
Collaborator

Ah, one more thing: if you are observing sequential behavior, try the current main branch instead of the released version, or turn on the flag --engine-use-ray. In the released version, our AsyncLLMEngine is single-threaded and there's a fairly small chance that concurrent queries don't get picked up, due to unfairness in Python asyncio.

This should be fixed as we work on #1677

@jpeig
Author

jpeig commented Nov 19, 2023

@simon-mo I'm using asyncio.gather to call the API (via the acreate function), so AsyncLLMEngine should be able to handle the queries concurrently. However, I am still experiencing semi-sequential behavior whereby requests get added to the queue one after another with seconds of delay in between. I'll try out the main branch.

@simon-mo
Collaborator

v0.2.2 was released last night. It should include the change. Please try it out and let us know!

@jpeig
Author

jpeig commented Nov 21, 2023

@simon-mo

I'm on main branch (latest).

I still notice 0.5 to 1 second between each request being added to the queue.
In the meantime, no requests are being processed.

Only after all requests have been added do they execute concurrently.

Is this expected behavior?

@jpeig jpeig reopened this Nov 21, 2023
@jpeig
Author

jpeig commented Nov 21, 2023

INFO 11-21 01:07:22 async_llm_engine.py:370] Received request cmpl-78ae9b5f36b241c0b64131e838f2a85f
INFO 11-21 01:07:23 async_llm_engine.py:370] Received request cmpl-cb5e63e3f8d64d8a9b39afc7d9147c5b
INFO 11-21 01:07:24 async_llm_engine.py:370] Received request cmpl-db56d6b1a0f94ef7990f0eb21b98fcd1

etc...

Because I have sent quite a large number of requests to the API, no requests are processed by vLLM for 14 seconds (no GPU load).

@simon-mo
Collaborator

Did you turn on engine-use-ray?

@simon-mo simon-mo added the bug Something isn't working label Nov 21, 2023
@simon-mo simon-mo self-assigned this Nov 21, 2023
@jajj50386

jajj50386 commented Nov 21, 2023

@jpeig I have the same problem: when sending, for example, 10 requests concurrently, vLLM waits around 10 seconds before it starts generating output for each request. If I send a new request in the middle of generation, all requests that are generating output stop until the new request is handled.

@simon-mo I have also used --engine-use-ray, with no change (api_server with streaming).

@jpeig
Author

jpeig commented Nov 22, 2023

Yes that's the same behavior. I am using the OpenAI server. What about you? @jajj50386

@jajj50386

jajj50386 commented Nov 22, 2023

Yes that's the same behavior. I am using the OpenAI server. What about you? @jajj50386

@jpeig I am using api_server
here

@tom-doerr

tom-doerr commented Nov 23, 2023

Same issue here, I'm using the OpenAI API.
Here's how I started the server:

python3 -m vllm.entrypoints.openai.api_server --model TheBloke/Xwin-LM-70B-V0.1-AWQ --quantization awq --dtype half --tensor-parallel-size 2 --port 8427 --gpu-memory-utilization 0.6 --engine-use-ray

Version 0.2.2

$ pip freeze | grep vllm
vllm==0.2.2

That's pretty disappointing; I just spent a few hours rewriting my code to send the requests in parallel and there is no speedup.

The output that I get when I start the server:

./start_llm_server.sh
INFO 11-23 01:31:19 api_server.py:638] args: Namespace(host=None, port=8427, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name=None, model='TheBloke/Xwin-LM-70B-V0.1-AWQ', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='half', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=2, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.6, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization='awq', engine_use_ray=True, disable_log_requests=False, max_log_len=None)
WARNING 11-23 01:31:19 config.py:140] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2023-11-23 01:31:22,698 INFO worker.py:1633 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
(_AsyncLLMEngine pid=1597004) INFO 11-23 01:31:26 llm_engine.py:72] Initializing an LLM engine with config: model='TheBloke/Xwin-LM-70B-V0.1-AWQ', tokenizer='TheBloke/Xwin-LM-70B-V0.1-AWQ', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=awq, seed=0)
(_AsyncLLMEngine pid=1597004) Using blocking ray.get inside async actor. This blocks the event loop. Please use `await` on object ref with asyncio.gather if you want to yield execution to the event loop instead.
(_AsyncLLMEngine pid=1597004) INFO 11-23 01:31:46 llm_engine.py:207] # GPU blocks: 1770, # CPU blocks: 1638
INFO:     Started server process [1592084]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8427 (Press CTRL+C to quit)
INFO:     127.0.0.1:35786 - "GET /v1/models HTTP/1.1" 200 OK

@jpeig
Author

jpeig commented Nov 23, 2023

Speaking of disappointing, I also rewrote my application to support vLLM and concurrent requests, as opposed to using exllama + my own API (without vLLM).

But @simon-mo is working on it.
And I noticed quite a lot of people complaining about delayed responses.

@simon-mo
Collaborator

Sorry about the issue; we are treating it with high priority. We are in the process of reproducing the bug in different kinds of settings. As posted before, our original online tests have demonstrated full saturation with batching behavior.

vLLM is designed for high-throughput use in both online and offline scenarios.

@tom-doerr

tom-doerr commented Nov 23, 2023

@simon-mo Thank you! I really like all other aspects of vLLM so far. If you need help reproducing it, I'm happy to help. I attached the versions of the packages in my Python env in case that helps:
python_env.txt

Some more output from the server:

(_AsyncLLMEngine pid=804494) INFO 11-23 22:35:55 llm_engine.py:624] Avg prompt throughput: 555.5 tokens/s, Avg generation throughput: 0.2 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 4.1%, CPU KV cache usage: 0.0%

@yungangwu

Sorry about the issue; we are treating it with high priority. We are in the process of reproducing the bug in different kinds of settings. As posted before, our original online tests have demonstrated full saturation with batching behavior.

vLLM is designed for high-throughput use in both online and offline scenarios.

When vLLM is running in API mode, I tried making concurrent streaming calls, but some of the requests sent concurrently would wait a considerable amount of time before receiving results. I wanted to achieve a batch-processing-like effect, where 4-8 concurrently received requests could be processed together without significant delays between them.

What I did was batch the received API requests and then concurrently launch batch-size AsyncLLMEngine inferences for a batch of data. From the actual results, this approach does receive replies faster for all calls.

However, I am not sure if this approach actually helps with the inference speed or if it is better to use the native API call directly.
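The commenter did not share code; purely for illustration, here is a rough sketch of that idea against vLLM's AsyncLLMEngine (API names as of the vLLM version discussed in this thread; the model path and sampling settings are placeholders):

import asyncio
import uuid

from vllm import SamplingParams
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

engine = AsyncLLMEngine.from_engine_args(AsyncEngineArgs(model="mistralai/Mistral-7B-v0.1"))

async def generate_one(prompt: str) -> str:
    params = SamplingParams(max_tokens=256, temperature=0.7)
    final_output = None
    # engine.generate yields incremental RequestOutput objects; keep the last one.
    async for output in engine.generate(prompt, params, request_id=str(uuid.uuid4())):
        final_output = output
    return final_output.outputs[0].text

async def generate_batch(prompts: list[str]) -> list[str]:
    # Launch all generations at once so the engine can batch their prefills/decodes.
    return await asyncio.gather(*(generate_one(p) for p in prompts))

answers = asyncio.run(generate_batch(["Prompt one", "Prompt two", "Prompt three"]))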

@tom-doerr

Any idea how long it might take to fix this or if there is a chance we can fix it ourselves?

@simon-mo
Collaborator

simon-mo commented Nov 26, 2023 via email

@tom-doerr

Generating multiple completions in parallel also only works efficiently if there are no other requests. With other requests the completion time goes from ~10 seconds to ~120 seconds for n=30.

@jpeig
Author

jpeig commented Nov 30, 2023

@tom-doerr

Exactly! When I force add a new request by bypassing the API, I noticed that it works efficiently as well. That's why I initially assumed the default approach is to batch prompts in a single request (which wasn't supported).

@simon-mo

This insight may help resolve the issue.

@simon-mo
Collaborator

simon-mo commented Dec 1, 2023

OK, I spent some time going down different rabbit holes. The conclusion is as follows: you are seeing undesirable performance because of vLLM's under-optimized support for AWQ models at the moment. I would recommend using the non-quantized version (and a smaller model if the size doesn't fit) for now: not only will you get better accuracy, you will also get better performance. AWQ still works for low-throughput use cases, delivering lower latency and memory savings.

You should also see this warning in the output; what you are observing is the effect of it:

WARNING 12-01 08:25:34 config.py:140] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.

Currently, vLLM allows prompt processing ("prefill") to "skip the line" of decoding cycles, so that we can further saturate GPU utilization during the later decoding stages by bringing in new requests. However, due to the poor performance of AWQ, prefill processing is very slow, which further slows things down and makes it look as if batching is not in effect and decoding is not happening in parallel. You can learn more about how vLLM's current approach compares in Microsoft DeepSpeed's post.

See more detail here: #1032 (comment), quoting @WoosukKwon

Throughput of FP16 LLaMA-7B:
Throughput: 6.28 requests/s, 3003.81 tokens/s
Throughput of AWQ LLaMA-7B (casperhansen/vicuna-7b-v1.5-awq):
Throughput: 4.23 requests/s, 2022.94 tokens/s
Probably because the AWQ kernel is not well optimized (e.g., we use a single kernel for different shapes and hardware), the throughput decreases rather than increases.

The root cause is that vLLM doesn't have well-tuned AWQ CUDA kernels for different shapes and hardware. We are planning to experiment with Triton compilation for better kernels. The original kernels we adapted from the AWQ repo are optimized for resource-constrained hardware like the NVIDIA Jetson Orin.

We will fix this, as quantization will be a vital part of the LLM inference stack. However, creating optimized kernels for different hardware configurations is non-trivial.

To address this, I'm updating the docs in #1883. I'm also starting to see whether bringing in a newer version of the AWQ kernel will yield higher performance (#1882). Lastly, the scheduling algorithm that lets prefill skip the line is not always the best approach, especially in the case of long prompts. We are working on getting a version of chunked prefill into vLLM as well.

Finally, I want to thank you for your patience and support of vLLM as we work through performance issues and bugs.

@simon-mo simon-mo changed the title Batching is not supported through the API AWQ models cause slowdown and effectively not batched Dec 1, 2023
@jpeig
Author

jpeig commented Dec 1, 2023

@simon-mo

Thank you for your response, but AWQ does not appear to be the issue.

I tested 15 prompts without AWQ quantization, and I still get 0.5-1 second between handling each request.
Only after all the requests have been received does processing start.

I can 'fix' the issue by not using the API and directly adding the requests - as @tom-doerr has said.

With a batch of 15 prompts, I experience a slowdown of roughly 10 seconds because of this.

So this is not an AWQ issue but an API / request handling issue.

@jajj50386

Sorry about the issue; we are treating it with high priority. We are in the process of reproducing the bug in different kinds of settings. As posted before, our original online tests have demonstrated full saturation with batching behavior.
vLLM is designed for high-throughput use in both online and offline scenarios.

When vLLM is running in API mode, I tried making concurrent streaming calls, but some of the requests sent concurrently would wait a considerable amount of time before receiving results. I wanted to achieve a batch-processing-like effect, where 4-8 concurrently received requests could be processed together without significant delays between them.

What I did was batch the received API requests and then concurrently launch batch-size AsyncLLMEngine inferences for a batch of data. From the actual results, this approach does receive replies faster for all calls.

However, I am not sure if this approach actually helps with the inference speed or if it is better to use the native API call directly.

@yungangwu could you share the code?

@jpeig jpeig changed the title AWQ models cause slowdown and effectively not batched API causes slowdown in batch request handling Dec 1, 2023
@simon-mo
Collaborator

simon-mo commented Dec 1, 2023

I tested 15 prompts without AWQ quantization, and I still get 0.5-1 second between handling each request.
Only after all the requests have been received does processing start.

Can you share the following so I can reproduce this? I have been working with the assumption of AWQ models vs. regular llama2-7b-chat as the comparison points.

  • Model
  • Hardware
  • Length of the prompt, and length decode (output length)
  • The number of GPU blocks available for KV Cache: " INFO 11-23 01:31:46 llm_engine.py:207] # GPU blocks: 1770, # CPU blocks: 1638"
  • Your request load: how far apart are the requests? Is there a load generator that you are using?
  • Any more detailed logs would be great

In the current state of the batching algorithm, in the absence of a bug, the 0.5-1 second might be the time it takes to perform the prefill for one request. This is roughly the time it takes to process 1000-2000 tokens, depending on your hardware. In more detail, the algorithm is (as mentioned before, pending improvements):

prefill_waiting_queue = [...] # new requests are added to this queue from another thread
decode_waiting_queue = [...] # after prefill, requests are put here

while True:
    if prefill_waiting_queue is not empty and (memory available*): 
        run_prompt_prefill_processing # of the _current batch_, this is not one by one
        add_request_to_decode_queue
    if decode_waiting_queue is not empty:
        run_decode_generate # of the current batch
        add_request_back_to_queue (ordered by arrival time for fairness)
        
*: this is a simplification, but shouldn't affect this case

What you mentioned could be the following case:

0s: model loads
1.0s: first request arrives
1.0-2.0s: first prompt processing
1.2s: second request arrives
2.0s-3.0s: second prompt processing, because the first prompt already finished processing
3.0s-3.2s: both first and second requests start decodes
3.2s: third request arrives
3.2-4.2: third prompt processing
4.2-4.4: all three requests continue decodes
4.4-5.4: fourth request arrives
5.4-6.4: fourth prompt processing
6.4-..: continue decodes...

You can test whether this is the case by checking (1) the average generation throughput in the log, and (2) for a single request, the time to first token generated (by setting max_output_tokens=1).

The reason manually adding them works is that the requests are guaranteed to be prefilled together instead of spaced apart in time. When they are prefilled together, the latency of the entire batch adds only a small overhead compared to the latency of a single request.

The final solution to this will be for vLLM to implement chunked prefill. But I think there might be a way to encourage a batch of prefill requests in the AsyncLLMEngine, let me see...
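For check (2) above, a rough way to measure time to first token is to time a single request capped at one output token (an illustrative sketch, not from the original comment, against the OpenAI-compatible server; in that API the cap is max_tokens):

import time
import openai

openai.api_base = "http://localhost:8000/v1"  # assumed server address
openai.api_key = "EMPTY"

start = time.perf_counter()
openai.Completion.create(
    model=openai.Model.list()["data"][0]["id"],
    prompt="A prompt of roughly the same length as your real workload...",
    max_tokens=1,  # stop after the first generated token
)
print(f"time to first token ~ {time.perf_counter() - start:.2f}s")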

@casper-hansen
Contributor

casper-hansen commented Dec 1, 2023

I have highlighted below the main problem that I see. When you stop decoding because a new request arrives, it can result in a slowdown, especially if you are working with large prompts. I think this is what Dynamic SplitFuse from MII was supposed to address - essentially splitting a large prompt into multiple pieces to process them faster.

0s: model loads
1.0s: first request arrives
1.0-2.0s: first prompt processing
1.2s: second request arrives
2.0s-3.0s: second prompt processing, because the first prompt already finished processing
3.0s-3.2s: both first and second requests start decodes
3.2s: third request arrives
3.2-4.2: third prompt processing (PROBLEM: stops decoding)
PROBLEM ---> 4.2-4.4: all three requests continue decodes

@jpeig
Author

jpeig commented Dec 4, 2023

@simon-mo

Model: I'm using a lot of different models, mostly Mistral-based. Lately I have been using OpenHermes-2.5, both with and without AWQ.

Hardware: 3090RTX, AMD Ryzen CPU

Length of the prompt: around 1000 tokens

Length decode (output length): around 3000 tokens

Blocks available: INFO 12-04 11:03:49 llm_engine.py:219] # GPU blocks: 3092, # CPU blocks: 2048

Request:

import asyncio
import openai

async def _batch_api(prompts, schemas, temperature, frequency_penalty):
    tasks = [concept_generator(prompt, schema, temperature, frequency_penalty)
             for prompt, schema in zip(prompts, schemas)]
    results = await asyncio.gather(*tasks)
    return results

async def concept_generator(prompt, schema, temperature, frequency_penalty, emit_progress_max=None, category=0):
    stream = False
    result = await openai.Completion.acreate(
        model=openai.Model.list()["data"][0]["id"],
        prompt=prompt,
        max_tokens=3000,
        temperature=temperature,
        frequency_penalty=frequency_penalty,
        stream=stream,
        jsonparser=schema)  # non-standard parameter, used for the LM Format Enforcer integration mentioned below
    return result

Payload:

python3 -m vllm.entrypoints.openai.api_server --model /mnt/c/AI/text-generation-webui/models/mlabonne_NeuralHermes-2.5-Mistral-7B --port 9999 --dtype auto --trust-remote-code --host 127.0.0.1 --max-model-len 16384

INFO 12-04 10:59:51 api_server.py:705] args: Namespace(host='127.0.0.1', port=9999, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], served_model_name=None, model='/mnt/c/AI/text-generation-webui/models/mlabonne_NeuralHermes-2.5-Mistral-7B', tokenizer=None, revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, download_dir=None, load_format='auto', dtype='auto', max_model_len=16384, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, engine_use_ray=False, disable_log_requests=False, max_log_len=None)

I am using LM format enforcer to enforce the output to proper JSON. This should not affect the handling of the requests.

INFO 12-01 17:14:35 llm_engine.py:636] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 6.9%, CPU KV cache usage: 0.0%
INFO 12-01 17:14:40 llm_engine.py:636] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 43.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 7.3%, CPU KV cache usage: 0.0%
INFO 12-01 17:14:45 llm_engine.py:636] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 43.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 7.7%, CPU KV cache usage: 0.0%
INFO 12-01 17:14:50 llm_engine.py:636] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 43.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 8.2%, CPU KV cache usage: 0.0%
INFO 12-01 17:14:55 llm_engine.py:636] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 43.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 8.3%, CPU KV cache usage: 0.0%
INFO 12-01 17:15:00 llm_engine.py:636] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 43.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 8.3%, CPU KV cache usage: 0.0%
INFO 12-01 17:15:05 llm_engine.py:636] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 43.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 8.3%, CPU KV cache usage: 0.0%
INFO 12-01 17:15:10 llm_engine.py:636] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 43.8 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 8.3%, CPU KV cache usage: 0.0%

I can literally hear my GPU not doing anything for about 10 seconds as requests are first handled sequentially.

@simon-mo
Collaborator

simon-mo commented Dec 6, 2023

@jpeig, the LM format enforcer bit is a good hint. Given the low generation throughput, I'm suspecting this performance bug, which they just fixed recently:
noamgat/lm-format-enforcer#28 (comment)

@tom-doerr

Could the format enforcer slow down all requests, or just those where a format is used?

@simon-mo
Collaborator

simon-mo commented Dec 6, 2023

Currently, format enforcer usage is per sequence (using vLLM's logits_processors API), so I believe you can turn it on and off depending on your workload.
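For reference (not from the original comment), a minimal sketch of what a per-sequence logits processor looks like through SamplingParams; the toy processor below just bans one token id, standing in for what lm-format-enforcer does per request (model name is a placeholder):

from vllm import LLM, SamplingParams

def ban_token_42(token_ids, logits):
    # Called at each decoding step, only for requests that carry this processor.
    logits[42] = float("-inf")
    return logits

llm = LLM(model="mistralai/Mistral-7B-v0.1")  # placeholder model

# Only this request pays the cost of the processor; requests that omit
# logits_processors are sampled normally.
params = SamplingParams(max_tokens=64, logits_processors=[ban_token_42])
outputs = llm.generate(["Write a haiku about GPUs."], params)
print(outputs[0].outputs[0].text)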

@tom-doerr

Unrelated to the issue, but it would be great to get parser support over the API.

@nlpkiddo-2001

Hi, I am new to vLLM. I need to make batch calls in vLLM:
prompts = ["Once upon a time .."] * 10

Does vLLM have native support for this? And if so, is it as good an approach as sending individual requests concurrently?
What would be the tradeoff here?

Thanks in advance.
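(For the offline case in this question, the LLM class accepts a list of prompts and batches them inside the engine; an illustrative sketch, with a placeholder model name:)

from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-v0.1")  # placeholder model
prompts = ["Once upon a time .."] * 10

# A single generate() call batches all prompts inside the engine;
# this is the offline equivalent of sending concurrent API requests.
outputs = llm.generate(prompts, SamplingParams(max_tokens=128))
for out in outputs:
    print(out.outputs[0].text)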

@REIGN12

REIGN12 commented Mar 6, 2024

So, are there any updates on this issue?

@StephaneBereux

Hi!
Thank you for what you're building at vLLM!
I have the same issue - did you manage to get it fixed?
Do you know if the issue is fixed using another quantization method?

@gwuhaolin

gwuhaolin commented Jul 4, 2024

@simon-mo

Thank you for your response, but AWQ does not appear to be the issue.

I tested 15 prompts without AWQ quantization, and I still get 0.5-1 second between handling each request. Only after all the requests have been received does processing start.

I can 'fix' the issue by not using the API and directly adding the requests - as @tom-doerr has said.

With a batch of 15 prompts, I experience a slowdown of roughly 10 seconds because of this.

So this is not an AWQ issue but an API / request handling issue.

Same with Qwen2-7B-Instruct-AWQ on 1x RTX 2080 Ti 22G.

Offline mode is 3x faster than vllm.entrypoints.openai.api_server with the same config:

offline mode log:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

model = '/data/models/Qwen2-7B-Instruct-AWQ'
tokenizer = AutoTokenizer.from_pretrained(model, trust_remote_code=True)
llm = LLM(model=model, trust_remote_code=True, tensor_parallel_size=1)
messages = [
        {"role": "system", "content": role},  # role and q are defined elsewhere by the poster
        {"role": "user", "content": q}
]
text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
)
outputs = llm.generate([text], SamplingParams(temperature=0.7, top_p=0.8, repetition_penalty=1.05, max_tokens=128 * 1024))


INFO 07-04 17:52:11 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/data/models/Qwen2-7B-Instruct-AWQ', speculative_config=None, tokenizer='/data/models/Qwen2-7B-Instruct-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/data/models/Qwen2-7B-Instruct-AWQ)
Processed prompts: 100%|██████████| 1/1 [00:02<00:00,  2.77s/it, est. speed input: 760.04 toks/s, output: 52.30 toks/s]

offline mode GPU use: (screenshot)

api_server mode log:

python -m vllm.entrypoints.openai.api_server --model /data/models/Qwen2-7B-Instruct-AWQ --tensor-parallel-size 1  --trust-remote-code

INFO 07-04 17:55:27 llm_engine.py:161] Initializing an LLM engine (v0.5.0.post1) with config: model='/data/models/Qwen2-7B-Instruct-AWQ', speculative_config=None, tokenizer='/data/models/Qwen2-7B-Instruct-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=/data/models/Qwen2-7B-Instruct-AWQ)

INFO 07-04 17:57:51 metrics.py:341] Avg prompt throughput: 17.6 tokens/s, Avg generation throughput: 9.7 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 07-04 17:57:56 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 11.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 07-04 17:58:01 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 8.9 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.
INFO 07-04 17:58:06 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 12.0 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.2%, CPU KV cache usage: 0.0%.

api_server mode GPU use: (screenshot)

@bluenevus

+1

@AmericanPresidentJimmyCarter

I have this same issue, but it is only on one of my servers. The other is fine.

On the bad server it drops off like this periodically:

INFO 09-28 05:36:21 metrics.py:351] Avg prompt throughput: 148.3 tokens/s, Avg generation throughput: 99.9 tokens/s, Running: 3 reqs, Swapped: 0 reqs, Pending: 2 reqs, GPU KV cache usage: 95.8%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.1:37936 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:37940 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:37862 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 09-28 05:36:26 metrics.py:351] Avg prompt throughput: 131.5 tokens/s, Avg generation throughput: 94.4 tokens/s, Running: 2 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 73.4%, CPU KV cache usage: 0.0%.
INFO:     127.0.0.1:37908 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO:     127.0.0.1:37920 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 09-28 05:36:38 metrics.py:351] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 15.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 30.2%, CPU KV cache usage: 0.0%.

I did pip freeze on both systems and diff'd, and the only obvious difference I saw was aiohttp==3.9.5 on the slow machine and aiohttp==3.10.4 on the fast machine. Updating the slow machine to 3.10.4 seems to have solved the issue, and the formerly slow machine is spending a lot less time idling now.


This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

@github-actions github-actions bot added the stale Over 90 days of inactivity label Dec 28, 2024
@krisztianboros

+1

@github-actions github-actions bot added unstale Received activity after being labelled stale and removed stale Over 90 days of inactivity labels Jan 15, 2025
@furkancoskun

+1

@robertgshaw2-redhat
Collaborator

We have made a lot of progress on the API server in the past year. Please open a new issue with more specifics if needed.
