Concurrent Request Processing #1462

rangehow · 2024-09-19T04:50:02Z

I sent asynchronous requests to the OpenAI-compatible server hosted by sglang. I limited the concurrency to 1024 using the following decorator.

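import asyncio
import logging
from functools import wraps

logger = logging.getLogger(__name__)

def limit_async_func_call(max_size: int = 1024):
    # At most max_size decorated calls may run concurrently.
    sem = asyncio.Semaphore(max_size)
    active_requests = 0  # calls currently holding the semaphore (client side)

    def final_decro(func):
        @wraps(func)
        async def wait_func(*args, **kwargs):
            nonlocal active_requests
            async with sem:
                active_requests += 1
                logger.info(f"Active requests: {active_requests}")
                try:
                    return await func(*args, **kwargs)
                finally:
                    # Always release the slot in the counter, even on error.
                    active_requests -= 1
        return wait_func
    return final_decro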
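
To check what the counter in this decorator actually measures, here is a minimal, self-contained sketch. It is not the original code: `limit_and_track`, `fake_request`, and the `stats` dictionary are hypothetical names, and `asyncio.sleep` stands in for the real HTTP call. It only shows that the semaphore caps the number of calls in flight on the client side; it says nothing about how the server schedules those requests.

```python
import asyncio
from functools import wraps


def limit_and_track(max_size: int):
    """Cap concurrent calls and record the peak number in flight (illustrative)."""
    stats = {"active": 0, "peak": 0}
    sem = None  # created lazily so it binds to the running event loop

    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            nonlocal sem
            if sem is None:
                sem = asyncio.Semaphore(max_size)
            async with sem:
                stats["active"] += 1
                stats["peak"] = max(stats["peak"], stats["active"])
                try:
                    return await func(*args, **kwargs)
                finally:
                    stats["active"] -= 1

        wrapper.stats = stats  # expose the counters for inspection
        return wrapper

    return decorator


@limit_and_track(max_size=8)
async def fake_request(i: int) -> int:
    # Stand-in for an HTTP call to the server; sleeps instead of doing I/O.
    await asyncio.sleep(0.01)
    return i


async def main() -> list:
    # Launch far more calls than the limit; the semaphore queues the rest.
    return await asyncio.gather(*(fake_request(i) for i in range(50)))


results = asyncio.run(main())
print("peak in flight:", fake_request.stats["peak"])
```

With 50 calls and a limit of 8, the recorded peak equals the limit, which matches the client log line showing 1024 active requests when the limit is 1024.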

Client output:

INFO:Active requests: 1024

But I observed the following output from the server:

[12:44:48 DP1 TP0] Decode batch. #running-req: 93, #token: 334480, token usage: 0.36, gen throughput (token/s): 1738.60, #queue-req: 0
[12:44:49 DP0 TP0] Decode batch. #running-req: 108, #token: 390950, token usage: 0.42, gen throughput (token/s): 1764.95, #queue-req: 0

Why doesn't the number of running requests reach 1024, and why aren't the additional requests in the request queue?

rangehow · 2024-09-19T06:46:59Z

I would also like to know whether sglang could provide a simple server management UI that shows real-time load and queue data for prefill and decode. Alternatively, a programmatic interface exposing the same data would also work.
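
As a stopgap until something like that exists, the decode log lines quoted in the first post can be scraped directly. The sketch below is only an illustration: the regular expression is inferred from those two log lines and may not match other sglang versions, `latest_per_dp` is a hypothetical helper, and reading `DP0`/`DP1` as separate data-parallel workers is an assumption.

```python
import re
from typing import Dict, Iterable

# The two server log lines from the first post, used as sample input.
LOG_LINES = [
    "[12:44:48 DP1 TP0] Decode batch. #running-req: 93, #token: 334480, "
    "token usage: 0.36, gen throughput (token/s): 1738.60, #queue-req: 0",
    "[12:44:49 DP0 TP0] Decode batch. #running-req: 108, #token: 390950, "
    "token usage: 0.42, gen throughput (token/s): 1764.95, #queue-req: 0",
]

# Pattern inferred from the sample lines only (assumption, not a documented format).
PATTERN = re.compile(
    r"(?P<time>\d{2}:\d{2}:\d{2}) DP(?P<dp>\d+) TP(?P<tp>\d+)\] Decode batch\. "
    r"#running-req: (?P<running>\d+),.*?#queue-req: (?P<queued>\d+)"
)


def latest_per_dp(lines: Iterable[str]) -> Dict[int, Dict[str, int]]:
    """Keep the most recent running/queued counts seen for each DP index."""
    latest: Dict[int, Dict[str, int]] = {}
    for line in lines:
        m = PATTERN.search(line)
        if m is None:
            continue  # skip lines that are not decode-batch reports
        latest[int(m["dp"])] = {
            "running": int(m["running"]),
            "queued": int(m["queued"]),
        }
    return latest


snapshot = latest_per_dp(LOG_LINES)
total_running = sum(v["running"] for v in snapshot.values())
total_queued = sum(v["queued"] for v in snapshot.values())
print(f"running={total_running} queued={total_queued}")
```

On the two sample lines this sums the per-worker counts, giving 201 running requests and 0 queued across the two DP indices shown.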
