
Add Production Metrics in Prometheus format #1890

Merged · 8 commits · Dec 3, 2023

Conversation

@simon-mo (Collaborator) commented Dec 2, 2023

This one builds on #1662. It adds aioprometheus as a dependency, which is very lightweight. It exposes the metrics as we perform the regular logging pass in the engine step. The memory usage is very small and constant (there is no history, only the current state). The metrics are designed to be scraped and stored by an external service.

We currently just add metrics in the engine step. Follow-up: #1870
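
For context, a minimal sketch of how a gauge can be updated from that logging pass with aioprometheus (the names here are illustrative only; the actual metric definitions live in vllm/engine/metrics.py in this PR):

from aioprometheus import Gauge

# Illustrative gauge mirroring the vllm:* metrics shown below.
gauge_gpu_cache_usage = Gauge(
    "vllm:gpu_cache_usage_perc",
    "GPU KV-cache usage. 1 means 100 percent usage.")

def record_metrics(model_name: str, gpu_cache_usage: float) -> None:
    # Called from the regular logging pass in the engine step.
    # A gauge only stores the current value, so memory stays constant.
    gauge_gpu_cache_usage.set({"model_name": model_name}, gpu_cache_usage)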

Here's an example of the metrics endpoint output.

# HELP exceptions_total_counter Total number of requested which generated an exception
# TYPE exceptions_total_counter counter
# HELP requests_total_counter Total number of requests received
# TYPE requests_total_counter counter
requests_total_counter{method="POST",path="/v1/chat/completions"} 1
# HELP responses_total_counter Total number of responses sent
# TYPE responses_total_counter counter
responses_total_counter{method="POST",path="/v1/chat/completions"} 1
# HELP status_codes_counter Total number of response status codes
# TYPE status_codes_counter counter
status_codes_counter{method="POST",path="/v1/chat/completions",status_code="200"} 1
# HELP vllm:avg_generation_throughput_toks_per_s Average generation throughput in tokens/s.
# TYPE vllm:avg_generation_throughput_toks_per_s gauge
vllm:avg_generation_throughput_toks_per_s{model_name="facebook/opt-125m"} 165.37719973407303
# HELP vllm:avg_prompt_throughput_toks_per_s Average prefill throughput in tokens/s.
# TYPE vllm:avg_prompt_throughput_toks_per_s gauge
vllm:avg_prompt_throughput_toks_per_s{model_name="facebook/opt-125m"} 0.0
# HELP vllm:cpu_cache_usage_perc CPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:cpu_cache_usage_perc gauge
vllm:cpu_cache_usage_perc{model_name="facebook/opt-125m"} 0.0
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="facebook/opt-125m"} 0.002364273204903678
# HELP vllm:num_requests_running Number of requests that is currently running for inference.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="facebook/opt-125m"} 1
# HELP vllm:num_requests_swapped Number requests swapped to CPU.
# TYPE vllm:num_requests_swapped gauge
vllm:num_requests_swapped{model_name="facebook/opt-125m"} 0
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="facebook/opt-125m"} 0

@ichernev (Contributor) commented Dec 2, 2023

That looks great!

Sorry for the delay

@Yard1 (Collaborator) left a comment

Looks good to me!

@WoosukKwon (Collaborator) left a comment

@simon-mo Thanks for the awesome work! Left some minor comments.

Review threads (outdated, resolved) on vllm/engine/llm_engine.py, vllm/entrypoints/openai/api_server.py, and vllm/engine/metrics.py
@simon-mo (Collaborator, Author) commented Dec 2, 2023

@WoosukKwon updated!

@WoosukKwon (Collaborator) commented:

@simon-mo A dumb question: can you provide an example script I can use to test this PR?

@simon-mo (Collaborator, Author) commented Dec 2, 2023

python -m vllm.entrypoints.openai.api_server
curl http://localhost:8000/metrics
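
For a slightly fuller end-to-end check, a rough Python equivalent (a sketch only; it assumes the server above is running on the default port, the requests package is installed, and facebook/opt-125m is the served model):

import requests

# Generate some traffic so the counters and vllm:* gauges have data.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": "facebook/opt-125m", "prompt": "Hello", "max_tokens": 16},
)
resp.raise_for_status()

# /metrics returns plain-text Prometheus exposition format.
metrics = requests.get("http://localhost:8000/metrics").text
print("\n".join(line for line in metrics.splitlines() if line.startswith("vllm:")))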

@WoosukKwon (Collaborator) commented Dec 2, 2023

I see. I was testing http://localhost:8000/v1/metrics 😓 Do you think it's better to put it at http://localhost:8000/metrics (without v1)?

@Yard1 (Collaborator) commented Dec 2, 2023

It's probably not something you want to expose to users, nor is it part of the OpenAI API spec. It's better to leave it at /metrics IMO.

@WoosukKwon (Collaborator) left a comment

LGTM.

@WoosukKwon (Collaborator) commented:

@simon-mo WDYT?

> I see. I was testing http://localhost:8000/v1/metrics 😓 Do you think it's better to put it at http://localhost:8000/metrics (without v1)?

@simon-mo (Collaborator, Author) commented Dec 3, 2023

/v1/* should be the OpenAI-compliant API, so /metrics is better because it doesn't collide with the OpenAI namespace.

@simon-mo merged commit 5313c2c into vllm-project:main on Dec 3, 2023
2 checks passed
xjpang pushed a commit to xjpang/vllm that referenced this pull request Dec 4, 2023
@beginlner (Contributor) commented Dec 30, 2023

Hi @simon-mo, it seems the metrics are not working properly when engine_use_ray=True.
Here is my metrics endpoint output with engine_use_ray=True.

# HELP exceptions_total_counter Total number of requested which generated an exception
# TYPE exceptions_total_counter counter
# HELP requests_total_counter Total number of requests received
# TYPE requests_total_counter counter
requests_total_counter{method="POST",path="/v1/chat/completions"} 548
requests_total_counter{method="GET",path="/v1/metrics"} 2
requests_total_counter{method="GET",path="/metric"} 2
requests_total_counter{method="GET",path="/v1"} 1
requests_total_counter{method="GET",path="/v1/completions"} 2
# HELP responses_total_counter Total number of responses sent
# TYPE responses_total_counter counter
responses_total_counter{method="POST",path="/v1/chat/completions"} 548
responses_total_counter{method="GET",path="/v1/metrics"} 2
responses_total_counter{method="GET",path="/metric"} 2
responses_total_counter{method="GET",path="/v1"} 1
responses_total_counter{method="GET",path="/v1/completions"} 2
# HELP status_codes_counter Total number of response status codes
# TYPE status_codes_counter counter
status_codes_counter{method="POST",path="/v1/chat/completions",status_code="200"} 548
status_codes_counter{method="GET",path="/v1/metrics",status_code="404"} 2
status_codes_counter{method="GET",path="/metric",status_code="404"} 2
status_codes_counter{method="GET",path="/v1",status_code="404"} 1
status_codes_counter{method="GET",path="/v1/completions",status_code="405"} 2
# HELP vllm:avg_generation_throughput_toks_per_s Average generation throughput in tokens/s.
# TYPE vllm:avg_generation_throughput_toks_per_s gauge
# HELP vllm:avg_prompt_throughput_toks_per_s Average prefill throughput in tokens/s.
# TYPE vllm:avg_prompt_throughput_toks_per_s gauge
# HELP vllm:cpu_cache_usage_perc CPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:cpu_cache_usage_perc gauge
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
# HELP vllm:num_requests_running Number of requests that is currently running for inference.
# TYPE vllm:num_requests_running gauge
# HELP vllm:num_requests_swapped Number requests swapped to CPU.
# TYPE vllm:num_requests_swapped gauge
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge

@ronensc (Contributor) commented Jan 1, 2024

Thank you for the PR! I noticed that aioprometheus is being used for exposing Prometheus metrics. However, it seems like there are alternative libraries that might offer additional features.

Here are a few considerations:

  • Official Prometheus Python Library: The official library can be found at https://github.com/prometheus/client_python. It includes out-of-the-box support for various metrics, such as GC, process, and platform metrics.

  • Alternative Libraries: Two alternative libraries that build on the official library and provide additional features are:

    1. prometheus-fastapi-instrumentator: Exposes FastAPI metrics.
    2. starlette_exporter: Exposes Starlette and FastAPI metrics.

    Both of these libraries include HTTP-related metrics similar to aioprometheus but also cover latency and size metrics, which are not present in aioprometheus.

  • Popularity and Metrics Coverage: It's worth noting that these alternative libraries appear to be more popular based on GitHub stars. Additionally, they provide a broader range of metrics compared to aioprometheus.

I'm curious to understand the rationale behind choosing aioprometheus over the official library or one of these alternatives. Could you please share the reasons for this decision?
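
For comparison, a rough sketch of what the official-client route could look like when mounted into the FastAPI server (not what this PR implements; the metric name and value below are purely illustrative):

from fastapi import FastAPI
from prometheus_client import Gauge, make_asgi_app

app = FastAPI()

# Collectors register in the official client's global registry;
# make_asgi_app() serves them in Prometheus exposition format.
gpu_cache_usage = Gauge(
    "vllm_gpu_cache_usage_perc",
    "GPU KV-cache usage. 1 means 100 percent usage.",
    labelnames=["model_name"],
)

# Mount the exporter at /metrics, outside the /v1 OpenAI namespace.
app.mount("/metrics", make_asgi_app())

gpu_cache_usage.labels(model_name="facebook/opt-125m").set(0.0)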

@ronensc (Contributor) commented Jan 8, 2024


@ichernev @simon-mo @WoosukKwon @Yard1 Could you please share your thoughts on this matter?

@simon-mo (Collaborator, Author) commented Jan 8, 2024

I don't have a particular preference. It seems aioprometheus gives us both good integration and an all-in-one lightweight package.
