
Add Production Metrics in Prometheus format #1890

Merged · 8 commits · Dec 3, 2023

Conversation

@simon-mo (Collaborator) commented Dec 2, 2023

This one builds on #1662. It adds aioprometheus as a dependency, which is very lightweight. It exposes the metrics as we perform the regular logging pass in the engine step. The memory usage is very small and constant (there is no history, only the current state). The metrics are designed to be scraped and stored by an external service.

We currently just add metrics in the engine step. Follow-up: #1870
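
For context, a minimal sketch of how a gauge can be updated from that logging pass with aioprometheus (the names here are illustrative only; the actual metric definitions live in vllm/engine/metrics.py in this PR):

from aioprometheus import Gauge

# Illustrative gauge mirroring the vllm:* metrics shown below.
gauge_gpu_cache_usage = Gauge(
    "vllm:gpu_cache_usage_perc",
    "GPU KV-cache usage. 1 means 100 percent usage.")

def record_metrics(model_name: str, gpu_cache_usage: float) -> None:
    # Called from the regular logging pass in the engine step.
    # A gauge only stores the current value, so memory stays constant.
    gauge_gpu_cache_usage.set({"model_name": model_name}, gpu_cache_usage)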

Here's an example of the metrics endpoint output.

# HELP exceptions_total_counter Total number of requested which generated an exception
# TYPE exceptions_total_counter counter
# HELP requests_total_counter Total number of requests received
# TYPE requests_total_counter counter
requests_total_counter{method="POST",path="/v1/chat/completions"} 1
# HELP responses_total_counter Total number of responses sent
# TYPE responses_total_counter counter
responses_total_counter{method="POST",path="/v1/chat/completions"} 1
# HELP status_codes_counter Total number of response status codes
# TYPE status_codes_counter counter
status_codes_counter{method="POST",path="/v1/chat/completions",status_code="200"} 1
# HELP vllm:avg_generation_throughput_toks_per_s Average generation throughput in tokens/s.
# TYPE vllm:avg_generation_throughput_toks_per_s gauge
vllm:avg_generation_throughput_toks_per_s{model_name="facebook/opt-125m"} 165.37719973407303
# HELP vllm:avg_prompt_throughput_toks_per_s Average prefill throughput in tokens/s.
# TYPE vllm:avg_prompt_throughput_toks_per_s gauge
vllm:avg_prompt_throughput_toks_per_s{model_name="facebook/opt-125m"} 0.0
# HELP vllm:cpu_cache_usage_perc CPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:cpu_cache_usage_perc gauge
vllm:cpu_cache_usage_perc{model_name="facebook/opt-125m"} 0.0
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="facebook/opt-125m"} 0.002364273204903678
# HELP vllm:num_requests_running Number of requests that is currently running for inference.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="facebook/opt-125m"} 1
# HELP vllm:num_requests_swapped Number requests swapped to CPU.
# TYPE vllm:num_requests_swapped gauge
vllm:num_requests_swapped{model_name="facebook/opt-125m"} 0
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="facebook/opt-125m"} 0

@ichernev (Contributor) commented Dec 2, 2023

That looks great!

Sorry for the delay

@Yard1 (Collaborator) left a comment

Looks good to me!

@WoosukKwon (Collaborator) left a comment

@simon-mo Thanks for the awesome work! Left some minor comments.

Review threads (outdated, resolved) on vllm/engine/llm_engine.py, vllm/entrypoints/openai/api_server.py, and vllm/engine/metrics.py
@simon-mo (Collaborator, Author) commented Dec 2, 2023

@WoosukKwon updated!

@WoosukKwon (Collaborator) commented:

@simon-mo A dumb question: can you provide an example script I can use to test this PR?

@simon-mo (Collaborator, Author) commented Dec 2, 2023

python -m vllm.entrypoints.openai.api_server
curl http://localhost:8000/metrics
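
For a slightly fuller end-to-end check, a rough Python equivalent (a sketch only; it assumes the server above is running on the default port, the requests package is installed, and facebook/opt-125m is the served model):

import requests

# Generate some traffic so the counters and vllm:* gauges have data.
resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={"model": "facebook/opt-125m", "prompt": "Hello", "max_tokens": 16},
)
resp.raise_for_status()

# /metrics returns plain-text Prometheus exposition format.
metrics = requests.get("http://localhost:8000/metrics").text
print("\n".join(line for line in metrics.splitlines() if line.startswith("vllm:")))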

@WoosukKwon (Collaborator) commented Dec 2, 2023

I see. I was testing http://localhost:8000/v1/metrics 😓 Do you think it's better to put it at http://localhost:8000/metrics (without v1)?

@Yard1 (Collaborator) commented Dec 2, 2023

It's probably not something you want to expose to users, nor is it part of the OpenAI API spec. It's better to leave it at /metrics IMO.

@WoosukKwon (Collaborator) left a comment

LGTM.

@WoosukKwon (Collaborator) commented:

@simon-mo WDYT?

> I see. I was testing http://localhost:8000/v1/metrics 😓 Do you think it's better to put it at http://localhost:8000/metrics (without v1)?

@simon-mo (Collaborator, Author) commented Dec 3, 2023

/v1/* should be the OpenAI-compliant API, so /metrics is better because it doesn't collide with the OpenAI namespace.

@simon-mo merged commit 5313c2c into vllm-project:main on Dec 3, 2023
2 checks passed
xjpang pushed a commit to xjpang/vllm that referenced this pull request Dec 4, 2023
@beginlner (Contributor) commented Dec 30, 2023

Hi @simon-mo, it seems the metrics are not working properly when engine_use_ray=True.
Here is my metrics endpoint output with engine_use_ray=True.

# HELP exceptions_total_counter Total number of requested which generated an exception
# TYPE exceptions_total_counter counter
# HELP requests_total_counter Total number of requests received
# TYPE requests_total_counter counter
requests_total_counter{method="POST",path="/v1/chat/completions"} 548
requests_total_counter{method="GET",path="/v1/metrics"} 2
requests_total_counter{method="GET",path="/metric"} 2
requests_total_counter{method="GET",path="/v1"} 1
requests_total_counter{method="GET",path="/v1/completions"} 2
# HELP responses_total_counter Total number of responses sent
# TYPE responses_total_counter counter
responses_total_counter{method="POST",path="/v1/chat/completions"} 548
responses_total_counter{method="GET",path="/v1/metrics"} 2
responses_total_counter{method="GET",path="/metric"} 2
responses_total_counter{method="GET",path="/v1"} 1
responses_total_counter{method="GET",path="/v1/completions"} 2
# HELP status_codes_counter Total number of response status codes
# TYPE status_codes_counter counter
status_codes_counter{method="POST",path="/v1/chat/completions",status_code="200"} 548
status_codes_counter{method="GET",path="/v1/metrics",status_code="404"} 2
status_codes_counter{method="GET",path="/metric",status_code="404"} 2
status_codes_counter{method="GET",path="/v1",status_code="404"} 1
status_codes_counter{method="GET",path="/v1/completions",status_code="405"} 2
# HELP vllm:avg_generation_throughput_toks_per_s Average generation throughput in tokens/s.
# TYPE vllm:avg_generation_throughput_toks_per_s gauge
# HELP vllm:avg_prompt_throughput_toks_per_s Average prefill throughput in tokens/s.
# TYPE vllm:avg_prompt_throughput_toks_per_s gauge
# HELP vllm:cpu_cache_usage_perc CPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:cpu_cache_usage_perc gauge
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage. 1 means 100 percent usage.
# TYPE vllm:gpu_cache_usage_perc gauge
# HELP vllm:num_requests_running Number of requests that is currently running for inference.
# TYPE vllm:num_requests_running gauge
# HELP vllm:num_requests_swapped Number requests swapped to CPU.
# TYPE vllm:num_requests_swapped gauge
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge

@ronensc (Contributor) commented Jan 1, 2024

Thank you for the PR! I noticed that aioprometheus is being used for exposing Prometheus metrics. However, it seems like there are alternative libraries that might offer additional features.

Here are a few considerations:

  • Official Prometheus Python Library: The official library can be found at https://github.com/prometheus/client_python. It includes out-of-the-box support for various metrics, such as GC, process, and platform metrics.

  • Alternative Libraries: Two alternative libraries that build on the official library and provide additional features are:

    1. prometheus-fastapi-instrumentator: Exposes FastAPI metrics.
    2. starlette_exporter: Exposes Starlette and FastAPI metrics.

    Both of these libraries include HTTP-related metrics similar to aioprometheus but also cover latency and size metrics, which are not present in aioprometheus.

  • Popularity and Metrics Coverage: It's worth noting that these alternative libraries appear to be more popular based on GitHub stars. Additionally, they provide a broader range of metrics compared to aioprometheus.

I'm curious to understand the rationale behind choosing aioprometheus over the official library or one of these alternatives. Could you please share the reasons for this decision?
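
For comparison, a rough sketch of what the official-client route could look like when mounted into the FastAPI server (not what this PR implements; the metric name and value below are purely illustrative):

from fastapi import FastAPI
from prometheus_client import Gauge, make_asgi_app

app = FastAPI()

# Collectors register in the official client's global registry;
# make_asgi_app() serves them in Prometheus exposition format.
gpu_cache_usage = Gauge(
    "vllm_gpu_cache_usage_perc",
    "GPU KV-cache usage. 1 means 100 percent usage.",
    labelnames=["model_name"],
)

# Mount the exporter at /metrics, outside the /v1 OpenAI namespace.
app.mount("/metrics", make_asgi_app())

gpu_cache_usage.labels(model_name="facebook/opt-125m").set(0.0)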

@ronensc (Contributor) commented Jan 8, 2024


@ichernev @simon-mo @WoosukKwon @Yard1 Could you please share your thoughts on this matter?

@simon-mo (Collaborator, Author) commented Jan 8, 2024

I don't have a particular preference. It seems aioprometheus gives us both good integration and an all-in-one lightweight package.
