Add prometheus for statistics #90

HKalbasi · 2024-12-19T22:06:47Z

This PR adds some statistics using Prometheus. Prometheus is a time series database that is widely used for collecting metrics.

This is a per-core chart of total bits per second. It was recorded on my local machine with fake traffic, so the values are on the order of Kbit/s.

I put some effort into making the cost of the statistics negligible. I store them in thread-local variables and flush them into the atomic ones occasionally (currently every 1000 idle iterations). To prevent contention, I also gave every stat a per-core parameter, even the stats whose per-core versions don't matter much. To run the Prometheus server, I used the main core. Previously it only did monitoring; now it does monitoring and concurrently runs an HTTP server for Prometheus using a single-threaded tokio runtime.
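For readers who want a concrete picture of the thread-local-then-flush scheme described above, here is a minimal sketch using only the standard library. It is not the code in this PR: the names (`record_packet`, `on_idle`, `TOTAL_BYTES`), the core count, and the single byte counter are assumptions made for illustration. Only the idea of flushing every 1000 idle iterations into a per-core atomic comes from the description.

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical values for this sketch; the real core count comes from the runtime config.
const NUM_CORES: usize = 4;
const FLUSH_EVERY_IDLE: u64 = 1000;

// One shared atomic slot per core, so two cores never write to the same counter.
const ZERO: AtomicU64 = AtomicU64::new(0);
static TOTAL_BYTES: [AtomicU64; NUM_CORES] = [ZERO; NUM_CORES];

thread_local! {
    // Cheap, uncontended counters that live on the packet-processing thread.
    static LOCAL_BYTES: Cell<u64> = Cell::new(0);
    static IDLE_ITERS: Cell<u64> = Cell::new(0);
}

/// Hot path: a plain thread-local add, no atomic operation.
fn record_packet(len: u64) {
    LOCAL_BYTES.with(|b| b.set(b.get() + len));
}

/// Called when a polling iteration received no packets.
/// Every FLUSH_EVERY_IDLE calls, the local count is moved into the shared atomic.
fn on_idle(core: usize) {
    let due = IDLE_ITERS.with(|i| {
        let n = i.get() + 1;
        if n >= FLUSH_EVERY_IDLE {
            i.set(0);
            true
        } else {
            i.set(n);
            false
        }
    });
    if due {
        flush(core);
    }
}

/// Moves the pending thread-local count into this core's atomic slot.
fn flush(core: usize) {
    let pending = LOCAL_BYTES.with(|b| b.replace(0));
    if pending > 0 {
        TOTAL_BYTES[core].fetch_add(pending, Ordering::Relaxed);
    }
}

/// What an exporter thread would read when metrics are scraped.
fn total_bytes(core: usize) -> u64 {
    TOTAL_BYTES[core].load(Ordering::Relaxed)
}
```

In this sketch the exporter only ever sees values that are slightly stale (by at most one flush interval), which is the trade-off that keeps the per-packet cost to a thread-local add.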

HKalbasi · 2025-01-12T20:25:15Z

@thearossman What is your opinion on this?

thearossman · 2025-01-13T00:52:59Z

Hi! There's a lot here, and I'm not familiar with Prometheus, so I'm still parsing. We're having some issues with our live traffic setup right now, but I'll take a closer look and try to run this (hopefully) later this week to check whether it holds up at line rate.

Initial thoughts:

I can see how a queryable time series database for network statistics is necessary as Retina matures. (I also think something like this will be just as, if not even more, useful for user applications to record their data of interest!)
I'm not seeing clear documentation -- could you add docs in stats.rs and anywhere else relevant? Thinking: data collected and what it means, data format, and anything that needs to be set up for monitoring. Would want information to be easily accessible on https://stanford-esrg.github.io/retina/retina_core/
Seems like this should be something that is wrapped in a feature flag?

Going to request that either @tbarbette or @thegwan weigh in here as well, since it's a big addition and I'm not familiar with this framework.

zakird · 2025-01-13T01:32:46Z

I haven't taken a look at the code in this PR, but I'm very supportive of the notion of exposing metrics to Prometheus (which is pretty much the choice for this type of monitoring in 2024). This feels like a very sane way of capturing metrics over time from Retina.

thearossman · 2025-01-14T21:20:10Z

Could you add setup/interpretation instructions for this? Will test for performance on our traffic, but it would be helpful to have easier setup steps!

HKalbasi · 2025-01-15T17:29:02Z

> Could you add setup/interpretation instructions for this? Will test for performance on our traffic, but it would be helpful to have easier setup steps!

I added a simple guide in the documentation of stats module.

> I also think something like this will be just as, if not even more, useful for user applications to record their data of interest!

That's really interesting! Maybe we can expose something so that users also can register their metrics and we export all of them. I need to think about that.

> Seems like this should be something that is wrapped in a feature flag?

I will make it configurable in the toml file. But I'm not opposed to making it configurable at compile time as well.

thearossman · 2025-01-16T01:19:41Z

Confirming I ran the prometheus-websites example and it worked on ~110Gbps of our traffic! I haven't yet gotten the Prometheus server up (I think that's a me problem).

I think it would still be nice to have a compile- or runtime feature flag to enable/disable this export. Could be in a later commit.

Thanks for the additions, and thanks for your patience with the review turnaround over the holidays

HKalbasi · 2025-01-16T22:27:45Z

> I think it would still be nice to have a compile- or runtime feature flag to enable/disable this export. Could be in a later commit.

I added a runtime configuration which disables the exporter server and the atomic counters (thread locals will still get incremented).

> Thanks for the additions, and thanks for your patience with the review turnaround over the holidays

Thanks for maintaining this. I have worked with many open source projects, and Retina is easily among the top 10% in terms of maintenance.

thearossman · 2025-01-21T03:29:28Z

I haven't looked at the actual UI (looks like there's a fw/config issue on our [...]). I did confirm that running the websites-prometheus example with the Prometheus config set up could handle the ~110Gbps that was on our network at the time. I also ran this in offline mode and the data looks as expected.

I feel pretty good about merging this, especially since it can be switched on/off! Will leave it up for another day in case anyone else (@thegwan @tbarbette ?) has any thoughts.

tbarbette · 2025-01-22T08:32:24Z

Good for me! Having compile-time flags for new features is important, or else we'll end up with bloatware like Zeek. And I hear about Prometheus around me, so it's nice, thanks!

thearossman · 2025-01-22T17:07:50Z

@HKalbasi if I merged this, how would you feel about changing the runtime config to a compile-time feature (similar to how the mlx5 feature is implemented) sometime in the future?

Thanks again, excited to have this!

HKalbasi · 2025-01-22T19:55:33Z

I added a compile-time feature for this. I kept the runtime configuration as well, since we needed the port and IP configuration anyway, and disabling the server in their absence is a sane default.
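As an illustration of the "absent config means disabled" default mentioned above, here is a small sketch. The struct and field names (`PrometheusConfig`, `ip`, `port`) and the function `exporter_addr` are hypothetical and are not taken from the PR. The compile-time feature gate is not shown; only the runtime decision is.

```rust
use std::net::{AddrParseError, IpAddr, SocketAddr};

/// Hypothetical runtime settings; the real config keys may differ.
#[derive(Debug, Clone, PartialEq)]
struct PrometheusConfig {
    ip: String,
    port: u16,
}

/// Returns the address the exporter should listen on.
/// `Ok(None)` means no Prometheus section was configured, so the server stays off.
/// An unparsable IP is reported as an error instead of silently disabling the exporter.
fn exporter_addr(
    cfg: Option<&PrometheusConfig>,
) -> Result<Option<SocketAddr>, AddrParseError> {
    match cfg {
        None => Ok(None),
        Some(c) => {
            let ip: IpAddr = c.ip.parse()?;
            Ok(Some(SocketAddr::new(ip, c.port)))
        }
    }
}
```

A caller would start the HTTP server only when this returns `Ok(Some(addr))`, so a config file without the section behaves the same as before the feature existed.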

HKalbasi added 6 commits December 20, 2024 01:10

Add prometheus for statistics

a6faa97

Add dpdk metrics to prometheus stats

94473c4

replace prometheus with prometheus_client

f504335

replace prometheus with prometheus_client

1301738

Add stats for tcp and udp packets

14c88e0

Add stats for number of connections

27ad68e

Add some documentation for the stats module

2f872d4

HKalbasi force-pushed the add-prometheus branch from f4e9c42 to 2f872d4 Compare January 15, 2025 17:25

Export prometheus infra to apps to register custom metrics

3dffbca

Make prometheus configurable

97d3c8a

Make prometheus configurable at compile time

a374c44

thearossman merged commit c3d5bad into stanford-esrg:main Jan 22, 2025
3 checks passed
