Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add prometheus for statistics #90

Merged
merged 10 commits into from
Jan 22, 2025

Conversation

HKalbasi
Copy link
Contributor

@HKalbasi HKalbasi commented Dec 19, 2024

This PR adds some statistics using prometheus. Prometheus is a time series database which is widely used for collecting metrics.

image

This is a per core chart of total bits per second. It is on my local machine with a fake traffic so it is in order of Kbit/s.

I put some effort in making the statistics cost negligible. I store them in thread local variables, and flush them into the atomic ones occasionally (currently every 1000 idle iterations). I also made a parameter for each core even for stats that their per core versions doesn't matter much, in order to prevent contention. To run the prometheus server, I used the main core. Previously it did only monitorings, now it does monitoring and runs a http server for prometheus concurrently using a single threaded tokio runtime.

@HKalbasi
Copy link
Contributor Author

@thearossman What is your opinion on this?

@thearossman
Copy link
Collaborator

Hi! There's a lot here, and I'm not familiar with Prometheus, so I'm still parsing. We're having some issues with our live traffic setup right now, but I'll take a closer look and try to run this (hopefully) later this week to check whether it holds up at line rate.

Initial thoughts:

  • I can see how a queryable time series database for network statistics is necessary as Retina matures. (I also think something like this will be just as, if not even more, useful for user applications to record their data of interest!)
  • I'm not seeing clear documentation -- could you add docs in stats.rs and anywhere else relevant? Thinking: data collected and what it means, data format, and anything that needs to be set up for monitoring. Would want information to be easily accessible on https://stanford-esrg.github.io/retina/retina_core/
  • Seems like this should be something that is wrapped in a feature flag?

Going to request that either @tbarbette or @thegwan weigh in here as well, since it's a big addition and I'm not familiar with this framework.

@zakird
Copy link
Member

zakird commented Jan 13, 2025

I haven't taken a look at the code in this PR, but I'm very supportive of the notion of exposing metrics to Prometheus (which is pretty much the choice for this type of monitoring in 2024). This feels like a very sane way of capturing metrics over time from Retina.

@thearossman
Copy link
Collaborator

Could you add setup/interpretation instructions for this? Will test for performance on our traffic, but it would be helpful to have easier setup steps!

@HKalbasi
Copy link
Contributor Author

Could you add setup/interpretation instructions for this? Will test for performance on our traffic, but it would be helpful to have easier setup steps!

I added a simple guide in the documentation of stats module.

I also think something like this will be just as, if not even more, useful for user applications to record their data of interest!

That's really interesting! Maybe we can expose something so that users also can register their metrics and we export all of them. I need to think about that.

Seems like this should be something that is wrapped in a feature flag?

I will make it configurable in the toml file. But I'm not opposed to make it configurable in compile time as well.

@thearossman
Copy link
Collaborator

thearossman commented Jan 16, 2025

Confirming I ran the prometheus-websites example and it worked on ~110Gbps of our traffic! I haven't yet gotten up the prometheus server (I think that's a me problem).

I think it would still be nice to have a compile- or runtime feature flag to enable/disable this export. Could be in a later commit.

Thanks for the additions, and thanks for your patience with the review turnaround over the holidays

@HKalbasi
Copy link
Contributor Author

I think it would still be nice to have a compile- or runtime feature flag to enable/disable this export. Could be in a later commit.

I added a runtime configuration which disables the exporter server and the atomic counters (thread locals will still get incremented).

Thanks for the additions, and thanks for your patience with the review turnaround over the holidays

Thanks for maintaining this. I worked with many open source projects, and retina is easily among top 10% in term of maintenance.

@thearossman
Copy link
Collaborator

I haven't looked at the actual UI (looks like there's a fw/config issue on our I did confirm that running the websites-prometheus example with the Prometheus config set up could handle the ~110Gbps that was on our network at the time. I also ran this in offline mode and the data looks as expected.

I feel pretty good about merging this, especially since it can be switched on/off! Will leave it up for another day in case anyone else (@thegwan @tbarbette ?) has any thoughts.

@tbarbette
Copy link
Collaborator

Good for me ! Having compile-time flags for new features is important, else we'll end up with a bloatware like Zeek. And I hear about Prometheus around me, so it's nice, thanks !

@thearossman
Copy link
Collaborator

@HKalbasi if I merged this, how would you feel about changing the runtime config to a compile-time feature (similar to how the mlx5 feature is implemented) sometime in the future?

Thanks again, excited to have this!

@HKalbasi
Copy link
Contributor Author

I added a compile time feature for this. I kept the runtime configuration as well since we needed the port and ip configuration anyway and disabling the server in the absence of them is a sane default.

@thearossman thearossman merged commit c3d5bad into stanford-esrg:main Jan 22, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants