Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add metrics to track time compaction jobs are queued #4980

Merged
merged 6 commits into from
Oct 18, 2024

Conversation

cshannon
Copy link
Contributor

@cshannon cshannon commented Oct 13, 2024

This adds 2 new groups of stats to track information about queued compaction jobs. The first stat is a timer that keeps track of when jobs are being polled and give information on how often/fast jobs are exiting the queue. The second group of stats is a min/max/avg and is tracking age information about how long jobs are waiting on the queue.

This closes #4945

This is a draft for now because there is a couple outstanding things to do and it probably still needs a little bit of polishing but I wanted to post what I had to get feedback. The tests could also be improved a bit I think. I was trying to think of a good way to check that the stats are correct but the values are going to be non-deterministic as it's all timing in based so it would be hard to check exact values.

Todo/questions:

  1. Should we record a time of 0 if there's a job immediately available in poll()?
  2. How should we handle async? I'm not sure the metrics really even apply to async as we are trying to track how long jobs are waiting on a queue before they are polled to check latency. However, if we were using async, then in theory the latency is near 0 as if there are compactors that are ready they don't need to poll and they'd just be waiting for the future to complete. Maybe we just record a time of 0 for async when a future is completed?
  3. What do we do with the timer on clear(), there is a TODO in the code.

This adds 2 new groups of stats to track information about queued
compaction jobs. The first stat is a timer that keeps track of when jobs
are being polled and give information on how often/fast jobs are
exiting the queue. The second group of stats is a min/max/avg
and is tracking age information about how long jobs are waiting
on the queue.

This closes apache#4945
Copy link
Contributor

@keith-turner keith-turner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we record a time of 0 if there's a job immediately available in both poll()?

I do not understand this question.

Maybe we just record a time of 0 for async when a future is completed?

I think that would be a good thing to do when future is successfully completed. That will cause the avg and count associated with the timer to reflect what is actually happening.

@cshannon
Copy link
Contributor Author

Should we record a time of 0 if there's a job immediately available in both poll()?

I do not understand this question.

Oops, I think that was supposed to say "Should we record a time of 0 if there's a job immediately in poll()" but that doesn't make sense either because we are using the jobAges map to record the delay so you can just ignore that question.

@keith-turner
Copy link
Contributor

I was trying to think of a good way to check that the stats are correct but the values are going to be non-deterministic as it's all timing in based so it would be hard to check exact values.

The best way I know to test these types of things is to add the ability to inject a time source. That is somewhat doable in unit test, but not realistically achievable in the ITs.

@cshannon cshannon marked this pull request as ready for review October 18, 2024 12:54
@cshannon
Copy link
Contributor Author

@keith-turner - I made a few changes and I think this is ready for review again:

  1. I updated async to record a time of 0 when jobs are immediately available.
  2. I realized we needed to update the removeOlderGenerations() method to clear out jobAges for any extents removed to prevent a memory leak for tablets that go away or don't need compactions.
  3. I updated the tests to check jobAges map contains the expected extents and is cleared when it should

@cshannon
Copy link
Contributor Author

One thing to note as I was working on this was I started to wonder if it made sense to get rid of the jobAges map and just move the Timer into the TabletJobs object as it's always a 1 to 1 relationship that we have to keep in sync. This would make managing it easier, but the primary downside is being unable to iterate over the timers to collect the current stats without locking. Right now with the separate map it's a concurrent hashmap so we can iterate and collect data as a snapshot in the memoized field and I think that makes it worthwhile to keep separate, even if it's more work to make things correct.

@cshannon cshannon merged commit ab5c57e into apache:main Oct 18, 2024
4 checks passed
@cshannon cshannon deleted the accumulo-4945 branch October 18, 2024 20:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add metrics to track time compaction jobs are queued
2 participants