Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix several oximeter bugs #7126

Merged
merged 2 commits into from
Nov 21, 2024
Merged

Fix several oximeter bugs #7126

merged 2 commits into from
Nov 21, 2024

Conversation

bnaecker
Copy link
Collaborator

  • Only update producer information if it's actually changed. This fixes sled_data_link no longer returns any metrics #7120, since we no longer indefinitely delay a collection timer just by refreshing our list of producers from Nexus.
  • Bail from a producer refresh on any error contacting Nexus. This should fix oximeter is too eager to delete its entire state #7124. If we get a progenitor-level error refreshing our list from Nexus, which should mostly be a communcation error like ECONNREFUSED, we don't want to blow away our entire state. Instead, break out of the refresh function entirely, and continue collecting from our last known-good set of producers. Only proceed with the update when we are confident we have a full, valid list from Nexus.
  • Change the missed tick behavior for refreshing producer list from "burst" to "skip". This isn't strictly necessary, but better reflects what we want, which is "reasonably often" updates, without causing undue churn.

- Only update producer information if it's actually changed. This
  fixes #7120, since we no longer indefinitely delay a collection timer
  just by refreshing our list of producers from Nexus.
- Bail from a producer refresh on any error contacting Nexus. This
  should fix #7124. If we get a progenitor-level error refreshing our
  list from Nexus, which should mostly be a communcation error like
  `ECONNREFUSED`, we don't want to blow away our entire state. Instead,
  break out of the refresh function entirely, and continue collecting
  from our last known-good set of producers. Only proceed with the
  update when we are confident we have a full, valid list from Nexus.
- Change the missed tick behavior for refreshing producer list from
  "burst" to "skip". This isn't strictly necessary, but better reflects
  what we want, which is "reasonably often" updates, without causing
  undue churn.
@bnaecker bnaecker requested a review from jgallagher November 21, 2024 04:33
@bnaecker bnaecker merged commit 85605b0 into main Nov 21, 2024
16 checks passed
@bnaecker bnaecker deleted the oximeter-fixes branch November 21, 2024 18:43
jgallagher added a commit that referenced this pull request Dec 2, 2024
…ls (#7191)

#7126 introduced a `return;` inside `refresh_producer_list` to avoid
clobbering our producer list on an error talking to Nexus, but
`refresh_producer_list` had _two_ loops: it was both "do one refresh"
(from which `return`ing is correct) and also "periodically refresh"
(from which `return`ing is incorrect: it causes us to never refresh
again).

This PR splits `refresh_producer_list` into
`refresh_producer_list_{task,once}`; strongly recommend looking at the
diff with whitespace ignored, as it's basically a no-op other than this
split (which makes the `return` correct, as it only terminates a single
refresh and not all future ones).

I also added some `InlineErrorChain` bits to try to get more info from
logged errors.

Fixes #7190.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

oximeter is too eager to delete its entire state sled_data_link no longer returns any metrics
2 participants