Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update healthz documentation. #106

Merged
merged 10 commits into from
Nov 11, 2022
Merged

Update healthz documentation. #106

merged 10 commits into from
Nov 11, 2022

Conversation

robshakir
Copy link
Contributor

* (D) healthz/index.md
  - Merge with README.md
 * (M) healthz/README.md
  - Update README to cover latest changes to healthz proto.
  - Add additional documentation describing RPC implementations.
 * (M) healthz/healthz.proto
  - Editorial updates.

@hellt @marcushines -- PTAL. This updates the documentation with the decisions made in the implementation of streaming healthz.

 * (D) healthz/index.md
  - Merge with README.md
 * (M) healthz/README.md
  - Update README to cover latest changes to healthz proto.
  - Add additional documentation describing RPC implementations.
 * (M) healthz/healthz.proto
  - Editorial updates.
@robshakir robshakir requested a review from marcushines November 1, 2022 22:44
@coveralls
Copy link

coveralls commented Nov 1, 2022

Pull Request Test Coverage Report for Build 3438981640

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 0.0%

Totals Coverage Status
Change from base Build 3331306726: 0.0%
Covered Lines: 0
Relevant Lines: 0

💛 - Coveralls

Comment on lines +76 to +80
### Use of gNMI paths

Where a gNMI path (`gnoi.types.Path`) is used in `Healthz`, the path specified
should be the complete path to a specific component - i.e.,
`/components/component[name=FOO]`.
Copy link
Contributor

@hellt hellt Nov 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to clarify that these paths only address hardware components? There is a section under User Experience heading that talks about BGP service crash and the associated Healthz.Get() call.
IIRC, BGP is not exposed as a component? Is this a future thinking about the services components being part of the /components tree?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BGP itself would not necessarily be exposed as a component, but it is valid to have software processes in the /components tree. Based on this, I don't think that there is a need to restrict this if the implementation supports a health check on software processes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robshakir okay, my reading here is that it is not prescribed by healthz (or Openconfig) to expose software services under /components, moreover, the schema for /components container hints that it is meant to be used for HW components as it has sub-containers like chassis, CPU, etc.

Therefore, it seems okay to first focus on the hardware parts in the early days of healthz and approach SW components later

healthz/README.md Outdated Show resolved Hide resolved
healthz/README.md Outdated Show resolved Hide resolved
healthz/README.md Outdated Show resolved Hide resolved
Comment on lines 28 to 29
Following a health check occuring, a caller uses the `List` RPC to discover the
artifacts that are assoicated with a component and its corresponding
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems confusing to use the artifact term when we describe the ComponentStatus, as the artifact is a message encompassed in the status.
Later down the section we say that we acknowledge artifacts:

Once retrieved each artifact can be 'acknowledged' by the client of the RPC.

which confuses me, as we ack the ComponentStatus (which is also referred as event sometimes in this doc).

Shall we rather say ComponentStatus when describing the set of health status messages returned by Get/List RPCs?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree -- I've tried to standardise that "event" means a ComponentStatus. I think that this is clearer, since ComponentStatus itself seems obsure from a human understanding. PTAL (will push this change after sending this comment and addressing the others).

Comment on lines 25 to 26
(through the `Check` RPC) or a system may report the results of a check that it
has initiated of its own accord.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re "a system may report the results of a check that it has initiated of its own accord.":

  1. when a system boots, should the relevant component manager perform initial health checks and thus generate ComponentStatus for components?
  2. When a certain component changes its health status from one state to another (think a tranceiver goes from healthy to unhealthy), should the relevant component manager record a ComponentStatus event with a distinct ID automatically?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i would say it should look like this:

system boots
gnmi tree is initialized for things which have not run a check yet - they start in the unspecfied state.
During initialization of the subcomponents the system should perform initial health checks to put the components into a known state
if the component is unhealthy - generate an event
if the components is healthy - set state and move on

  1. yes if something goes unhealthly there should be an event - but how much work is done to "triage" that event is left to the vendor.

Comment on lines 38 to 39
relevant. The device itself is responsible for garbage collection any may,
if necessary, garbage collect artifacts that are not yet acknowledged.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we clarify if it is expected for a device to persist artifacts? or can they be cleared in case gNOI server restarts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they should not be cleared just because the gnoi server restarted

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a clarification -- I think generally these need to persist across component restarts too, since healthz information that corresponds to something like a rebooting linecard should not be removed as the LC reboots IMHO.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so essentially we have to persist the events until they are garbage collected/acked.

healthz/README.md Outdated Show resolved Hide resolved
healthz/README.md Outdated Show resolved Hide resolved
healthz/README.md Outdated Show resolved Hide resolved
healthz/README.md Outdated Show resolved Hide resolved
healthz/README.md Outdated Show resolved Hide resolved
healthz/README.md Outdated Show resolved Hide resolved
Copy link
Contributor

@hellt hellt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @robshakir @marcushines
I have suggested a few clarifications that you might consider valid.

PS. plus some linter/typo work

healthz/README.md Outdated Show resolved Hide resolved
healthz/README.md Outdated Show resolved Hide resolved
healthz/README.md Outdated Show resolved Hide resolved
healthz/README.md Outdated Show resolved Hide resolved
healthz/README.md Outdated Show resolved Hide resolved
healthz/README.md Outdated Show resolved Hide resolved
healthz/README.md Outdated Show resolved Hide resolved
healthz/README.md Outdated Show resolved Hide resolved
healthz/README.md Outdated Show resolved Hide resolved
healthz/README.md Outdated Show resolved Hide resolved
@robshakir
Copy link
Contributor Author

@hellt @marcushines -- PTAL. This updated spec is ready to ship from my POV.

Copy link
Contributor

@hellt hellt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @robshakir
Looks good to me, I spotted a few nit typos, but content-wise it looks great

healthz/README.md Outdated Show resolved Hide resolved
healthz/README.md Outdated Show resolved Hide resolved
implementor to determine which artifacts are generated automatically by the
system immediately. The artifact collection SHOULD NOT impact the performance
of the system.
2. The responding client system observes the transition to `UNHEALTHY` and can
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is an interesting idea
We would need to add an attribute which could signal there is "additional" work to be done to generate the artifacts

It is easy to say "show tech" at the chassis is expensive but really we should never be getting "chassis" level checks we should really be only getting the needed artifacts for specific events

You do raise a reasonable use case though - let's say a system database reports an error - instead of pausing the system which could be service impacting to dumb the db- you could mark the event get the non service impacting data then leave a mark on the event that says - if you want more you can call check(event id) and it will go generate the other data and link to id which could be done after the device is drained or otherwise safe to get more data... That seems reasonable not sure it is "required" as we don't have a use case exactly of that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I've added the event_id to the CheckRequest to indicate that artifacts should be collected for a previous event, and made the clarifications that we discussed.

@hellt PTAL at this too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robshakir where did you add the text/field about it? I can't seem to find new commits since 92516ec

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry -- this was my fault. I'd not spotted that the push needed a merge. PTAL at 8cd01e1

Comment on lines +196 to +202
The `CheckRequest` message includes an `event_id` field. When populated this
indicates that the `Check` should be performed for an event that has already
occurred within the system. The device should trigger artifact collection of
those artifacts that were not automatically collected. A `CheckRequest` for
a previous `event_id` MUST NOT overwrite previous artifacts that were collected
at the time of the event. The artifacts that are collected MUST be returned
in the artifact list for the event when reported by the `Get` or `List` RPCs.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robshakir I wonder how this is supposed to work, given that (I think) the health of a component can't be checked for the past events as the state of the component may already be HEALTHY or the artifacts collected won't contain the same state that were relevant at the time of a healthy->unhealthy transition.

I am inclined to not stall the PR though, as currently in our system we don't anticipate having expensive artifact collection processes which would require Check to work on a past event_id...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

synced up offline
copying for historical purposes

you call Check and it gathers some artifact that is based on the current state. This might include historical logs in the case of dumping some database, or retrieving some coredump from the secondary control plane.
But there's no guarantee that it captured the state at the time. For that, we need the artifacts that are automatically created to be complete.

@robshakir robshakir merged commit 79709cd into main Nov 11, 2022
robshakir added a commit that referenced this pull request Nov 16, 2022
 * (M) **/*.pb.go
   - Regenerate protobufs.
 * (M) heaslthz/healthz.proto
   - Update version to reflect new leaf addition in healthz in #106.
@marcushines marcushines deleted the healthz-doc branch November 3, 2023 14:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants