Update healthz documentation. #106
Conversation
* (D) healthz/index.md - Merge with README.md
* (M) healthz/README.md - Update README to cover latest changes to healthz proto. Add additional documentation describing RPC implementations.
* (M) healthz/healthz.proto - Editorial updates.
> ### Use of gNMI paths
>
> Where a gNMI path (`gnoi.types.Path`) is used in `Healthz`, the path specified
> should be the complete path to a specific component - i.e.,
> `/components/component[name=FOO]`.
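To make the expected path shape concrete, here is a sketch in Python. The `Path` and `PathElem` dataclasses are local stand-ins mirroring the shape of the real `gnoi.types.Path`/`PathElem` protobuf messages, not the generated classes themselves; `component_path` and `to_string` are illustrative helpers, not part of any published API.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Minimal local mirrors of gnoi.types.PathElem / gnoi.types.Path
# (stand-ins for the generated protobuf classes).
@dataclass
class PathElem:
    name: str
    key: Dict[str, str] = field(default_factory=dict)

@dataclass
class Path:
    elem: List[PathElem] = field(default_factory=list)

def component_path(name: str) -> Path:
    """Build the complete path to a specific component."""
    return Path(elem=[
        PathElem(name="components"),
        PathElem(name="component", key={"name": name}),
    ])

def to_string(path: Path) -> str:
    """Render a Path in gNMI string form for display."""
    out = ""
    for e in path.elem:
        out += "/" + e.name
        for k, v in sorted(e.key.items()):
            out += f"[{k}={v}]"
    return out

print(to_string(component_path("FOO")))  # /components/component[name=FOO]
```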
Does it make sense to clarify that these paths only address hardware components? There is a section under the User Experience heading that talks about a BGP service crash and the associated `Healthz.Get()` call.

IIRC, BGP is not exposed as a component? Is this future thinking about service components being part of the `/components` tree?
BGP itself would not necessarily be exposed as a component, but it is valid to have software processes in the `/components` tree. Based on this, I don't think that there is a need to restrict this if the implementation supports a health check on software processes.
@robshakir okay, my reading here is that it is not prescribed by healthz (or OpenConfig) to expose software services under `/components`; moreover, the schema for the `/components` container hints that it is meant to be used for HW components, as it has sub-containers like chassis, CPU, etc. Therefore, it seems okay to first focus on the hardware parts in the early days of healthz and approach SW components later.
healthz/README.md (outdated):
> Following a health check occurring, a caller uses the `List` RPC to discover the
> artifacts that are associated with a component and its corresponding
It seems confusing to use the *artifact* term when we describe the `ComponentStatus`, as the artifact is a message encompassed in the status. Later in the section we say that we acknowledge artifacts:

> Once retrieved each artifact can be 'acknowledged' by the client of the RPC.

which confuses me, as we ack the `ComponentStatus` (which is also sometimes referred to as an *event* in this doc). Shall we rather say `ComponentStatus` when describing the set of health status messages returned by the `Get`/`List` RPCs?
I agree -- I've tried to standardise that "event" means a `ComponentStatus`. I think that this is clearer, since `ComponentStatus` itself seems obscure from a human understanding. PTAL (will push this change after sending this comment and addressing the others).
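With "event" standardised to mean a `ComponentStatus`, the list-then-acknowledge flow discussed above can be sketched as follows. This is a hypothetical in-memory model with stand-in class and field names (e.g. `HealthzStore`, `include_acknowledged`), not the generated protobuf API; it only illustrates that it is the event, not an individual artifact, that gets acknowledged.

```python
from dataclasses import dataclass

@dataclass
class ComponentStatus:   # one "event" in the standardised terminology
    id: str
    status: str          # e.g. "HEALTHY" / "UNHEALTHY"
    acknowledged: bool = False

class HealthzStore:
    """Hypothetical per-device event store backing Get/List/Acknowledge."""
    def __init__(self):
        self._events: dict[str, ComponentStatus] = {}

    def record(self, event: ComponentStatus) -> None:
        self._events[event.id] = event

    def list(self, include_acknowledged: bool = False) -> list[ComponentStatus]:
        """List events; by default only those not yet acknowledged."""
        return [e for e in self._events.values()
                if include_acknowledged or not e.acknowledged]

    def acknowledge(self, event_id: str) -> None:
        """Ack a whole event (ComponentStatus), not a single artifact."""
        self._events[event_id].acknowledged = True

store = HealthzStore()
store.record(ComponentStatus(id="e1", status="UNHEALTHY"))
store.record(ComponentStatus(id="e2", status="UNHEALTHY"))
store.acknowledge("e1")
print([e.id for e in store.list()])  # ['e2']
```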
healthz/README.md (outdated):
> (through the `Check` RPC) or a system may report the results of a check that it
> has initiated of its own accord.
re "a system may report the results of a check that it has initiated of its own accord":
- when a system boots, should the relevant component manager perform initial health checks and thus generate a `ComponentStatus` for components?
- when a certain component changes its health status from one state to another (think a transceiver goes from healthy to unhealthy), should the relevant component manager record a `ComponentStatus` event with a distinct ID automatically?
I would say it should look like this:

1. System boots.
2. The gNMI tree is initialized for things which have not run a check yet - they start in the unspecified state.
3. During initialization of the subcomponents, the system should perform initial health checks to put the components into a known state:
   - if the component is unhealthy - generate an event;
   - if the component is healthy - set state and move on.

Yes, if something goes unhealthy there should be an event - but how much work is done to "triage" that event is left to the vendor.
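The boot-time flow above can be sketched as a small state machine. Everything here (the `Status` enum values, `initialize`, the `run_check` probe) is an illustrative stand-in under the assumption that the vendor supplies the actual per-component check; it only demonstrates the ordering: start `UNSPECIFIED`, move to a known state, and emit an event only on an unhealthy result.

```python
from enum import Enum

class Status(Enum):
    UNSPECIFIED = 0
    HEALTHY = 1
    UNHEALTHY = 2

def initialize(components, run_check):
    """Boot-time sketch: all components start UNSPECIFIED, an initial
    check moves each one to a known state, and an event is generated
    only for components found unhealthy."""
    state = {c: Status.UNSPECIFIED for c in components}   # step 2
    events = []
    for c in components:                                  # step 3
        state[c] = Status.HEALTHY if run_check(c) else Status.UNHEALTHY
        if state[c] is Status.UNHEALTHY:
            events.append({"component": c, "status": state[c].name})
    return state, events

state, events = initialize(
    ["linecard0", "linecard1"],
    run_check=lambda c: c != "linecard1",  # pretend linecard1 fails its check
)
print(state["linecard0"].name, len(events))  # HEALTHY 1
```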
healthz/README.md (outdated):
> relevant. The device itself is responsible for garbage collection and may,
> if necessary, garbage collect artifacts that are not yet acknowledged.
Should we clarify whether it is expected for a device to persist artifacts? Or can they be cleared in case the gNOI server restarts?
They should not be cleared just because the gNOI server restarted.
Added a clarification -- I think these generally need to persist across component restarts too, since healthz information that corresponds to something like a rebooting linecard should not be removed as the LC reboots, IMHO.
so essentially we have to persist the events until they are garbage collected/acked.
Co-authored-by: Roman Dodin <[email protected]>
Hi @robshakir @marcushines,
I have suggested a few clarifications that you might consider valid.
PS: plus some linter/typo work.
Co-authored-by: Roman Dodin <[email protected]>
@hellt @marcushines -- PTAL. This updated spec is ready to ship from my POV.
Thanks @robshakir. Looks good to me; I spotted a few nit typos, but content-wise it looks great.
Co-authored-by: Roman Dodin <[email protected]>
> implementor to determine which artifacts are generated automatically by the
> system immediately. The artifact collection SHOULD NOT impact the performance
> of the system.
>
> 2. The responding client system observes the transition to `UNHEALTHY` and can
This is an interesting idea. We would need to add an attribute which could signal there is "additional" work to be done to generate the artifacts.

It is easy to say "show tech" at the chassis is expensive, but really we should never be getting "chassis" level checks - we should only be getting the needed artifacts for specific events.

You do raise a reasonable use case though. Let's say a system database reports an error - instead of pausing the system to dump the db, which could be service impacting, you could mark the event, get the non-service-impacting data, then leave a mark on the event that says: if you want more, you can call `Check(event id)` and it will go generate the other data and link it to the id - which could be done after the device is drained or otherwise safe to get more data. That seems reasonable; not sure it is "required", as we don't have exactly that use case.
Agreed. I've added the `event_id` to the `CheckRequest` to indicate that artifacts should be collected for a previous event, and made the clarifications that we discussed.

@hellt PTAL at this too.
@robshakir where did you add the text/field about it? I can't seem to find new commits since 92516ec
I'm sorry -- this was my fault. I'd not spotted that the push needed a merge. PTAL at 8cd01e1
> The `CheckRequest` message includes an `event_id` field. When populated this
> indicates that the `Check` should be performed for an event that has already
> occurred within the system. The device should trigger artifact collection of
> those artifacts that were not automatically collected. A `CheckRequest` for
> a previous `event_id` MUST NOT overwrite previous artifacts that were collected
> at the time of the event. The artifacts that are collected MUST be returned
> in the artifact list for the event when reported by the `Get` or `List` RPCs.
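The MUST NOT-overwrite semantics for a `Check` on a past `event_id` can be sketched as follows. The event dict, the `check` function, and `collect_expensive_artifacts` are all hypothetical stand-ins (there is no such helper in the healthz API); the point illustrated is only that deferred collection appends to, and never replaces, the artifacts captured at event time.

```python
# Hypothetical per-device event table; artifact IDs are illustrative.
events = {
    "ev-1": {"artifacts": ["log-snapshot-1"]},  # captured at event time
}

def collect_expensive_artifacts(event_id):
    """Stand-in for deferred, potentially service-impacting collection
    (e.g. a database dump) run once the device is safe to probe."""
    return [f"db-dump-for-{event_id}"]

def check(events, event_id=None):
    """Sketch of CheckRequest.event_id semantics: when event_id refers
    to a past event, newly collected artifacts are appended, and the
    artifacts captured when the event occurred are never overwritten."""
    if event_id is None:
        return None  # normal Check: create a fresh event (elided)
    extra = collect_expensive_artifacts(event_id)
    events[event_id]["artifacts"].extend(extra)  # append, never replace
    return events[event_id]

check(events, event_id="ev-1")
print(events["ev-1"]["artifacts"])
# ['log-snapshot-1', 'db-dump-for-ev-1']
```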
@robshakir I wonder how this is supposed to work, given that (I think) the health of a component can't be checked for past events, as the state of the component may already be HEALTHY, or the artifacts collected won't contain the same state that was relevant at the time of a healthy->unhealthy transition.

I am inclined not to stall the PR though, as currently in our system we don't anticipate having expensive artifact collection processes which would require `Check` to work on a past `event_id`...
Synced up offline; copying for historical purposes:

You call `Check` and it gathers some artifact that is based on the current state. This might include historical logs in the case of dumping some database, or retrieving some coredump from the secondary control plane. But there's no guarantee that it captured the state at the time. For that, we need the artifacts that are automatically created to be complete.
* (M) **/*.pb.go - Regenerate protobufs.
* (M) healthz/healthz.proto - Update version to reflect new leaf addition in healthz in #106.
@hellt @marcushines -- PTAL. This updates the documentation with the decisions made in the implementation of streaming healthz.