Update healthz documentation. #106
Conversation
* (D) healthz/index.md - Merge with README.md
* (M) healthz/README.md - Update README to cover latest changes to healthz proto. Add additional documentation describing RPC implementations.
* (M) healthz/healthz.proto - Editorial updates.
> ### Use of gNMI paths
>
> Where a gNMI path (`gnoi.types.Path`) is used in `Healthz`, the path specified
> should be the complete path to a specific component - i.e.,
> `/components/component[name=FOO]`.
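To make the expected path shape concrete, here is a sketch in Python. The `Path` and `PathElem` dataclasses are local stand-ins mirroring the shape of the real `gnoi.types.Path`/`PathElem` protobuf messages, not the generated classes themselves; `component_path` and `to_string` are illustrative helpers, not part of any published API.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Minimal local mirrors of gnoi.types.PathElem / gnoi.types.Path
# (stand-ins for the generated protobuf classes).
@dataclass
class PathElem:
    name: str
    key: Dict[str, str] = field(default_factory=dict)

@dataclass
class Path:
    elem: List[PathElem] = field(default_factory=list)

def component_path(name: str) -> Path:
    """Build the complete path to a specific component."""
    return Path(elem=[
        PathElem(name="components"),
        PathElem(name="component", key={"name": name}),
    ])

def to_string(path: Path) -> str:
    """Render a Path in gNMI string form for display."""
    out = ""
    for e in path.elem:
        out += "/" + e.name
        for k, v in sorted(e.key.items()):
            out += f"[{k}={v}]"
    return out

print(to_string(component_path("FOO")))  # /components/component[name=FOO]
```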
Does it make sense to clarify that these paths only address hardware components? There is a section under the User Experience heading that talks about a BGP service crash and the associated `Healthz.Get()` call.

IIRC, BGP is not exposed as a component? Is this future thinking about service components being part of the `/components` tree?
BGP itself would not necessarily be exposed as a component, but it is valid to have software processes in the `/components` tree. Based on this, I don't think that there is a need to restrict this if the implementation supports a health check on software processes.
@robshakir okay, my reading here is that it is not prescribed by healthz (or OpenConfig) to expose software services under `/components`; moreover, the schema for the `/components` container hints that it is meant to be used for HW components, as it has sub-containers like chassis, CPU, etc. Therefore, it seems okay to first focus on the hardware parts in the early days of healthz and approach SW components later.
healthz/README.md (outdated):
> Following a health check occurring, a caller uses the `List` RPC to discover the
> artifacts that are associated with a component and its corresponding
It seems confusing to use the *artifact* term when we describe the `ComponentStatus`, as the artifact is a message encompassed in the status. Later in the section we say that we acknowledge artifacts:

> Once retrieved each artifact can be 'acknowledged' by the client of the RPC.

which confuses me, as we ack the `ComponentStatus` (which is also sometimes referred to as an *event* in this doc). Shall we rather say `ComponentStatus` when describing the set of health status messages returned by the `Get`/`List` RPCs?
I agree -- I've tried to standardise that "event" means a `ComponentStatus`. I think that this is clearer, since `ComponentStatus` itself seems obscure from a human understanding. PTAL (will push this change after sending this comment and addressing the others).
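With "event" standardised to mean a `ComponentStatus`, the list-then-acknowledge flow discussed above can be sketched as follows. This is a hypothetical in-memory model with stand-in class and field names (e.g. `HealthzStore`, `include_acknowledged`), not the generated protobuf API; it only illustrates that it is the event, not an individual artifact, that gets acknowledged.

```python
from dataclasses import dataclass

@dataclass
class ComponentStatus:   # one "event" in the standardised terminology
    id: str
    status: str          # e.g. "HEALTHY" / "UNHEALTHY"
    acknowledged: bool = False

class HealthzStore:
    """Hypothetical per-device event store backing Get/List/Acknowledge."""
    def __init__(self):
        self._events: dict[str, ComponentStatus] = {}

    def record(self, event: ComponentStatus) -> None:
        self._events[event.id] = event

    def list(self, include_acknowledged: bool = False) -> list[ComponentStatus]:
        """List events; by default only those not yet acknowledged."""
        return [e for e in self._events.values()
                if include_acknowledged or not e.acknowledged]

    def acknowledge(self, event_id: str) -> None:
        """Ack a whole event (ComponentStatus), not a single artifact."""
        self._events[event_id].acknowledged = True

store = HealthzStore()
store.record(ComponentStatus(id="e1", status="UNHEALTHY"))
store.record(ComponentStatus(id="e2", status="UNHEALTHY"))
store.acknowledge("e1")
print([e.id for e in store.list()])  # ['e2']
```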
healthz/README.md (outdated):
> (through the `Check` RPC) or a system may report the results of a check that it
> has initiated of its own accord.
re "a system may report the results of a check that it has initiated of its own accord":
- when a system boots, should the relevant component manager perform initial health checks and thus generate a `ComponentStatus` for components?
- when a certain component changes its health status from one state to another (think a transceiver goes from healthy to unhealthy), should the relevant component manager record a `ComponentStatus` event with a distinct ID automatically?
I would say it should look like this:

1. System boots.
2. The gNMI tree is initialized for things which have not run a check yet - they start in the unspecified state.
3. During initialization of the subcomponents, the system should perform initial health checks to put the components into a known state:
   - if the component is unhealthy - generate an event;
   - if the component is healthy - set state and move on.

Yes, if something goes unhealthy there should be an event - but how much work is done to "triage" that event is left to the vendor.
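The boot-time flow above can be sketched as a small state machine. Everything here (the `Status` enum values, `initialize`, the `run_check` probe) is an illustrative stand-in under the assumption that the vendor supplies the actual per-component check; it only demonstrates the ordering: start `UNSPECIFIED`, move to a known state, and emit an event only on an unhealthy result.

```python
from enum import Enum

class Status(Enum):
    UNSPECIFIED = 0
    HEALTHY = 1
    UNHEALTHY = 2

def initialize(components, run_check):
    """Boot-time sketch: all components start UNSPECIFIED, an initial
    check moves each one to a known state, and an event is generated
    only for components found unhealthy."""
    state = {c: Status.UNSPECIFIED for c in components}   # step 2
    events = []
    for c in components:                                  # step 3
        state[c] = Status.HEALTHY if run_check(c) else Status.UNHEALTHY
        if state[c] is Status.UNHEALTHY:
            events.append({"component": c, "status": state[c].name})
    return state, events

state, events = initialize(
    ["linecard0", "linecard1"],
    run_check=lambda c: c != "linecard1",  # pretend linecard1 fails its check
)
print(state["linecard0"].name, len(events))  # HEALTHY 1
```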
healthz/README.md (outdated):
> relevant. The device itself is responsible for garbage collection and may,
> if necessary, garbage collect artifacts that are not yet acknowledged.
Should we clarify whether it is expected for a device to persist artifacts? Or can they be cleared in case the gNOI server restarts?
They should not be cleared just because the gNOI server restarted.
Added a clarification -- I think these generally need to persist across component restarts too, since healthz information that corresponds to something like a rebooting linecard should not be removed as the LC reboots, IMHO.
so essentially we have to persist the events until they are garbage collected/acked.
Co-authored-by: Roman Dodin <[email protected]>
Hi @robshakir @marcushines,
I have suggested a few clarifications that you might consider valid.
PS: plus some linter/typo work.
Co-authored-by: Roman Dodin <[email protected]>
@hellt @marcushines -- PTAL. This updated spec is ready to ship from my POV.
Thanks @robshakir. Looks good to me; I spotted a few nit typos, but content-wise it looks great.
Co-authored-by: Roman Dodin <[email protected]>
> implementor to determine which artifacts are generated automatically by the
> system immediately. The artifact collection SHOULD NOT impact the performance
> of the system.
>
> 2. The responding client system observes the transition to `UNHEALTHY` and can
This is an interesting idea. We would need to add an attribute which could signal there is "additional" work to be done to generate the artifacts.

It is easy to say "show tech" at the chassis is expensive, but really we should never be getting "chassis" level checks - we should only be getting the needed artifacts for specific events.

You do raise a reasonable use case though. Let's say a system database reports an error - instead of pausing the system to dump the db, which could be service impacting, you could mark the event, get the non-service-impacting data, then leave a mark on the event that says: if you want more, you can call `Check(event id)` and it will go generate the other data and link it to the id - which could be done after the device is drained or otherwise safe to get more data. That seems reasonable; not sure it is "required", as we don't have exactly that use case.
Agreed. I've added the `event_id` to the `CheckRequest` to indicate that artifacts should be collected for a previous event, and made the clarifications that we discussed.

@hellt PTAL at this too.
@robshakir where did you add the text/field about it? I can't seem to find new commits since 92516ec
I'm sorry -- this was my fault. I'd not spotted that the push needed a merge. PTAL at 8cd01e1
> The `CheckRequest` message includes an `event_id` field. When populated this
> indicates that the `Check` should be performed for an event that has already
> occurred within the system. The device should trigger artifact collection of
> those artifacts that were not automatically collected. A `CheckRequest` for
> a previous `event_id` MUST NOT overwrite previous artifacts that were collected
> at the time of the event. The artifacts that are collected MUST be returned
> in the artifact list for the event when reported by the `Get` or `List` RPCs.
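The MUST NOT-overwrite semantics for a `Check` on a past `event_id` can be sketched as follows. The event dict, the `check` function, and `collect_expensive_artifacts` are all hypothetical stand-ins (there is no such helper in the healthz API); the point illustrated is only that deferred collection appends to, and never replaces, the artifacts captured at event time.

```python
# Hypothetical per-device event table; artifact IDs are illustrative.
events = {
    "ev-1": {"artifacts": ["log-snapshot-1"]},  # captured at event time
}

def collect_expensive_artifacts(event_id):
    """Stand-in for deferred, potentially service-impacting collection
    (e.g. a database dump) run once the device is safe to probe."""
    return [f"db-dump-for-{event_id}"]

def check(events, event_id=None):
    """Sketch of CheckRequest.event_id semantics: when event_id refers
    to a past event, newly collected artifacts are appended, and the
    artifacts captured when the event occurred are never overwritten."""
    if event_id is None:
        return None  # normal Check: create a fresh event (elided)
    extra = collect_expensive_artifacts(event_id)
    events[event_id]["artifacts"].extend(extra)  # append, never replace
    return events[event_id]

check(events, event_id="ev-1")
print(events["ev-1"]["artifacts"])
# ['log-snapshot-1', 'db-dump-for-ev-1']
```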
@robshakir I wonder how this is supposed to work, given that (I think) the health of a component can't be checked for past events, as the state of the component may already be HEALTHY, or the artifacts collected won't contain the same state that was relevant at the time of a healthy->unhealthy transition.

I am inclined not to stall the PR though, as currently in our system we don't anticipate having expensive artifact collection processes which would require `Check` to work on a past `event_id`...
Synced up offline; copying for historical purposes:

You call `Check` and it gathers some artifact that is based on the current state. This might include historical logs in the case of dumping some database, or retrieving some coredump from the secondary control plane. But there's no guarantee that it captured the state at the time. For that, we need the artifacts that are automatically created to be complete.
* (M) **/*.pb.go - Regenerate protobufs.
* (M) healthz/healthz.proto - Update version to reflect new leaf addition in healthz in #106.
@hellt @marcushines -- PTAL. This updates the documentation with the decisions made in the implementation of streaming healthz.