-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
First cut of Diag gNOI micro service #9
Conversation
diag/diag.proto
Outdated
// operations on a set of ports. | ||
// 2- BURNING related RPCs: Used to perform a vendor-provided Burnin test on the | ||
// switch to ensure the box is ready to start serving traffic. Burnin tests | ||
// are typically run by HwOps in the datacenter, as part of turnup or repair |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should remove this sentence about HwOps as this is Google specific. Different companies will do things differently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Thanks for noticing.
diag/diag.proto
Outdated
// switch to ensure the box is ready to start serving traffic. Burnin tests | ||
// are typically run by HwOps in the datacenter, as part of turnup or repair | ||
// workflow. | ||
service Diag { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably note that these are only for stateless diagnostics activities (in other words, these activities will all end on their own). None of these should hold any state on the device. Stateful configuration should use the OpenConfig model with GNMI. I only mention this as other vendors support a stateful internal BERT and I am hoping we can expand the current OpenConfig models to fully support this. There is a test-signal boolean leaf defined at the moment though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure these are stateless operations and them failing/passing should not have a permanent artifact on the switch (unless there is something wrong HW-wise). Added a point on that to the comments.
However I am also interested to learn more about the statefull BERT you mentioned here. What is it and how/when it is used? Any pointers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is a pointer to the leaf:
https://github.com/openconfig/public/blob/e8981a183157a5f730f3d054da63f9314f43efb4/release/models/optical-transport/openconfig-terminal-device.yang#L754
Essentially it is very similar, but the devices support permanently turning on a test signal which sends PRBS from the port. This is useful during capacity adds and sometimes troubleshooting to see if a link or portion of a link is error free. The optical vendors I have seen just do this as stateful config instead of running for specified time.
// - When BERT is already in progress on any port specified by the request. | ||
// - In case of any low-level HW/SW internal errors. | ||
// The RPC returns an OK status of none of these situations is encountered. | ||
rpc StartBERT(StartBERTRequest) returns(StartBERTResponse) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious on your thoughts, but in another PR, I am thinking we should make sure there is an OpenConfig state leaf to describe that an interface is being used for testing. Since this is a service affecting activity, we would want a way for streaming telemetry to indicate that even though the interface may be UP/UP, it is currently under test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh definitely agree with that. This was actually in our todo. Since BERT operation are long lived operations, the oper state of an interface MUST be able to capture this.
Internally we actually do have a specific state for a port when it is in so called "diag mode". We can definitely extend gnmi to capture that as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We actually have a TESTING
state in /interfaces/interface/state/oper-status
already. Would this be OK for this use case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For optical devices, I think we also need this in the terminal-device/logical channels. We can expand this enum possibly:
https://github.com/openconfig/public/blob/e8981a183157a5f730f3d054da63f9314f43efb4/release/models/optical-transport/openconfig-terminal-device.yang#L767
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, for our use case TESTING state should be enough I think.
Addressed the comment. Thanks for the review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making the changes. LGTM, but please wait for Rob/Anees to review.
diag/diag.proto
Outdated
// The Diag service exports to main set of RPCs: | ||
// 1- BERT related RPCs: Used to perform Bit Error Rate Test (BERT) | ||
// operations on a set of ports. | ||
// 2- BURNING related RPCs: Used to perform a vendor-provided Burnin test on the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/BURNING/BURNIN/?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
diag/diag.proto
Outdated
// operations on a set of ports. | ||
// 2- BURNING related RPCs: Used to perform a vendor-provided Burnin test on the | ||
// switch to ensure the box is ready to start serving traffic. Burnin tests | ||
// are typically run in the datacenter, as part of turnup or repair workflow. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe say typically run in the field
, since not all gNOI deployed devices will live in datacentres :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
diag/diag.proto
Outdated
// use this ID (as well as the list of the ports) to stop the BERT operation | ||
// and/or get the BERT results. This RPC is expected to return an error status | ||
// in the following situations: | ||
// - When BERT operation is not supported on any of the ports specified by |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean "not supported on one or more of the ports", or "not supported on all the ports"? I assume the former, but it might be worth clarifying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, clarified.
// - When BERT is already in progress on any port specified by the request. | ||
// - In case of any low-level HW/SW internal errors. | ||
// The RPC returns an OK status of none of these situations is encountered. | ||
rpc StartBERT(StartBERTRequest) returns(StartBERTResponse) {} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We actually have a TESTING
state in /interfaces/interface/state/oper-status
already. Would this be OK for this use case?
diag/diag.proto
Outdated
// BERT operation on any of the ports specified by the request. | ||
// The RPC returns an OK status of none of these situations is encountered. | ||
// Note that a BERT operation is considered completed if switch has a | ||
// record of it. Also note that it is OK to receive a stop request for a port |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not clear what "if switch has a record of it" means. Does this mean to say that only BERT operations that the switch has the history of are considered completed. For example, a BERT operation that ran before the last reboot might be unknown?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes you got it right. After a BERT operation is done, there is nothing in HW that keeps track of the completed BERTs. So it is up to the agent to keep track of them. The agent typically has a limited resource and keeps track of the last M BERTs. Also the storage used to keep track of BERTs might be volatile (wiped out after reboot). At any point of time if the switch knows about a previous BERT, it can consider this completed. Otherwise it is considered a new BERT.
diag/diag.proto
Outdated
// The selected PRBS generating polynomial for BERT. | ||
PrbsPolynomial prbs_polynomial = 2; // required | ||
// BERT duration in mins. Must be a positive number. | ||
uint32 test_duration_in_mins = 3; // required |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer we specify this in seconds -- is there a reason to only allow integer minutes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No reason. The duration was given as mins before in CMAL case. But dont think it is a big deal to change the logic this much. Changed to seconds.
// Path to the interface corresponding to the port. | ||
gnoi.Path interface = 1; // required | ||
// The selected PRBS generating polynomial for BERT. | ||
PrbsPolynomial prbs_polynomial = 2; // required |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it required to be able to run different polynomials per port? I'm not clear what the use case for this would be.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Chips support this. I see no harm in allowing this.
repeated PerPortResponse per_port_responses = 2; | ||
} | ||
|
||
message StopBERTRequest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's where I think things get a bit more complicated by having multiple ports be part of the same BERT request. If I stop a BERT only on one port, but it has a single operation ID for that BERT, then this means that the BERT is cancelled for a subset of the ports, but running for the remainder. The current proto requires the caller to always check all statuses of the underlying ports to know the overall status. It seems like 1:1 would simplify this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I strongly believe the capability of running BERT on multiple port in one RPC is required for tools. As also explained above.
Regarding your example:
The user can start BERT on multiple ports and put them under the same operation ID. That is true. He/she can also stop BERT on one or more ports which were part of the same operation ID later. The success is determines based on the overall BER values captured on all the ports which we get via GetBERTResult RPC. So if some of the per port BERTs were stopped earlier, things should still be clear. The BERT results will tell how long BERT were run on those ports and what was the BER value on those. To me this seems clear.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't we simplify the GetBERTResult by just having it return status for all the ports that were associated with the operation id (then don't need the two modes, all and per-port)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So as I understand it, the tools have a use case to get the results for all the ports that might be part of different operation IDs.
As you said we might be able to simplify the logic a bit. I added a TODO there. Happy to simplify in case there is no use case.
diag/diag.proto
Outdated
// Path to the interface corresponding to the port. | ||
gnoi.Path interface = 1; | ||
// BERT result get status for this port. Only if the status is | ||
// BERT_STATUS_OK the rest of the fields below are meaningful. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest Only if the status is BERT_STATUS_OK are the rest of the fields meaninful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
diag/diag.proto
Outdated
// failing/passing should not leave any permanent artifact on the switch (unless | ||
// there is something wrong HW-wise). | ||
service Diag { | ||
// Starts BERT operation on a set of ports. Each BERT operation is uniquely |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we say port
here, do we mean interface
in OpenConfig parlance? Or do we mean a single physical port? I think it'd be good to clarify since it's not necessarily clear whether the gnoi.Path
messages should correspond to /interfaces/interface
entries or /components/component
entries of type PORT
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ports refer to a channelized port which we previously identified using a (slot, port, channel). In YANG, I think an interface is a good equivalent. I clarified this in this text.
Addressed Rob's comments. PTAL. |
diag/diag.proto
Outdated
// | ||
|
||
// This file defines the gNOI APIs used to perform diagnostic operations on a | ||
// switching box. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/switching box/network device/
Probably need to read through all of the comments to make sure they are generally applicable where appropriate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In looking through the comments, I think a spec doc is the right place for detailed explanations (e.g., error conditions, etc.).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I replaced switching box and switch in general.
What does the second comment mean though? Are you suggesting to provide a spec doc for BERT? Sure, but is that not good to have some explanation here too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, but my view is that it's good to have some minimal explanation here and reference the spec for more detailed explanation (easier to consume in a doc IMO). I think that change can be done in a subsequent PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having a detailed documentation SGTM. I will definitely take care of it.
Note that this is just a beginning. I will still need to talk to many teams and do necessary changes.
diag/diag.proto
Outdated
// there is something wrong HW-wise). | ||
// Note: By "port" we refer to a channelized frontpanel or backplane port on a | ||
// chassis. In OpenConfig YANG models, there is a one-to-one relationship | ||
// between a port as used here and an "interface". Therefore, the "gnoi.Path" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in this case, shouldn't we avoid the use of port
in this service to avoid confusion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Port is a well-known concept for BERT. I think we should mention it. Just referring to port as interface in the comment is going to be confusing too.
Here we are just trying to make sure people will know what the ports refer to "when they think about YANG models".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just think it could be a little confusing since we specify which "ports" to test by specifying a path to an interface -- also, the port-to-interface relationship isn't 1:1 as described here for example (when you have channelized ports).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see what you are saying. Yeah I need to clarify this more. I will make sure to take care if this in the next PR.
diag/diag.proto
Outdated
// Unique BERT operation ID specified by the client. Multiple BERTs run on | ||
// different ports can have the same BERT operation ID. This ID will be used | ||
// later to stop the operation and/or get its results. | ||
string bert_operation_id = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
might it be easier to manage these ids on the client if they are just fixed width numerics ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with you. So far IDs have been used as strings in CMAL. Wanted to keep them consistent. But I like numerical IDs as well.
For now I add a TODO and see if numbers are acceptable, if so I will change.
repeated PerPortResponse per_port_responses = 2; | ||
} | ||
|
||
message StopBERTRequest { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't we simplify the GetBERTResult by just having it return status for all the ports that were associated with the operation id (then don't need the two modes, all and per-port)
This LGTM, thanks Ali. Please wait for Anees' comments before merging. |
Hi Anees I tried to address your comments in the last commit. For some reason I could not comment on the last comment regarding GetBertResult. I am replying here. So as I understand it, the tools have a use case to get the results for all the ports that might be part of different operation IDs. As you said we might be able to simplify the logic a bit. I added a TODO there. Happy to simplify in case there is no use case. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving, but suggest we should circle back on these TODOs (and the spec) soon.
diag/diag.proto
Outdated
// | ||
|
||
// This file defines the gNOI APIs used to perform diagnostic operations on a | ||
// switching box. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, but my view is that it's good to have some minimal explanation here and reference the spec for more detailed explanation (easier to consume in a doc IMO). I think that change can be done in a subsequent PR.
diag/diag.proto
Outdated
// there is something wrong HW-wise). | ||
// Note: By "port" we refer to a channelized frontpanel or backplane port on a | ||
// chassis. In OpenConfig YANG models, there is a one-to-one relationship | ||
// between a port as used here and an "interface". Therefore, the "gnoi.Path" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just think it could be a little confusing since we specify which "ports" to test by specifying a path to an interface -- also, the port-to-interface relationship isn't 1:1 as described here for example (when you have channelized ports).
This PR defines a new gNOI micro service called Diag to perform diagnostic operations on a
switching box. At the moment the service provides APIs for BERT operation only. BURNIN will be added later.
The APIs defined here are meant o deprecate the BERT-related APIs defined in gnoi/layer2/layer2.proto. The new APIs are more extensive than the old ones and match use cases in a real prod environment.