Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First cut of Diag gNOI micro service #9

Merged
merged 5 commits into from
Jun 4, 2018

Conversation

aghaffarkhah
Copy link
Contributor

This PR defines a new gNOI micro service called Diag to perform diagnostic operations on a
switching box. At the moment the service provides APIs for BERT operation only. BURNIN will be added later.

The APIs defined here are meant o deprecate the BERT-related APIs defined in gnoi/layer2/layer2.proto. The new APIs are more extensive than the old ones and match use cases in a real prod environment.

diag/diag.proto Outdated
// operations on a set of ports.
// 2- BURNING related RPCs: Used to perform a vendor-provided Burnin test on the
// switch to ensure the box is ready to start serving traffic. Burnin tests
// are typically run by HwOps in the datacenter, as part of turnup or repair
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should remove this sentence about HwOps as this is Google specific. Different companies will do things differently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thanks for noticing.

diag/diag.proto Outdated
// switch to ensure the box is ready to start serving traffic. Burnin tests
// are typically run by HwOps in the datacenter, as part of turnup or repair
// workflow.
service Diag {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably note that these are only for stateless diagnostics activities (in other words, these activities will all end on their own). None of these should hold any state on the device. Stateful configuration should use the OpenConfig model with GNMI. I only mention this as other vendors support a stateful internal BERT and I am hoping we can expand the current OpenConfig models to fully support this. There is a test-signal boolean leaf defined at the moment though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure these are stateless operations and them failing/passing should not have a permanent artifact on the switch (unless there is something wrong HW-wise). Added a point on that to the comments.

However I am also interested to learn more about the statefull BERT you mentioned here. What is it and how/when it is used? Any pointers?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a pointer to the leaf:
https://github.com/openconfig/public/blob/e8981a183157a5f730f3d054da63f9314f43efb4/release/models/optical-transport/openconfig-terminal-device.yang#L754

Essentially it is very similar, but the devices support permanently turning on a test signal which sends PRBS from the port. This is useful during capacity adds and sometimes troubleshooting to see if a link or portion of a link is error free. The optical vendors I have seen just do this as stateful config instead of running for specified time.

// - When BERT is already in progress on any port specified by the request.
// - In case of any low-level HW/SW internal errors.
// The RPC returns an OK status of none of these situations is encountered.
rpc StartBERT(StartBERTRequest) returns(StartBERTResponse) {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious on your thoughts, but in another PR, I am thinking we should make sure there is an OpenConfig state leaf to describe that an interface is being used for testing. Since this is a service affecting activity, we would want a way for streaming telemetry to indicate that even though the interface may be UP/UP, it is currently under test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh definitely agree with that. This was actually in our todo. Since BERT operation are long lived operations, the oper state of an interface MUST be able to capture this.

Internally we actually do have a specific state for a port when it is in so called "diag mode". We can definitely extend gnmi to capture that as well.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually have a TESTING state in /interfaces/interface/state/oper-status already. Would this be OK for this use case?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For optical devices, I think we also need this in the terminal-device/logical channels. We can expand this enum possibly:
https://github.com/openconfig/public/blob/e8981a183157a5f730f3d054da63f9314f43efb4/release/models/optical-transport/openconfig-terminal-device.yang#L767

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, for our use case TESTING state should be enough I think.

@aghaffarkhah
Copy link
Contributor Author

Addressed the comment. Thanks for the review.

Copy link
Contributor

@ejbrever ejbrever left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the changes. LGTM, but please wait for Rob/Anees to review.

diag/diag.proto Outdated
// The Diag service exports to main set of RPCs:
// 1- BERT related RPCs: Used to perform Bit Error Rate Test (BERT)
// operations on a set of ports.
// 2- BURNING related RPCs: Used to perform a vendor-provided Burnin test on the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/BURNING/BURNIN/?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

diag/diag.proto Outdated
// operations on a set of ports.
// 2- BURNING related RPCs: Used to perform a vendor-provided Burnin test on the
// switch to ensure the box is ready to start serving traffic. Burnin tests
// are typically run in the datacenter, as part of turnup or repair workflow.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe say typically run in the field, since not all gNOI deployed devices will live in datacentres :-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

diag/diag.proto Outdated
// use this ID (as well as the list of the ports) to stop the BERT operation
// and/or get the BERT results. This RPC is expected to return an error status
// in the following situations:
// - When BERT operation is not supported on any of the ports specified by
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean "not supported on one or more of the ports", or "not supported on all the ports"? I assume the former, but it might be worth clarifying.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, clarified.

// - When BERT is already in progress on any port specified by the request.
// - In case of any low-level HW/SW internal errors.
// The RPC returns an OK status of none of these situations is encountered.
rpc StartBERT(StartBERTRequest) returns(StartBERTResponse) {}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually have a TESTING state in /interfaces/interface/state/oper-status already. Would this be OK for this use case?

diag/diag.proto Outdated
// BERT operation on any of the ports specified by the request.
// The RPC returns an OK status of none of these situations is encountered.
// Note that a BERT operation is considered completed if switch has a
// record of it. Also note that it is OK to receive a stop request for a port
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not clear what "if switch has a record of it" means. Does this mean to say that only BERT operations that the switch has the history of are considered completed. For example, a BERT operation that ran before the last reboot might be unknown?

Copy link
Contributor Author

@aghaffarkhah aghaffarkhah May 29, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes you got it right. After a BERT operation is done, there is nothing in HW that keeps track of the completed BERTs. So it is up to the agent to keep track of them. The agent typically has a limited resource and keeps track of the last M BERTs. Also the storage used to keep track of BERTs might be volatile (wiped out after reboot). At any point of time if the switch knows about a previous BERT, it can consider this completed. Otherwise it is considered a new BERT.

diag/diag.proto Outdated
// The selected PRBS generating polynomial for BERT.
PrbsPolynomial prbs_polynomial = 2; // required
// BERT duration in mins. Must be a positive number.
uint32 test_duration_in_mins = 3; // required
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer we specify this in seconds -- is there a reason to only allow integer minutes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No reason. The duration was given as mins before in CMAL case. But dont think it is a big deal to change the logic this much. Changed to seconds.

// Path to the interface corresponding to the port.
gnoi.Path interface = 1; // required
// The selected PRBS generating polynomial for BERT.
PrbsPolynomial prbs_polynomial = 2; // required
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it required to be able to run different polynomials per port? I'm not clear what the use case for this would be.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chips support this. I see no harm in allowing this.

repeated PerPortResponse per_port_responses = 2;
}

message StopBERTRequest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's where I think things get a bit more complicated by having multiple ports be part of the same BERT request. If I stop a BERT only on one port, but it has a single operation ID for that BERT, then this means that the BERT is cancelled for a subset of the ports, but running for the remainder. The current proto requires the caller to always check all statuses of the underlying ports to know the overall status. It seems like 1:1 would simplify this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I strongly believe the capability of running BERT on multiple port in one RPC is required for tools. As also explained above.

Regarding your example:
The user can start BERT on multiple ports and put them under the same operation ID. That is true. He/she can also stop BERT on one or more ports which were part of the same operation ID later. The success is determines based on the overall BER values captured on all the ports which we get via GetBERTResult RPC. So if some of the per port BERTs were stopped earlier, things should still be clear. The BERT results will tell how long BERT were run on those ports and what was the BER value on those. To me this seems clear.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't we simplify the GetBERTResult by just having it return status for all the ports that were associated with the operation id (then don't need the two modes, all and per-port)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So as I understand it, the tools have a use case to get the results for all the ports that might be part of different operation IDs.

As you said we might be able to simplify the logic a bit. I added a TODO there. Happy to simplify in case there is no use case.

diag/diag.proto Outdated
// Path to the interface corresponding to the port.
gnoi.Path interface = 1;
// BERT result get status for this port. Only if the status is
// BERT_STATUS_OK the rest of the fields below are meaningful.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest Only if the status is BERT_STATUS_OK are the rest of the fields meaninful

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

diag/diag.proto Outdated
// failing/passing should not leave any permanent artifact on the switch (unless
// there is something wrong HW-wise).
service Diag {
// Starts BERT operation on a set of ports. Each BERT operation is uniquely
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we say port here, do we mean interface in OpenConfig parlance? Or do we mean a single physical port? I think it'd be good to clarify since it's not necessarily clear whether the gnoi.Path messages should correspond to /interfaces/interface entries or /components/component entries of type PORT.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ports refer to a channelized port which we previously identified using a (slot, port, channel). In YANG, I think an interface is a good equivalent. I clarified this in this text.

@aghaffarkhah
Copy link
Contributor Author

Addressed Rob's comments. PTAL.

diag/diag.proto Outdated
//

// This file defines the gNOI APIs used to perform diagnostic operations on a
// switching box.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/switching box/network device/

Probably need to read through all of the comments to make sure they are generally applicable where appropriate

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In looking through the comments, I think a spec doc is the right place for detailed explanations (e.g., error conditions, etc.).

Copy link
Contributor Author

@aghaffarkhah aghaffarkhah May 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I replaced switching box and switch in general.

What does the second comment mean though? Are you suggesting to provide a spec doc for BERT? Sure, but is that not good to have some explanation here too?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but my view is that it's good to have some minimal explanation here and reference the spec for more detailed explanation (easier to consume in a doc IMO). I think that change can be done in a subsequent PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having a detailed documentation SGTM. I will definitely take care of it.

Note that this is just a beginning. I will still need to talk to many teams and do necessary changes.

diag/diag.proto Outdated
// there is something wrong HW-wise).
// Note: By "port" we refer to a channelized frontpanel or backplane port on a
// chassis. In OpenConfig YANG models, there is a one-to-one relationship
// between a port as used here and an "interface". Therefore, the "gnoi.Path"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this case, shouldn't we avoid the use of port in this service to avoid confusion?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Port is a well-known concept for BERT. I think we should mention it. Just referring to port as interface in the comment is going to be confusing too.

Here we are just trying to make sure people will know what the ports refer to "when they think about YANG models".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just think it could be a little confusing since we specify which "ports" to test by specifying a path to an interface -- also, the port-to-interface relationship isn't 1:1 as described here for example (when you have channelized ports).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see what you are saying. Yeah I need to clarify this more. I will make sure to take care if this in the next PR.

diag/diag.proto Outdated
// Unique BERT operation ID specified by the client. Multiple BERTs run on
// different ports can have the same BERT operation ID. This ID will be used
// later to stop the operation and/or get its results.
string bert_operation_id = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

might it be easier to manage these ids on the client if they are just fixed width numerics ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you. So far IDs have been used as strings in CMAL. Wanted to keep them consistent. But I like numerical IDs as well.

For now I add a TODO and see if numbers are acceptable, if so I will change.

repeated PerPortResponse per_port_responses = 2;
}

message StopBERTRequest {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't we simplify the GetBERTResult by just having it return status for all the ports that were associated with the operation id (then don't need the two modes, all and per-port)

@robshakir
Copy link
Contributor

This LGTM, thanks Ali. Please wait for Anees' comments before merging.

@aghaffarkhah
Copy link
Contributor Author

Hi Anees I tried to address your comments in the last commit.

For some reason I could not comment on the last comment regarding GetBertResult. I am replying here. So as I understand it, the tools have a use case to get the results for all the ports that might be part of different operation IDs.

As you said we might be able to simplify the logic a bit. I added a TODO there. Happy to simplify in case there is no use case.

Copy link
Contributor

@aashaikh aashaikh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving, but suggest we should circle back on these TODOs (and the spec) soon.

diag/diag.proto Outdated
//

// This file defines the gNOI APIs used to perform diagnostic operations on a
// switching box.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, but my view is that it's good to have some minimal explanation here and reference the spec for more detailed explanation (easier to consume in a doc IMO). I think that change can be done in a subsequent PR.

diag/diag.proto Outdated
// there is something wrong HW-wise).
// Note: By "port" we refer to a channelized frontpanel or backplane port on a
// chassis. In OpenConfig YANG models, there is a one-to-one relationship
// between a port as used here and an "interface". Therefore, the "gnoi.Path"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just think it could be a little confusing since we specify which "ports" to test by specifying a path to an interface -- also, the port-to-interface relationship isn't 1:1 as described here for example (when you have channelized ports).

@aghaffarkhah aghaffarkhah merged commit 2f20d5c into openconfig:master Jun 4, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants