Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design for handling SAI failures in orchagent #762

Merged
merged 11 commits into from
Jul 22, 2021
Merged

Conversation

shi-su
Copy link
Contributor

@shi-su shi-su commented Mar 17, 2021

This PR contains the design for SAI failure handling in orchagent.

@shi-su shi-su requested a review from qiluo-msft March 17, 2021 16:15

The failure handling function should return `task_success` when the failure is properly handled without the need for another attempt (e.g., the SAI status is `SAI_STATUS_ITEM_NOT_FOUND` in remove operation).

2. Return `task_failed` -- No crash, no retry, not handled successfully.
Copy link
Contributor

@qiluo-msft qiluo-msft Mar 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

task_failed

There are several causes for task_failed

  1. user input is invalid, such as wrong ACL, conflicting IP addresses
  2. hardware permanent error
  3. internal logic error, but not critical, and we still want to keep dataplane working
  4. etc. #Closed

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can mentioned 'no exit' or 'keep running'


In reply to: 600051934 [](ancestors = 600051934)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the scenarios of failure and clarified the keep running behavior.

### 2.4 DB changes
An ERROR_DB will be introduced to escalate the failures from orchagent to upper layers such as fpmsyncd.

The schema of ERROR_DB is designed as follows:
Copy link
Contributor

@qiluo-msft qiluo-msft Mar 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

schema

If you accumulate them into a table, who will delete them? otherwise it will keep growing in memory. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a discussion in the design that it is necessary to avoid accumulating failures in ERROR_DB and consuming memory. To make sure all ERROR_DB entries are consumed, the failure handling should only escalate failures when the corresponding handling mechanism is available in the upper layers. Also included a possible implementation of this behavior.


The schema of ERROR_DB is designed as follows:
```
ERROR_{{SAI_API}}_TABLE|entry
Copy link
Contributor

@qiluo-msft qiluo-msft Mar 24, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SAI_API

Take route as an example, there are ROUTE SAI API and neighbor SAI API, but upper layer does not care which API fails. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I agree, neither the SAI type nor the Orch type would be enough to represent the context of the failure. I think a better representation of the entries in ERROR_DB is to make them consistent with the APPL_DB entries where the SAI failure happens. For example, SAI failure could happen in SAI_ROUTE_API or SAI_NEXT_HOP_GROUP_API when conducting operations for APPL_DB entry ROUTE_TABLE:0.0.0.0/0, but in either scenario, the corresponding key in ERROR_DB should be ERROR_ROUTE_TABLE:0.0.0.0/0 so that the upper layer could know the SAI failures happen in which call and do tasks accordingly.


To avoid accumulating failures in ERROR_DB and consuming memory, it is necessary to ensure that the upper layer properly consumes the entries in ERROR_DB.
To make sure all ERROR_DB entries are consumed, the failure handling should only escalate failures when the corresponding handling mechanism is available in the upper layers.
One possible implementation could be escalating failures to ERROR_DB when the input `context` is valid.
Copy link
Contributor

@qiluo-msft qiluo-msft May 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One possible implementation could be escalating failures to ERROR_DB when the input context is valid

The design purpose of context is not for this. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, removed this part.

The field `counter` stores the number of failures for the same entry. It could be used as a reference for handling the failure.

To avoid accumulating failures in ERROR_DB and consuming memory, it is necessary to ensure that the upper layer properly consumes the entries in ERROR_DB.
To make sure all ERROR_DB entries are consumed, the failure handling should only escalate failures when the corresponding handling mechanism is available in the upper layers.
Copy link
Contributor

@qiluo-msft qiluo-msft May 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the failure handling should only escalate failures when the corresponding handling mechanism is available in the upper layers

Does "up layers" mean upstreaming processes? If yes, I don't understand the statement. #Closed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the design to clarify that the upstream processes are expected to handle the entries in ERROR_DB and remove them once handled. Clarified that the ERROR_DB should not keep growing with the assumption valid. Also mentioned that the features are not currently available in upstream processes and need to be added for proper failure handling.

@shi-su shi-su requested a review from qiluo-msft May 26, 2021 01:15
qiluo-msft
qiluo-msft previously approved these changes May 26, 2021
"counter": {{count}}
```

The table and key in ERROR_DB correspond to the table and key in APPL_DB where SAI failures happen (e.g., SAI failure happens when conducting operations for APPL_DB entry `ROUTE_TABLE:0.0.0.0/0`, the corresponding key in ERROR_DB should be `ERROR_ROUTE_TABLE:0.0.0.0/0`).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be more future proof to prefix the APP_DB name in the key. For the above example:

APPL_DB entry `ROUTE_TABLE:0.0.0.0/0`,  key in ERROR_DB should be `ERROR_APPL_DB_ROUTE_TABLE:0.0.0.0/0`

This would later allow me to have ERROR_DB entries for other sources like CONFIG_DB

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, updated in the design doc.

The field `counter` stores the number of failures for the same entry. It could be used as a reference for handling the failure.

The upstream processes are expected to consume the ERROR_DB entries and remove the handled failures from the ERROR_DB.
Assuming the upstream processes have the proper consumption of ERROR_DB entries and failure handling logic (these are not currently available for upstreams processes and need to be added), the ERROR_DB should not keep accumulating failures in ERROR_DB and consuming memory.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When an entry is deleted from APPL_DB, any ERROR_DB entries with the same key could be cleared?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be true in most scenarios since the ERROR_DB entries are designed to track the failure in applying APPL_DB entries or basically mismatch between APPL_DB and the hardware. When an entry is removed from ERROR_DB, the mismatch between APPL_DB and the hardware should get cleared, and most likely the corresponding ERROR_DB entry is no longer needed. Yet if the upstream process believes that extra operation is needed for the failure, I think the upstream process could have the freedom to hold the entry until the operations are done.

@shi-su shi-su merged commit 2ec017d into sonic-net:master Jul 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants