-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Design for handling SAI failures in orchagent #762
Conversation
|
||
The failure handling function should return `task_success` when the failure is properly handled without the need for another attempt (e.g., the SAI status is `SAI_STATUS_ITEM_NOT_FOUND` in remove operation). | ||
|
||
2. Return `task_failed` -- No crash, no retry, not handled successfully. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the scenarios of failure and clarified the keep running behavior.
### 2.4 DB changes | ||
An ERROR_DB will be introduced to escalate the failures from orchagent to upper layers such as fpmsyncd. | ||
|
||
The schema of ERROR_DB is designed as follows: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a discussion in the design that it is necessary to avoid accumulating failures in ERROR_DB and consuming memory. To make sure all ERROR_DB entries are consumed, the failure handling should only escalate failures when the corresponding handling mechanism is available in the upper layers. Also included a possible implementation of this behavior.
|
||
The schema of ERROR_DB is designed as follows: | ||
``` | ||
ERROR_{{SAI_API}}_TABLE|entry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree, neither the SAI type nor the Orch type would be enough to represent the context of the failure. I think a better representation of the entries in ERROR_DB is to make them consistent with the APPL_DB entries where the SAI failure happens. For example, SAI failure could happen in SAI_ROUTE_API or SAI_NEXT_HOP_GROUP_API when conducting operations for APPL_DB entry ROUTE_TABLE:0.0.0.0/0
, but in either scenario, the corresponding key in ERROR_DB should be ERROR_ROUTE_TABLE:0.0.0.0/0
so that the upper layer could know the SAI failures happen in which call and do tasks accordingly.
|
||
To avoid accumulating failures in ERROR_DB and consuming memory, it is necessary to ensure that the upper layer properly consumes the entries in ERROR_DB. | ||
To make sure all ERROR_DB entries are consumed, the failure handling should only escalate failures when the corresponding handling mechanism is available in the upper layers. | ||
One possible implementation could be escalating failures to ERROR_DB when the input `context` is valid. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, removed this part.
The field `counter` stores the number of failures for the same entry. It could be used as a reference for handling the failure. | ||
|
||
To avoid accumulating failures in ERROR_DB and consuming memory, it is necessary to ensure that the upper layer properly consumes the entries in ERROR_DB. | ||
To make sure all ERROR_DB entries are consumed, the failure handling should only escalate failures when the corresponding handling mechanism is available in the upper layers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the design to clarify that the upstream processes are expected to handle the entries in ERROR_DB and remove them once handled. Clarified that the ERROR_DB should not keep growing with the assumption valid. Also mentioned that the features are not currently available in upstream processes and need to be added for proper failure handling.
"counter": {{count}} | ||
``` | ||
|
||
The table and key in ERROR_DB correspond to the table and key in APPL_DB where SAI failures happen (e.g., SAI failure happens when conducting operations for APPL_DB entry `ROUTE_TABLE:0.0.0.0/0`, the corresponding key in ERROR_DB should be `ERROR_ROUTE_TABLE:0.0.0.0/0`). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be more future proof to prefix the APP_DB name in the key. For the above example:
APPL_DB entry `ROUTE_TABLE:0.0.0.0/0`, key in ERROR_DB should be `ERROR_APPL_DB_ROUTE_TABLE:0.0.0.0/0`
This would later allow me to have ERROR_DB entries for other sources like CONFIG_DB
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, updated in the design doc.
The field `counter` stores the number of failures for the same entry. It could be used as a reference for handling the failure. | ||
|
||
The upstream processes are expected to consume the ERROR_DB entries and remove the handled failures from the ERROR_DB. | ||
Assuming the upstream processes have the proper consumption of ERROR_DB entries and failure handling logic (these are not currently available for upstreams processes and need to be added), the ERROR_DB should not keep accumulating failures in ERROR_DB and consuming memory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When an entry is deleted from APPL_DB, any ERROR_DB entries with the same key could be cleared?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be true in most scenarios since the ERROR_DB entries are designed to track the failure in applying APPL_DB entries or basically mismatch between APPL_DB and the hardware. When an entry is removed from ERROR_DB, the mismatch between APPL_DB and the hardware should get cleared, and most likely the corresponding ERROR_DB entry is no longer needed. Yet if the upstream process believes that extra operation is needed for the failure, I think the upstream process could have the freedom to hold the entry until the operations are done.
This PR contains the design for SAI failure handling in orchagent.