Design for handling SAI failures in orchagent #762

shi-su · 2021-03-17T16:15:19Z

This PR contains the design for SAI failure handling in orchagent.

qiluo-msft · 2021-03-24T00:19:19Z

doc/SAI_failure_handling/SAI_failure_handling.md

+
+The failure handling function should return `task_success` when the failure is properly handled without the need for another attempt (e.g., the SAI status is `SAI_STATUS_ITEM_NOT_FOUND` in remove operation).
+
+2. Return `task_failed` -- No crash, no retry, not handled successfully. 


task_failed

There are several causes for `task_failed`:

- user input is invalid, such as a wrong ACL or conflicting IP addresses
- a permanent hardware error
- an internal logic error that is not critical, where we still want to keep the dataplane working
- etc. #Closed

You can mention 'no exit' or 'keep running'.

Added the scenarios of failure and clarified the keep running behavior.
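
The return-value convention discussed in this thread can be sketched as follows. This is an illustrative Python model only (orchagent itself is written in C++). The `TaskStatus` enum and the `handle_remove_status` helper are hypothetical names; only `task_success`, `task_failed`, and `SAI_STATUS_ITEM_NOT_FOUND` come from the design under review. Mapping every other status to `task_failed` is a simplifying assumption, not the design's full policy.

```python
from enum import Enum


class TaskStatus(Enum):
    # Names mirror the return values quoted from the design doc.
    task_success = "task_success"
    task_failed = "task_failed"


def handle_remove_status(sai_status: str) -> TaskStatus:
    """Map the SAI status of a remove operation to a task status (hypothetical helper)."""
    # The object is already gone, so the remove is effectively done: no retry needed.
    if sai_status in ("SAI_STATUS_SUCCESS", "SAI_STATUS_ITEM_NOT_FOUND"):
        return TaskStatus.task_success
    # Assumption: anything else is reported as failed (no crash, no retry, keep running).
    return TaskStatus.task_failed
```

With this sketch, a remove that hits `SAI_STATUS_ITEM_NOT_FOUND` is treated as handled, while an unhandled status leaves the process running and the task marked as failed.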

qiluo-msft · 2021-03-24T00:28:55Z

doc/SAI_failure_handling/SAI_failure_handling.md

+### 2.4 DB changes
+An ERROR_DB will be introduced to escalate the failures from orchagent to upper layers such as fpmsyncd.
+
+The schema of ERROR_DB is designed as follows:


schema

If you accumulate them into a table, who will delete them? otherwise it will keep growing in memory. #Closed

Added a discussion in the design that it is necessary to avoid accumulating failures in ERROR_DB and consuming memory. To make sure all ERROR_DB entries are consumed, the failure handling should only escalate failures when the corresponding handling mechanism is available in the upper layers. Also included a possible implementation of this behavior.

qiluo-msft · 2021-03-24T00:30:21Z

doc/SAI_failure_handling/SAI_failure_handling.md

+
+The schema of ERROR_DB is designed as follows:
+```
+ERROR_{{SAI_API}}_TABLE|entry


SAI_API

Taking route as an example: there are the route SAI API and the neighbor SAI API, but the upper layer does not care which API fails. #Closed

Yes, I agree: neither the SAI type nor the Orch type would be enough to represent the context of the failure. I think a better representation is to make the ERROR_DB entries consistent with the APPL_DB entries for which the SAI failure happens. For example, a SAI failure could happen in SAI_ROUTE_API or SAI_NEXT_HOP_GROUP_API when conducting operations for APPL_DB entry ROUTE_TABLE:0.0.0.0/0, but in either scenario the corresponding key in ERROR_DB should be ERROR_ROUTE_TABLE:0.0.0.0/0, so that the upper layer knows which entry the SAI failure happened for and can act accordingly.

qiluo-msft · 2021-05-21T18:52:15Z

doc/SAI_failure_handling/SAI_failure_handling.md

+
+To avoid accumulating failures in ERROR_DB and consuming memory, it is necessary to ensure that the upper layer properly consumes the entries in ERROR_DB.
+To make sure all ERROR_DB entries are consumed, the failure handling should only escalate failures when the corresponding handling mechanism is available in the upper layers.
+One possible implementation could be escalating failures to ERROR_DB when the input `context` is valid.


One possible implementation could be escalating failures to ERROR_DB when the input context is valid

The design purpose of context is not for this. #Closed

Agree, removed this part.

qiluo-msft · 2021-05-21T18:53:44Z

doc/SAI_failure_handling/SAI_failure_handling.md

+The field `counter` stores the number of failures for the same entry. It could be used as a reference for handling the failure.
+
+To avoid accumulating failures in ERROR_DB and consuming memory, it is necessary to ensure that the upper layer properly consumes the entries in ERROR_DB.
+To make sure all ERROR_DB entries are consumed, the failure handling should only escalate failures when the corresponding handling mechanism is available in the upper layers.


the failure handling should only escalate failures when the corresponding handling mechanism is available in the upper layers

Does "upper layers" mean upstream processes? If so, I don't understand the statement. #Closed

Updated the design to clarify that the upstream processes are expected to handle the entries in ERROR_DB and remove them once handled. Clarified that ERROR_DB should not keep growing as long as that assumption holds. Also mentioned that these features are not currently available in the upstream processes and need to be added for proper failure handling.
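
To make the consume-and-remove expectation concrete, here is a minimal sketch that assumes an in-memory dictionary in place of the real ERROR_DB. The `ErrorDb` class and its method names are hypothetical; only the `counter` field and the rule that upstream removes handled entries come from the design.

```python
class ErrorDb:
    """In-memory stand-in for ERROR_DB (hypothetical; not the real database)."""

    def __init__(self):
        self.entries = {}

    def record_failure(self, key: str) -> int:
        # Repeated failures for the same entry bump its counter instead of adding rows.
        entry = self.entries.setdefault(key, {"counter": 0})
        entry["counter"] += 1
        return entry["counter"]

    def consume(self, key: str, handler) -> bool:
        # The upstream process handles the failure, then removes the entry.
        entry = self.entries.get(key)
        if entry is None:
            return False
        handler(key, entry)
        del self.entries[key]
        return True
```

In this model the table only grows while failures are unhandled; once every recorded key has been passed to `consume`, `entries` is empty again.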

rck-innovium · 2021-06-16T17:22:19Z

doc/SAI_failure_handling/SAI_failure_handling.md

+    "counter": {{count}}
+```
+
+The table and key in ERROR_DB correspond to the table and key in APPL_DB where SAI failures happen (e.g., SAI failure happens when conducting operations for APPL_DB entry `ROUTE_TABLE:0.0.0.0/0`, the corresponding key in ERROR_DB should be `ERROR_ROUTE_TABLE:0.0.0.0/0`).


It would be more future-proof to include the source DB name (here APPL_DB) as a prefix in the key. For the above example:

APPL_DB entry `ROUTE_TABLE:0.0.0.0/0`, key in ERROR_DB should be `ERROR_APPL_DB_ROUTE_TABLE:0.0.0.0/0`

This would later allow me to have ERROR_DB entries for other sources like CONFIG_DB

I agree, updated in the design doc.
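
The agreed key naming can be illustrated with a small helper. This is a sketch only: the function name `error_db_key` and its parameters are hypothetical, and the `:` separator is taken from the example in this thread rather than from the schema block.

```python
def error_db_key(source_db: str, table: str, key: str, sep: str = ":") -> str:
    """Build an ERROR_DB key from the source DB name, table, and entry key."""
    # Prefixing the source DB keeps room for entries that originate outside APPL_DB.
    return f"ERROR_{source_db}_{table}{sep}{key}"
```

For the example above, `error_db_key("APPL_DB", "ROUTE_TABLE", "0.0.0.0/0")` yields `ERROR_APPL_DB_ROUTE_TABLE:0.0.0.0/0`, and passing a different `source_db` such as `CONFIG_DB` gives a key in a separate namespace.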

rck-innovium · 2021-06-16T17:24:11Z

doc/SAI_failure_handling/SAI_failure_handling.md

+The field `counter` stores the number of failures for the same entry. It could be used as a reference for handling the failure.
+
+The upstream processes are expected to consume the ERROR_DB entries and remove the handled failures from the ERROR_DB.
+Assuming the upstream processes properly consume ERROR_DB entries and implement failure handling logic (these are not currently available in upstream processes and need to be added), the ERROR_DB should not keep accumulating failures and consuming memory.


When an entry is deleted from APPL_DB, any ERROR_DB entries with the same key could be cleared?

I think this should be true in most scenarios, since the ERROR_DB entries are designed to track failures in applying APPL_DB entries, in other words a mismatch between APPL_DB and the hardware. When an entry is removed from APPL_DB, the mismatch between APPL_DB and the hardware should be cleared, and most likely the corresponding ERROR_DB entry is no longer needed. Yet if the upstream process believes that an extra operation is needed for the failure, I think it should have the freedom to hold the entry until that operation is done.
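
The clear-on-delete behavior described in this reply could look roughly like the sketch below. It assumes ERROR_DB is modeled as a plain dictionary and that upstream processes register the keys they still want to keep in a `held_keys` set; both the function name and that mechanism are hypothetical.

```python
def on_appl_db_delete(error_db: dict, held_keys: set, table: str, key: str) -> bool:
    """Drop the matching ERROR_DB entry when its APPL_DB entry is deleted.

    Returns True if an entry was removed. Keys in held_keys are kept because
    an upstream process still has work to do for that failure.
    """
    error_key = f"ERROR_APPL_DB_{table}:{key}"
    if error_key in held_keys:
        return False
    return error_db.pop(error_key, None) is not None
```

The default path removes the stale error entry together with its APPL_DB entry, while the hold set leaves room for the case where upstream needs the entry a little longer.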

shi-su added 6 commits March 15, 2021 20:43

Doc for SAI failure handling

7788145

polish doc

2dae518

Remove comments and add warm boot support

dd0cd9d

Fix grammar issues

efd115f

Update return value type

93e5421

fix a typo

312e885

shi-su requested a review from qiluo-msft March 17, 2021 16:15

qiluo-msft reviewed Mar 24, 2021

Shi Su added 3 commits May 18, 2021 16:06

Update ERROR_DB design

f7efad1

Add handling for get operation and discussion for writing ERROR_DB

910126b

Minor fix

e9f1481

qiluo-msft reviewed May 21, 2021

Clarify ERROR_DB removal logic

22b0a57

shi-su requested a review from qiluo-msft May 26, 2021 01:15

qiluo-msft previously approved these changes May 26, 2021

rck-innovium reviewed Jun 16, 2021

Add DB type for ERROR_DN key

e181ec7

shi-su dismissed qiluo-msft’s stale review via e181ec7 June 21, 2021 02:53

qiluo-msft approved these changes Jul 22, 2021

shi-su merged commit 2ec017d into sonic-net:master Jul 22, 2021
