Rev | Date | Author | Change Description |
---|---|---|---|
0.1 | | Mykola Faryma | Initial version |
This document provides general information about the watermark feature in SONiC and describes its high-level design.
Definitions/Abbreviation | Description |
---|---|
gRPC | gRPC Remote Procedure Calls |
gNMI | gRPC Network Management Interface |
API | Application Programming Interface |
SAI | Switch Abstraction Interface |
The following diagram gives a top-level overview of the architecture:
System data telemetry infrastructure. It allows clients to request data from the SONiC DBs (and more).
Located in Redis DB instance #2, which runs inside the "database" container. Redis stores data as key-value tuples, needs no predefined schema, and holds various counters such as port counters, ACL counters, etc.
This component runs in the "orchagent" Docker container and is responsible for processing updates to the APP DB and making the corresponding changes in the ASIC DB via SAI Redis.
SAI Redis is an implementation of the SAI API that translates API calls into SAI objects stored in the ASIC DB.
Redis DB instance #1. Holds serialized SAI objects.
Reads ASIC DB data (SAI objects) and performs the appropriate calls to the Switch SAI.
A unified API that represents the switch state as a set of objects. In SONiC it has two implementations: the SAI DB frontend and the ASIC SDK wrapper.
The following watermarks should be supported:
Watermark | SAI attribute mapping |
---|---|
Ingress headroom per PG | SAI_INGRESS_PRIORITY_GROUP_STAT_XOFF_ROOM_WATERMARK_BYTES |
Ingress shared pool occupancy per PG | SAI_INGRESS_PRIORITY_GROUP_STAT_SHARED_WATERMARK_BYTES |
Egress shared pool occupancy per queue (including both unicast queues and multicast queues) | SAI_QUEUE_STAT_SHARED_WATERMARK_BYTES |
System behavior: we consider a maximum of one regular user and a maximum of one special user that comes from streaming telemetry (gRPC).
- Streaming telemetry is only interested in the periodic watermark, i.e. it queries the watermark at regular intervals. The interval is configurable. Streaming telemetry does not care about the persistent watermark.
- A regular user is able to query and reset the watermark. When the watermark is reset, it starts a new recording from the time the reset is issued.
- A regular user is able to query and reset the persistent watermark. When the persistent watermark is reset, it starts a new recording from the time the reset is issued.
When one regular user and the streaming telemetry coexist, they do not interfere with each other; their behavior stays the same as described above. The software should therefore handle the following sequence of events and return the correct watermark values to each user (a toy sketch of these semantics follows the timeline below):
t0 - clear user watermark event
t1 - show user watermark event. Shows highest watermark value for the period t0-t1
t2 - show user watermark event. Shows highest watermark value for the period t0-t2
t3 - clear persistent watermark event
t4 - show persistent watermark event. Shows highest watermark value for the period t3-t4
t5 - show persistent watermark event. Shows highest watermark value for the period t3-t5
t6 - clear persistent watermark event
t7 - clear user watermark event
t8 - show user watermark event. Shows highest watermark value for the period t7-t8
t9 - show persistent watermark event. Shows highest watermark value for the period t6-t9
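As referenced above, here is a toy Python sketch of these semantics (the class and names are purely illustrative, not part of the design): the user and persistent watermarks are two independently resettable running maxima over the same stream of hardware readings.

```python
class WatermarkTracker:
    """Toy model: the user and persistent watermarks are independent
    running maxima over the same stream of hardware readings."""

    def __init__(self):
        self.user = 0        # cleared by 'clear ... watermark'
        self.persistent = 0  # cleared by 'clear ... persistent-watermark'

    def sample(self, hw_value):
        # Called on every poll of the hardware watermark counter.
        self.user = max(self.user, hw_value)
        self.persistent = max(self.persistent, hw_value)

t = WatermarkTracker()
t.sample(100)
t.user = 0           # clear user watermark event (t0)
t.sample(40)
print(t.user)        # 40  - highest value since the user clear
print(t.persistent)  # 100 - unaffected by the user clear
```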
- "COUNTERS:queue_vid"
- SAI_QUEUE_STAT_SHARED_WATERMARK_BYTES
- "COUNTERS:pg_vid"
- SAI_INGRESS_PRIORITY_GROUP_STAT_XOFF_ROOM_WATERMARK_BYTES
- SAI_INGRESS_PRIORITY_GROUP_STAT_SHARED_WATERMARK_BYTES
- "COUNTERS_PG_PORT_MAP" - map PG oid to port oid
- "COUNTERS_PG_NAME_MAP" - map PG oid to PG name
- "COUNTERS_PG_INDEX_MAP" - map PG oid to PG index
The watermark counters are provided via the Flex Counter infrastructure, with a period of 1 s by default. The Flex Counter clears the watermark value in HW on every read.
Table | Updated by | Cleared by | Used by | Purpose |
---|---|---|---|---|
COUNTERS | Flex Counter | No need to clear; the Flex Counter clears the value in HW every 1 s (by default) and overwrites the DB | Lua plugins (Flex Counter plugins) | Contains the counters updated by the Flex Counter |
PERIODIC_WATERMARKS | Flex Counter Lua plugins | Cleared every telemetry period (the watermark orch handles the timer) | CLI (show queue|priority-group watermark), accessible for telemetry via virtual path | Contains the telemetry watermarks |
PERSISTENT_WATERMARKS | Flex Counter Lua plugins | Cleared by the user via CLI (clear queue|priority-group persistent-watermark) | CLI (show queue|priority-group persistent-watermark), accessible for telemetry via virtual path | Contains the highest watermark since switch boot or the last clear of the persistent watermark |
USER_WATERMARKS | Flex Counter Lua plugins | Cleared on user request (clear queue|priority-group watermark) | CLI (show queue|priority-group watermark) | Contains the highest watermark since the last user clear |
The structure of all three of these tables is the same as the COUNTERS table, but the hashes contain only the watermark counters.
For example:
- "PERIODIC_WATERMARKS:queue_vid"
  - "SAI_QUEUE_STAT_SHARED_WATERMARK_BYTES"
- "PERIODIC_WATERMARKS:pg_vid"
  - "SAI_INGRESS_PRIORITY_GROUP_STAT_XOFF_ROOM_WATERMARK_BYTES"
  - "SAI_INGRESS_PRIORITY_GROUP_STAT_SHARED_WATERMARK_BYTES"
- "PERSISTENT_WATERMARKS:queue_vid"
  - "SAI_QUEUE_STAT_SHARED_WATERMARK_BYTES"
- "PERSISTENT_WATERMARKS:pg_vid"
  - "SAI_INGRESS_PRIORITY_GROUP_STAT_XOFF_ROOM_WATERMARK_BYTES"
  - "SAI_INGRESS_PRIORITY_GROUP_STAT_SHARED_WATERMARK_BYTES"
- "USER_WATERMARKS:queue_vid"
  - "SAI_QUEUE_STAT_SHARED_WATERMARK_BYTES"
- "USER_WATERMARKS:pg_vid"
  - "SAI_INGRESS_PRIORITY_GROUP_STAT_XOFF_ROOM_WATERMARK_BYTES"
  - "SAI_INGRESS_PRIORITY_GROUP_STAT_SHARED_WATERMARK_BYTES"
The CLI flow does not involve any logic; the CLI only reads the data from the related table in the DB (see the table above) and does not do any comparison between watermark values. A minimal sketch of this read path follows.
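A sketch of how thin that layer is, under the same assumptions about the name map as in the earlier example (USER_WATERMARKS is read verbatim, nothing is computed):

```python
import redis

db = redis.Redis(host="127.0.0.1", port=6379, db=2, decode_responses=True)

# One row per PG: look the vid up in the name map, read the stored
# watermark as-is, and print it. No comparison or aggregation here.
for name, vid in db.hgetall("COUNTERS_PG_NAME_MAP").items():
    wm = db.hget("USER_WATERMARKS:" + vid,
                 "SAI_INGRESS_PRIORITY_GROUP_STAT_SHARED_WATERMARK_BYTES")
    print(name, wm or "0")
```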
A new script and alias should be implemented to display the watermark values:
$ show priority-group [watermark|persistent-watermark] headroom
Ingress headroom per PG:
Interface PG0 PG1 PG2 PG3 PG4 PG5 PG6 PG7
Ethernet0 0 0 0 23 0 0 0 0
…
Ethernet128 0 0 0 0 0 0 0 0
$ show priority-group [watermark|persistent-watermark] shared
Ingress shared pool occupancy per PG:
Interface PG0 PG1 PG2 PG3 PG4 PG5 PG6 PG7
Ethernet0 0 1092 0 380 0 0 0 0
…
Ethernet128 0 0 0 0 0 0 0 0
$ show queue [watermark|persistent-watermark] unicast
Egress shared pool occupancy per unicast queue:
Interface UC0 UC1 UC2 UC3 UC4 UC5 UC6 UC7
Ethernet0 0 14 0 11 0 1 0 0
…
Ethernet128 0 0 0 0 0 0 0 0
$ show queue [watermark|persistent-watermark] multicast
Egress shared pool occupancy per multicast queue:
Interface MC0 MC1 MC2 MC3 MC4 MC5 MC6 MC7
Ethernet0 0 3 0 0 0 0 0 0
…
Ethernet128 0 0 0 0 0 0 0 0
In addition clear functionality will be added:
# clear priority-group [watermark|persistent-watermark] headroom
# clear priority-group [watermark|persistent-watermark] shared
# clear queue [watermark|persistent-watermark] unicast
# clear queue [watermark|persistent-watermark] multicast
The user can clear the persistent watermark and the "user" watermark. The user cannot clear the periodic (telemetry) watermark. The clear command requires sudo, as the watermark is shared between all users, so a clear will affect every user (e.g., when several people are connected through SSH).
The telemetry interval will be available for viewing and configuring with the following CLI:
$ show watermark telemetry interval
# config watermark telemetry interval <value>
Note: after a new interval is configured, it takes effect only when the current telemetry interval ends.
In order to keep track of the highest watermarks, plugins for queues and priority groups will be implemented. They read the new watermark value from the COUNTERS table, compare it with the stored values, and overwrite the values in the PERIODIC_WATERMARKS, PERSISTENT_WATERMARKS and USER_WATERMARKS tables.
The plugin logic as pseudocode (a runnable Python sketch of the same logic follows):
lua:
PERIODIC_WATERMARKS[object_vid][watermark_name] = max(COUNTERS[object_vid][watermark_name], PERIODIC_WATERMARKS[object_vid][watermark_name])
PERSISTENT_WATERMARKS[object_vid][watermark_name] = max(COUNTERS[object_vid][watermark_name], PERSISTENT_WATERMARKS[object_vid][watermark_name])
USER_WATERMARKS[object_vid][watermark_name] = max(COUNTERS[object_vid][watermark_name], USER_WATERMARKS[object_vid][watermark_name])
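The real plugins are Lua scripts executed by the Flex Counter; the following Python sketch shows the same max-and-overwrite logic against the DB (key and field names as in the examples above):

```python
import redis

db = redis.Redis(host="127.0.0.1", port=6379, db=2, decode_responses=True)

WATERMARK_TABLES = ("PERIODIC_WATERMARKS", "PERSISTENT_WATERMARKS",
                    "USER_WATERMARKS")

def update_watermarks(object_vid, watermark_names):
    # Keep the larger of the fresh HW reading and the stored value,
    # independently for each of the three watermark tables.
    for name in watermark_names:
        new = int(db.hget("COUNTERS:" + object_vid, name) or 0)
        for table in WATERMARK_TABLES:
            key = table + ":" + object_vid
            old = int(db.hget(key, name) or 0)
            db.hset(key, name, max(new, old))
```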
Port orch should be updated to:
- implement new Flex Counter groups for queue and PG watermarks. These groups are configured in read-and-clear stats mode, meaning the value is cleared in HW every time it is read.
- implement PG-to-port map generation
New watermark orch should be implemented with the following functionality:
- Handle watermark configuration, for example configuring TELEMETRY_INTERVAL.
- Listen to the CLEAR_WATERMARK notification channel and handle clear watermark requests for USER_WATERMARKS and PERSISTENT_WATERMARKS for every type: PG_HEADROOM, PG_SHARED, QUEUE_UNICAST, QUEUE_MULTICAST. A clear request only means clearing the data from the related table.
- Create and manage a timer which clears the telemetry watermark every TELEMETRY_INTERVAL (a rough sketch of the last two behaviors follows this list).
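The orch itself lives in C++ in swss; this Python sketch (notification plumbing omitted, key names as in the examples above) only illustrates the two behaviors:

```python
import threading
import redis

db = redis.Redis(host="127.0.0.1", port=6379, db=2, decode_responses=True)

def clear_watermarks(table, vids):
    # A clear request only wipes the stored values; the Lua plugins
    # repopulate the hashes on the next Flex Counter poll.
    for vid in vids:
        db.delete(table + ":" + vid)

def telemetry_timer(interval_s, vids):
    # Clear the telemetry watermarks every TELEMETRY_INTERVAL seconds.
    clear_watermarks("PERIODIC_WATERMARKS", vids)
    threading.Timer(interval_s, telemetry_timer, (interval_s, vids)).start()
```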
Flex counter should be extended to support new PG counters.
Add a new table, WATERMARK_TABLE, with fields such as TELEMETRY_INTERVAL.
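For illustration, the entry might be written like this, assuming CONFIG DB is Redis instance #4 and a single TELEMETRY key (the key layout and field name here are assumptions, not fixed by this design):

```python
import redis

# Assumption: CONFIG DB is Redis instance #4.
cfg = redis.Redis(host="127.0.0.1", port=6379, db=4, decode_responses=True)
# Hypothetical layout: one TELEMETRY entry holding the interval in seconds.
cfg.hset("WATERMARK_TABLE|TELEMETRY", "TELEMETRY_INTERVAL", "120")
```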
FlexCounter should be extended to:
- collect PG stats
- generate maps (PG to port, PG to index, PG to name)
- support a new attribute, STATS_MODE
- use get_*_stats_ext() calls for counter collection, to support the read_and_clear stats mode

To support the stats mode, the flex counter group schema will be extended (a publishing sketch follows the example):
- "POLL_INTERVAL"
- "1000"
- "STATS_MODE"
- "STATS_MODE_READ_AND_CLEAR"
- "FLEX_COUNTER_STATUS"
- "disable"
The SAI APIs and calls used are:
- sai_queue_api: sai_get_queue_stats_ext()
- sai_buffer_api: sai_get_ingress_priority_group_stats_ext()
Sonic-telemetry will have access to the data in the PERIODIC_WATERMARKS and PERSISTENT_WATERMARKS tables. For this, the virtual DB should be extended to access these tables, and the virtual path should support mapping ports to queues and priority groups. The exact syntax of the virtual paths is TBD.
Examples of virtual paths:
COUNTERS_DB | "WATERMARKS/Ethernet*/Queues/PERIODIC_WATERMARKS" | Queue watermarks on all Ethernet ports |
COUNTERS_DB | "WATERMARKS/Ethernet<port number >/Queues/PERIODIC_WATERMARKS" |
Queue watermarks on one Ethernet ports |
COUNTERS_DB | "WATERMARKS/Ethernet*/PriorityGroups/PERIODIC_WATERMARKS" | PG watermarks on all Ethernet ports |
COUNTERS_DB | "WATERMARKS/Ethernet<port number >/PriorityGroups/PERIODIC_WATERMARKS" |
PG watermarks on one Ethernet ports |
COUNTERS_DB | "WATERMARKS/Ethernet*/Queues/PERSISTENT_WATERMARKS" | Queue highest watermarks on all Ethernet ports |
COUNTERS_DB | "WATERMARKS/Ethernet<port number >/Queues/PERSISTENT_WATERMARKS" |
Queue highest watermarks on one Ethernet ports |
COUNTERS_DB | "WATERMARKS/Ethernet*/PriorityGroups/PERSISTENT_WATERMARKS" | PG highest watermarks on all Ethernet ports |
COUNTERS_DB | "WATERMARKS/Ethernet<port number >/PriorityGroups/PERSISTENT_WATERMARKS" |
PG highest watermarks on one Ethernet ports |
The core components are the Flex Counter, the watermark orch, the DB, and the CLI.
The Flex Counter reads and clears the watermarks with a period of 1 s by default. The values are written directly to the COUNTERS table. The Flex Counter also has plugins configured for queues and PGs, which are triggered on every Flex Counter group interval. The Lua plugin updates PERIODIC_WATERMARKS, PERSISTENT_WATERMARKS and USER_WATERMARKS if the new value exceeds the value read from the table.
The watermark orch has two main functions:
- Handle the timer that clears the PERIODIC_WATERMARKS table, and handle configuration of the timer interval.
- Handle clear notifications. On a clear event the orch simply zeroes out the corresponding watermarks in the table; they are soon repopulated by the Lua plugin.
The DB contains all the tables with watermarks, plus the configuration table.
The CLI reads the watermarks from the tables, formats them, and outputs the result.
The watermark orch handles notifications on changes in the WATERMARK_TABLE in CONFIG DB. The new interval is assigned to the timer during timer handling, so the orch resets the interval only when the current timer expires.