[VOQ] Fabric orchagent exit in Supervisor #15321
Comments
Adding more logs. With FABRIC poll, TIMEOUTs seen on random swss/.fabric asic on SUP on a fully populated chassis
@saksarav-nokia @mlok-nokia to check.
#define FABRIC_PORT_STAT_COUNTER_FLEX_COUNTER_GROUP "FABRIC_PORT_STAT_COUNTER"
The fabric polling interval is very aggressive: every 10 seconds we poll all fabric ports.
I found two areas that need optimization and fixing:
(ii) The call to updateFabricPortState() is a heavy call, and here it is redundant: https://github.com/sonic-net/sonic-swss/blob/master/orchagent/fabricportsorch.cpp#L360. This is because we already call updateFabricPortState() towards the end of getFabricPortList(), which is called in doTask().
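To illustrate the redundancy described above, here is a minimal, self-contained sketch. It is not the sonic-swss code: `ToyFabricOrch`, its members, the call counter, and the placement of the duplicate call inside `doTask()` are all hypothetical stand-ins. The only thing modeled is that a heavy state update issued at the end of the port-list path, and then issued again by the caller, does twice the work per task cycle.

```cpp
#include <cstddef>

// Hypothetical toy model, not the real FabricPortsOrch.
class ToyFabricOrch {
public:
    explicit ToyFabricOrch(bool redundantCall) : m_redundantCall(redundantCall) {}

    // One task cycle: fetch the port list, then optionally repeat the update.
    void doTask() {
        getFabricPortList();
        if (m_redundantCall) {
            updateFabricPortState();  // duplicate of the call made in getFabricPortList()
        }
    }

    std::size_t heavyCalls() const { return m_heavyCalls; }

private:
    void getFabricPortList() {
        // ... port discovery would happen here ...
        updateFabricPortState();  // state is already refreshed at the end
    }

    // Stands in for the expensive per-port reads; here it only counts calls.
    void updateFabricPortState() { ++m_heavyCalls; }

    bool m_redundantCall;
    std::size_t m_heavyCalls = 0;
};

// Runs `cycles` task cycles and returns how many heavy updates were issued.
std::size_t countHeavyCalls(bool redundantCall, std::size_t cycles) {
    ToyFabricOrch orch(redundantCall);
    for (std::size_t i = 0; i < cycles; ++i) {
        orch.doTask();
    }
    return orch.heavyCalls();
}
```

In this toy, dropping the duplicate halves the number of heavy updates per cycle; how much that saves on a real supervisor depends on what each update actually reads.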
@arlakshm f.y.i
@judyjoseph @arlakshm,
So the only way to address this issue is to optimize the aggressive polling during bootup or config reload. Thanks,
cpm_syslog.log
@kenneth-arista, please take a look at this issue.
@kenneth-arista, it looks like there is another issue with the fabric port counters. Even though all 192 ports are polled in every polling cycle and polling all 8 counters for each port takes ~0.1 s or less, we still see "syncd0#syncd: :- threadFunction: time span" logs for a few random ports in every polling cycle. When would we see this?
@saksarav-nokia can you paste the output of I believe setting @judyjoseph is correct in that there is a redundant call to We're gathering some data on our end. As a quick datapoint, we don't see orchagent restarts with config-reload nor during initial boot. However, it is not a fair comparison as we have fewer Ramons and fewer ports per Ramon. Tagging @jfeng-arista for awareness
fabric_reach.txt
@kenneth-arista is working on a PR to remove the extra loop on
also should we check on enhancing |
Call to updateFabricPortState in FabricPortsOrch::getFabricPortList() is redundant as FabricPortsOrch::doTask() already calls it. This change helps mitigate the MHz spikes during boot up of the supe as described in sonic-net/sonic-buildimage#15321.
@saksarav-nokia looking at your To help mitigate orchagent restarts, I posted sonic-net/sonic-swss#2850 to remove the redundant code. However, let's gather more info on what stats are being polled and how long the operations take before changing the polling interval.
@kenneth-arista, we have 16 Ramons with 192 SFM links in each Ramon. Since we have only 5 (out of 8) IMM cards inserted in this chassis, only 120 SFM links are up. But I see that the SONiC fabric polling code polls the status of all 192 links even if only 120 links are up.
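As a rough illustration of the observation above, the sketch below filters a port list down to the links that are up and compares the number of counter reads per cycle for one Ramon, using the figures quoted in this thread (192 links, 120 up, 8 counters per port). `ToyFabricPort`, `lanesToPoll`, and `readsPerCycle` are hypothetical names, not sonic-swss APIs, and the sketch assumes link status is already known from some cheaper check; whether counters for down links can safely be skipped is a design question this thread does not settle.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical view of a fabric port: its lane number and link status.
struct ToyFabricPort {
    int lane;
    bool up;
};

// Returns the lanes whose counters would be polled if down links were skipped.
std::vector<int> lanesToPoll(const std::vector<ToyFabricPort>& ports) {
    std::vector<int> lanes;
    for (const auto& port : ports) {
        if (port.up) {
            lanes.push_back(port.lane);
        }
    }
    return lanes;
}

// Counter reads issued per polling cycle for one ASIC.
std::size_t readsPerCycle(std::size_t portsPolled, std::size_t countersPerPort) {
    return portsPolled * countersPerPort;
}
```

With these figures, polling every link issues 192 x 8 = 1536 reads per cycle per Ramon, while polling only the 120 links that are up issues 960.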
@saksarav-nokia can you propose a PR for changing the polling code? It's not productive for me to do it if I can't test it or reproduce the problem.
Another interesting observation (I have taken port:0x1000000000122 as the example here): the SAI calls happen very close together, twice in consecutive seconds, resulting in a READ taking as long as 1337 ms. We will need to check whether there is some overlap, or whether the last polling of fabric ports did not complete before the next loop started, etc.
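If the cause does turn out to be a new polling loop starting before the previous one finished, one generic way to rule that out is an in-progress guard that skips a cycle while another is still running. The sketch below is only an illustration of that idea: `PollGuard` and `pollOnce` are hypothetical names, and nothing here reflects how syncd or the flex counter thread is actually structured.

```cpp
#include <atomic>

// Hypothetical guard that allows at most one polling cycle in flight.
class PollGuard {
public:
    // Returns true if the caller may start a cycle, false if one is already running.
    bool tryBegin() { return !m_inProgress.exchange(true); }

    // Marks the current cycle as finished.
    void end() { m_inProgress.store(false); }

private:
    std::atomic<bool> m_inProgress{false};
};

// Runs `poll` only if no other cycle is in flight; returns whether it ran.
// Assumes `poll` does not throw; otherwise the guard would stay set.
template <typename PollFn>
bool pollOnce(PollGuard& guard, PollFn poll) {
    if (!guard.tryBegin()) {
        return false;  // previous cycle still running, skip this one
    }
    poll();
    guard.end();
    return true;
}
```

Counting how often a cycle is skipped would also give a direct measurement of how frequently cycles overlap.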
Closing this issue, as we no longer see the orchagent exits with PR #2850. Fine-tuning of the fabric port counters is still needed; a new issue will be opened for that.
Description
The orchagent controlling the fabric ASIC exits on the Nokia chassis supervisor due to a TIMEOUT error. This is seen on a chassis with all the fabric cards inserted.
CPU usage is high, and "get:SAI_OBJECT_TYPE_PORT" logs are seen continuously in syslog.
Steps to reproduce the issue:
Describe the results you received:
Describe the results you expected:
Output of show version:
Output of show techsupport:
Additional information you deem important (e.g. issue happens only occasionally):