syseepromd crashing repeatedly on SONiC.20201231.02 #8263

Closed
sumukhatv opened this issue Jul 27, 2021 · 3 comments · Fixed by #8339
@sumukhatv
Contributor

Description

syseepromd is crashing on Arista-7060. Upon trying to restart syseepromd, it crashes over and over again.

Steps to reproduce the issue:

  1. Log in to an Arista-7060CX-32S-D48C8 device running SONiC.20201231.02 that has reported the issue
  2. Restart syseepromd using supervisorctl inside pmon container
  3. Follow the syslog

Describe the results you received:

syslog.10.gz:Jul 21 20:34:28.913344 SONiC-01T0 INFO pmon#supervisord 2021-07-21 20:34:28,912 INFO exited: syseepromd (exit status 5; not expected)
syslog.10.gz:Jul 21 20:34:30.362112 SONiC-01T0 INFO pmon#supervisord 2021-07-21 20:34:30,361 INFO exited: syseepromd (exit status 5; not expected)
syslog.10.gz:Jul 21 20:34:32.867162 SONiC-01T0 INFO pmon#supervisord 2021-07-21 20:34:32,866 INFO exited: syseepromd (exit status 5; not expected)
syslog.10.gz:Jul 21 20:34:36.344441 SONiC-01T0 INFO pmon#supervisord 2021-07-21 20:34:36,343 INFO exited: syseepromd (exit status 5; not expected)

Describe the results you expected:

Output of show version:

SONiC Software Version: SONiC.20201231.02
Distribution: Debian 10.9
Kernel: 4.19.0-12-2-amd64
Build commit: 27645bc105
Build date: Fri Jun  4 11:25:23 UTC 2021
Built by: sonicbld@new-worker-3

Platform: x86_64-arista_7060_cx32s
HwSKU: Arista-7060CX-32S-D48C8
ASIC: broadcom
ASIC Count: 1
Serial Number: SSJ17492705

Output of show techsupport:

(paste your output here or download and attach the file here )

Additional information you deem important (e.g. issue happens only occasionally):

2021-06-11 16:57:44,074 INFO spawned: 'syseepromd' with pid 37
2021-06-11 16:57:44,240 INFO spawned: 'thermalctld' with pid 38
2021-06-11 16:57:44,359 INFO spawned: 'pcied' with pid 39
2021-06-11 16:57:46,272 INFO exited: lm-sensors (exit status 0; expected)
2021-06-11 16:57:50,399 INFO exited: syseepromd (exit status 5; not expected)
2021-06-11 16:57:51,403 INFO spawned: 'syseepromd' with pid 48
2021-06-11 16:57:52,044 INFO exited: syseepromd (exit status 5; not expected)
2021-06-11 16:57:54,125 INFO spawned: 'syseepromd' with pid 49
2021-06-11 16:57:54,558 INFO success: thermalctld entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2021-06-11 16:57:54,559 INFO success: pcied entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2021-06-11 16:57:54,559 INFO exited: syseepromd (exit status 5; not expected)
2021-06-11 16:57:57,782 INFO spawned: 'syseepromd' with pid 50
2021-06-11 16:57:58,307 INFO exited: syseepromd (exit status 5; not expected)
2021-06-11 16:57:58,308 INFO gave up: syseepromd entered FATAL state, too many start retries too quickly
@lguohan
Collaborator

lguohan commented Jul 28, 2021

@sujinmkang, can you track this Arista issue?

@Staphylo
Collaborator

Just saw this issue, I'm looking into it.

@Staphylo
Collaborator

The exit code of 5 means ERR_EEPROM_LOAD.
This happens because on 202012 we do not initialize ChassisBase._eeprom, which is what ChassisBase.get_eeprom() returns.
This issue was fixed on master by making a few changes in our platform library but not on 202012.
I was already planning to merge a few fixes in 202012 this week or next week, so I'll also cherry-pick what is needed to fix this issue.
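
For readers following along, here is a minimal sketch of the failure mode described above. It is not the actual syseepromd or Arista platform code: the classes are simplified stand-ins, `FakeEeprom`, `FixedChassis`, and `load_eeprom` are hypothetical names used only for illustration, and the daemon's real control flow is an assumption. Only the exit code value 5 (ERR_EEPROM_LOAD) and the `_eeprom` / `get_eeprom()` names come from this thread.

```python
ERR_EEPROM_LOAD = 5  # exit status seen in the supervisord logs


class ChassisBase:
    """Simplified stand-in for the platform API base class."""

    def __init__(self):
        # The base class leaves the attribute unset; a platform
        # implementation is expected to assign a real EEPROM object.
        self._eeprom = None

    def get_eeprom(self):
        return self._eeprom


class FakeEeprom:
    """Hypothetical EEPROM object, for illustration only."""

    def read_eeprom(self):
        return b"\x00"


class FixedChassis(ChassisBase):
    """Hypothetical platform chassis that initializes _eeprom."""

    def __init__(self):
        super().__init__()
        self._eeprom = FakeEeprom()


def load_eeprom(chassis):
    """Return 0 on success, ERR_EEPROM_LOAD if no EEPROM object is available.

    Assumed behaviour: the daemon exits with the returned status.
    """
    eeprom = chassis.get_eeprom()
    if eeprom is None:
        return ERR_EEPROM_LOAD
    eeprom.read_eeprom()
    return 0


if __name__ == "__main__":
    print(load_eeprom(ChassisBase()))   # chassis without an initialized _eeprom
    print(load_eeprom(FixedChassis()))  # chassis with _eeprom assigned
```

With a chassis that never assigns `_eeprom`, every restart fails the same way, which would be consistent with supervisord giving up after repeated quick retries.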

Arista-Jenkins pushed a commit to aristanetworks/sonic that referenced this issue Jul 30, 2021
This will prevent syseepromd from crashing as tracked by
sonic-net/sonic-buildimage#8263

Change-Id: Ic269d3e8e76e463e07c8d5477b7fb6fda08f347b
sujinmkang pushed a commit that referenced this issue Aug 6, 2021
This PR only contains backports from master

Fix leak discovered on master; 202012 is not affected, but it's better to have the fix (fixes [master] thermalctld leak on Arista devices makes them unreachable when memory is exhausted #7515)
Fix EepromDecoder implementation in the platform API (fixes syseepromd crashing repeatedly on SONiC.20201231.02 #8263)
Fix Mineral platform definition and configuration
Fix build issues in environments where /proc is not mounted/restricted (fixes PLATFORM=broadcom fails arista "ReloadCauseManagerTest" first time #7800)
Fix some pytest issues
Add sfp-eeprom C API and also mount it in pmon
@sujinmkang sujinmkang linked a pull request Aug 6, 2021 that will close this issue