Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[stormond] Added new dynamic field 'last_sync_time' to STATE_DB #535

Merged
merged 7 commits into from
Nov 27, 2024

Conversation

assrinivasan
Copy link
Contributor

@assrinivasan assrinivasan commented Aug 24, 2024

Description

Currently the Storage monitoring Daemon does not provide any information on when it last synced attributes of each storage device onto the database. This PR adds new dynamic field called last_sync_time that shows when STORAGE_INFO fields for each disk was last synced to STATE_DB. The value is in Unix timestamp format.

HLD Change PR: sonic-net/SONiC#1784

Motivation and Context

The motivation is that this field would provide additional context in case of repeated daemon crashes scenario ending up with the daemon status = FATAL Exited too quickly (process log may have details).

How Has This Been Tested?

Manually tested my changes on a DUT:

  1. Let the daemon run for 10 minutes and then stopped it to sim a FATAL scenario:
  2. Let some time pass:
root@sonic-device:/usr/local/bin# date +%s
1724487680

  1. Checked STATE_DB for when it was last updated and verified match with logs:
2024 Aug 24 07:55:50.392206 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Starting Storage Monitoring Daemon
2024 Aug 24 07:55:50.437050 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Storage Device: nvme0n1, Device Model: ATP AF240GSTJA-AW1, Serial: 23090240-000257
2024 Aug 24 07:55:50.437789 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Polling Interval set to 60 seconds
2024 Aug 24 07:55:50.437789 str3-7060x6-64pe-2 INFO pmon#stormond[906]: FSIO JSON file Interval set to 360 seconds
2024 Aug 24 07:55:50.481670 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Storage Device: nvme0n1, Firmware: 42A4SB6G, health: 100.0%, Temp: 27.0C, FS IO Reads: 95432, FS IO Writes: 18860
2024 Aug 24 07:55:50.481837 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Latest FSIO Reads: 95432, Latest FSIO Writes: 18860
2024 Aug 24 07:55:50.481837 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Disk IO Reads: 3,657,918 [1.87 TB], Disk IO Writes: 1,155,355 [591 GB], Reserved Blocks: 100.0
**2024 Aug 24 07:55:50.481837 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Last successful sync time to STATE_DB: 1724486150.4809144**
2024 Aug 24 07:56:00.045910 str3-7060x6-64pe-2 INFO pmon#supervisord 2024-08-24 07:56:00,044 INFO success: stormond entered RUNNING state, process has stayed up for > than 10 seconds (startsecs)
2024 Aug 24 07:56:50.483157 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Polling Interval set to 60 seconds
2024 Aug 24 07:56:50.483157 str3-7060x6-64pe-2 INFO pmon#stormond[906]: FSIO JSON file Interval set to 360 seconds
2024 Aug 24 07:56:50.531857 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Storage Device: nvme0n1, Firmware: 42A4SB6G, health: 100.0%, Temp: 27.0C, FS IO Reads: 95432, FS IO Writes: 18971
2024 Aug 24 07:56:50.531857 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Latest FSIO Reads: 95432, Latest FSIO Writes: 18971
2024 Aug 24 07:56:50.531857 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Disk IO Reads: 3,657,918 [1.87 TB], Disk IO Writes: 1,155,357 [591 GB], Reserved Blocks: 100.0
**2024 Aug 24 07:56:50.531956 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Last successful sync time to STATE_DB: 1724486210.5309312**
2024 Aug 24 07:57:41.743749 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Caught signal 'SIGTERM'
2024 Aug 24 07:57:41.743841 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Syncing total and latest procfs reads and writes from STATE_DB to JSON file
2024 Aug 24 07:57:41.743841 str3-7060x6-64pe-2 INFO pmon#supervisord 2024-08-24 07:57:41,743 INFO waiting for stormond to stop
2024 Aug 24 07:57:41.745303 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Exiting with SIGTERM
2024 Aug 24 07:57:41.745408 str3-7060x6-64pe-2 INFO pmon#stormond[906]: Shutting down Storage Monitoring Daemon
2024 Aug 24 07:57:41.832795 str3-7060x6-64pe-2 INFO pmon#supervisord 2024-08-24 07:57:41,832 WARN stopped: stormond (exit status 143)

root@sonic-device:/usr/local/bin# redis-cli -n 6 HGETALL "STORAGE_INFO|nvme0n1"
 1) "device_model"
 2) "ATP AF240GSTJA-AW1"
 3) "serial"
 4) "23090240-000257"
 5) "firmware"
 6) "42A4SB6G"
 7) "health"
 8) "100.0"
 9) "temperature"
10) "27.0"
11) "latest_fsio_reads"
12) "95432"
13) "latest_fsio_writes"
14) "18971"
15) "disk_io_reads"
16) "3,657,918 [1.87 TB]"
17) "disk_io_writes"
18) "1,155,357 [591 GB]"
19) "reserved_blocks"
20) "100.0"
21) "total_fsio_reads"
22) "95432"
23) "total_fsio_writes"
24) "18971"
25) "last_sync_time"
26) "1724486210.5309312"

Thus we can see that the storage attributes for the nvme0n1 disk have not been updated in (1724487680 - 1724486210.5309312) = 1469 seconds ~= 25 minutes.

@assrinivasan
Copy link
Contributor Author

/azpw run coverage.Azure.sonic-platform-daemons.build_test

@mssonicbld
Copy link
Collaborator

/AzurePipelines run coverage.Azure.sonic-platform-daemons.build_test

Copy link

No pipelines are associated with this pull request.

@assrinivasan
Copy link
Contributor Author

assrinivasan commented Nov 21, 2024

@prgeor @saiarcot895 failure appears to be a false flag? Code Coverage for scripts/stormond is at 83% as of the latest commit:

assrinivasan@dev-vm:sonic-buildimage/src/sonic-platform-daemons$ git log -1 --oneline d5c5969 (HEAD -> stormon-add-last-sync-time, origin/stormon-add-last-sync-time) Increased log level so that they are seen in syslogs

image

Am I missing something? EDIT: I was indeed missing something. Increased code coverage in latest commit.

@assrinivasan assrinivasan requested a review from prgeor November 21, 2024 21:19
@prgeor prgeor merged commit 3624cb7 into sonic-net:master Nov 27, 2024
5 checks passed
vvolam pushed a commit to vvolam/sonic-platform-daemons that referenced this pull request Jan 3, 2025
…c-net#535)

* Added new dynamic field 'last_sync_time' that shows when STORAGE_INFO for disk was last synced to STATE_DB

* Moved 'start' message to actual starting point of the daemon

* Added functions for formatted and epoch time for user friendly time display

* Made changes per prgeor review comments

* Pivot to SysLogger for all logging

* Increased log level so that they are seen in syslogs

* Code coverage improvement
prgeor pushed a commit that referenced this pull request Feb 6, 2025
…evice is in detaching mode (#546)

* Skip logging the warning, if device is in detaching mode

* Add detach_info table and unittests

* Fix unit tests

* Increase code coverage

* Remove unused header import

* Fix dict get values

* Increase code coverage

* Increase test coverage

* [SmartSwitch] Extend implementation of the DPU chassis daemon. (#563)

* Addition of DPU Chassis for thermalctld (#564)

* [stormond] Added new dynamic field 'last_sync_time' to STATE_DB (#535)

* Added new dynamic field 'last_sync_time' that shows when STORAGE_INFO for disk was last synced to STATE_DB

* Moved 'start' message to actual starting point of the daemon

* Added functions for formatted and epoch time for user friendly time display

* Made changes per prgeor review comments

* Pivot to SysLogger for all logging

* Increased log level so that they are seen in syslogs

* Code coverage improvement

* [lag_id] Add lagid to free_list when LC absent for 30 minutes (#542)

When LC is absent for 30 minutes, the database cleanup kicks in. When LagId is released, it needs to be appended to the SYSTEM_LAG_IDS_FREE_LIST

This PR works with the following 2 PRs:
sonic-net/sonic-swss#3303
sonic-net/sonic-buildimage#20369

Signed-off-by: mlok <[email protected]>

* Fixed bug in chassisd causing incorrect number of ASICs in CHASSIS_STATE_DB (#560)

Fixed the bug in chassisd due to which incorrect number of ASICs were being pushed to CHASSIS_STATE_DB.

* thermalctld: Add support for fans on non-CPU modules (#555)

* thermalctld: Add support for fans on non-CPU modules

* Add module fan to unit tests

* Advanced Azure pipeline to Bookworm (#572)

Description
This PR advances the azure pipeline on sonic_platform_daemons from bullseye to bookworm. This fixes the issue where sonic-platform-daemons azp is having some issues due to upgrade to bookworm. See Pipelines - Run 20241210.8 logs for details.

* Take non-CMIS xcvrs out of lpmode in SFF Manager (#565)

Description
Fix non-CMIS transceivers in down state by bringing them out of low power mode in the SFF Manager Task.
This is intended to work together with the change in sonic-net/sonic-buildimage#20886.

Motivation and Context
Non-CMIS transceivers were not functioning correctly when put into Low Power mode. So XCVRD now brings them out of lpmode.

How Has This Been Tested?
Loaded an image containing this change alongside the change from sonic-net/sonic-buildimage#20886 on an Arista chassis containing a Clearwater2 linecard.
Verified that without this image some interfaces were in a down state but with the image all interfaces came up as expected.

* Added SmartSwitch support in chassisd and enabling chassisd  (#467)

Added SmartSwitch support in chassisd and enabling chassisd

* [chassis][psud] Move the PSU parent information generation to the loop run function from the initialization function (#576)

Description
Move the PSU parent information generation to the loop run function from the initialization function

Motivation and Context
Fixes #575

How Has This Been Tested?
Tested on Cisco chassis, the PHYSICAL_ENTITY_INFO|PSU * can be re-inserted after thermalctld restart.
And monitored the stated db for memory for hours, works well:

* [chassisd] Address the chassisd crash issue and add UT for it (#573)

Description
On Nokia platform, slot name of Supervisor is string "A" instead of a number. Using "int" to convert it could cause issue backtrace. We should use slot value to any checking without any conversion. This will fixes sonic-net/sonic-buildimage#21131

Motivation and Context
Modify the _get_module_info not to convert "slot" to a string value. And also modify the code not to convert slot value to an to do any checking. Just directly use the returned value of get_slot(). Also add UT test_moduleupdater_check_slot_string() to valid it.

How Has This Been Tested?
Tested on 202405 branch


Signed-off-by: mlok <[email protected]>

* Fix a comment

---------

Signed-off-by: mlok <[email protected]>
Co-authored-by: Oleksandr Ivantsiv <[email protected]>
Co-authored-by: Gagan Punathil Ellath <[email protected]>
Co-authored-by: Ashwin Srinivasan <[email protected]>
Co-authored-by: Marty Y. Lok <[email protected]>
Co-authored-by: Vivek Verma <[email protected]>
Co-authored-by: Patrick MacArthur <[email protected]>
Co-authored-by: Peter Bailey <[email protected]>
Co-authored-by: rameshraghupathy <[email protected]>
Co-authored-by: Jianquan Ye <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants