-
Notifications
You must be signed in to change notification settings - Fork 180
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new platform management API. Currently with support for chassis, PSUs, fans and watchdog #13
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks pretty good overall.
There will be some overlap with existing drivers and errors codes / exceptions could be defined.
For example if a power supply has its fans in read only mode, what should be the result of set_fan_speed.
return None | ||
|
||
@abc.abstractmethod | ||
def get_reboot_cause(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reboot cause can be persistent on some product and have to be explicitly cleared.
Otherwise reading this information after a reboot will show all the reload causes that happened previous times.
Some platform might also have limitation on the number of reboot causes that can be stored.
I would suggest either a default parameter clear=False
to this method or to add a new one called clear_reboot_cause
.
Also I think this function should only be called once with clear=True
and the result cached somewhere in a tmpfs. This means that future asks for reload cause information would go through the cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reboot cause can be persistent on some product and have to be explicitly cleared.
Otherwise reading this information after a reboot will show all the reload causes that happened previous times.I would suggest either a default parameter
clear=False
to this method or to add a new one calledclear_reboot_cause
.
Also I think this function should only be called once withclear=True
and the result cached somewhere in a tmpfs. This means that future asks for reload cause information would go through the cache.
Since the need to clear the reboot cause after reading is platform-specific functionality, I believe this logic should be left to the platform-specific implementation of the function. If the platform requires clearing the reboot cause, it should be done in the implementation and a file saved to tmpfs. The function can check for the presence of the file first before attempting to read/clear the register. I prefer this to adding a clear=True
parameter to the base method, and I think adding a clear_reboot_cause()
method is unnecessary, because there should be no need to explicitly clear the reboot cause.
Some platform might also have limitation on the number of reboot causes that can be stored.
If you're concerned that not all platforms will support all reboot causes, that is expected. The base class will define all possible reboot causes that SONiC understands; the platform-specific implementation will simply return the subset which is applicable to that platform. If this doesn't address your concern, please elaborate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reboot cause can be persistent on some product and have to be explicitly cleared.
Otherwise reading this information after a reboot will show all the reload causes that happened previous times.
I would suggest either a default parameterclear=False
to this method or to add a new one calledclear_reboot_cause
.
Also I think this function should only be called once withclear=True
and the result cached somewhere in a tmpfs. This means that future asks for reload cause information would go through the cache.Since the need to clear the reboot cause after reading is platform-specific functionality, I believe this logic should be left to the platform-specific implementation of the function. If the platform requires clearing the reboot cause, it should be done in the implementation and a file saved to tmpfs. The function can check for the presence of the file first before attempting to read/clear the register. I prefer this to adding a
clear=True
parameter to the base method, and I think adding aclear_reboot_cause()
method is unnecessary, because there should be no need to explicitly clear the reboot cause.
I would assume all the platforms to behave the same way with regard to the reboot causes. However this information would have to be crowd-sourced from other vendors as well for more relevant data.
I'm fine with letting the responsibility of clearing and caching the reboot cause to the plugin implementation.
Is there a place in SONiC where such cache files should go? Currently almost everything is backed by the flash including /tmp
. The only place that is a tmpfs
and would feel "suitable" seems to be /run
. Is there some guidelines somewhere regarding this? I would be interested to know the reason why /tmp
is not a tmpfs
.
Some platform might also have limitation on the number of reboot causes that can be stored.
If you're concerned that not all platforms will support all reboot causes, that is expected. The base class will define all possible reboot causes that SONiC understands; the platform-specific implementation will simply return the subset which is applicable to that platform. If this doesn't address your concern, please elaborate.
My concern was a bit different. Your previous answer ended up answered my question but I'll elaborate a bit more.
Basically the concept of a reboot cause is to be stored on an external device until being read.
These devices usually have space restriction and can only save a given number of reboot cause after which they won't store anything until being cleared. For example if your device can store 2 reboot cause and you trigger 2 watchdogs and then 1 powerloss, you might not see the powerloss if you haven't cleared the reboot cause after each reboot.
return 0 | ||
|
||
@abc.abstractmethod | ||
def get_speed_tolerance(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming this is only for the fan daemon logic usage?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is the initial intention. We may find other uses in the future, however. Is there a concern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No concerns. I'm was just curious if the fand was in the roadmap.
sonic_platform_base/fan_base.py
Outdated
""" | ||
return 0 | ||
|
||
@abc.abstractmethod |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this method have to be abstract or a default implementation that is always overridable would do?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming tolerance is most commonly implemented as a static attribute, we could define a tolerance
member and have that returned here as the default implementation. Do you think that would suffice for most implementations?
Sets the fan speed | ||
|
||
Args: | ||
speed: An integer, the percentage of full fan speed to set fan to, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see everywhere that the fan speed is an int from 0 to 100.
Some device dealing with fans can have finer grained control (like 0 to 255) which means loosing a bit of precision.
Was this taken into consideration and integer deemed better than floats?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you prefer a 0.0 ... 1.0 float scale? This should allow for finer-grained control if necessary, while maintaining a more universal scale.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't mind either way, integer percentages 0 to 100 sounds good to me.
I just wanted to bring this topic up as the sysfs interface defines a pwm value to be ranging from 0 to 255.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If all platforms used sysfs, the 0-255 scale would be fine. However, we'd like a universal scale that will work with other interfaces (IPMI, etc.).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have any strong preference. I just wanted to bring that topic to make sure it wasn't overlooked.
A percentage sounds perfect to me.
one general question, previously when add a new API to the base class, need to implement a stub to all sub class for each platform (as sonic-net/sonic-buildimage#2017 described), can this be avoided in the new implementation? |
…otImplementedError - We plan to implement a build-time or run-time check for all methods lacking platform-specific implemention and will output a list of them.
This has been address in: bd6b6ed. In lieu of an abstract base class, we will implement either a build-time or run-time check that will output all methods which are not implemented in order to alert vendors to methods they may have missed. |
…menting fan methods
# watchdog_base.py | ||
# | ||
# Abstract base class for implementing a platform-specific class with which | ||
# to interact with a hardware watchdog module in SONiC |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how do we know if the platform has such capability or not? do we have query API for this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid the need for additional functions with which to query device/capability availability, the current design is such that the querying is done by getting the device object. If the object is null (None), the device/capability is not available on the platform. For example:
#!/usr/bin/env python
import sonic_platform
platform = sonic_platform.platform.Platform()
chassis = platform.get_chassis()
if not chassis:
print("Error getting chassis!")
sys.exit(1)
watchdog = chassis.get_watchdog()
if not watchdog:
print("Watchdog not present!")
else:
watchdog.arm(10)
… device and will not have serial number, etc.
status='1' represents device inserted, | ||
status='0' represents device removed. Ex. {'0': '1', '1': '0'} | ||
""" | ||
raise NotImplementedError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wanted to get confirmation that if a platform takes a Baseboard Management Controller (BMC) based approach (to handle thermals, events etc), these (and the other such APIs) can remain unimplemented..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The intention here is for daemons to receive notifications of changes in platform hardware state in order to update SONiC's view of device state. For instance, if a tranceiver, PSU or fan tray is added or removed. Can you provide an example of an event handled by the BMC which SONiC would not need to be made aware of?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that the events referred to is specifically to track the presence/absence i..e OIR of the modules (essentially this API seems to be a generalization of the get_transceiver_change_event()).
If the BMC implements the thermals, an addition/removal of FAN tray would be completely handled by the BMC itself - but atleast for generating a syslog this API needs to be implemented for SONiC..
New platform API is installed as a Python package, "sonic_platform_base" which consists of abstract base classes for each type of platform peripheral device
New API is object oriented and is structured in a hierarchy based on the relationship of devices. Root class is "platform" from which one can interact with all available chassis on the platform. From each chassis object, one can interact with all available fan modules, PSUs, SFP transceivers, etc. available on that chassis, etc.
Platform vendors will create a concrete implementation of this package, resulting in a platform-specific "sonic_platform" package, one per platform. The method of implementation will be very similar to the current "plugin" based model, but instead of creating individual plugins, the vendor will create an entire Python package which will be loaded by applications.
This commit provides abstract base classes for platform, chassis, fan modules and PSUs. Classes for other devices will be added incrementally. This package will get installed alongside the old platform API packages until it is complete and all platforms have implemented it. At that point in time, we can remove the old packages.