-
Notifications
You must be signed in to change notification settings - Fork 190
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(baremetal_validator): Add Validator for Process, Container Metrics #1878
base: main
Are you sure you want to change the base?
feat(baremetal_validator): Add Validator for Process, Container Metrics #1878
Conversation
…etal Metrics This commit includes validation setups for Process and Container RAPL related power metrics (node, process, container). Signed-off-by: Kaiyi <[email protected]>
🤖 SeineSailor Here's a concise summary of the pull request changes: Summary: This pull request introduces a baremetal validator for process and container RAPL-related power metrics, adding a new Bash script and several Python classes for stress testing. Key modifications include:
Impact: These changes may affect the external interface and behavior of the code, particularly in the areas of stress testing and configuration data handling. Observations/Suggestions:
Overall, this pull request introduces significant new functionality for baremetal validations, but it's essential to thoroughly review and test these changes to ensure they do not introduce unintended consequences or break existing functionality. |
Added bload to load configuration classes for process, container, and node level baremetal validation. Signed-off-by: Kaiyi <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't we modify the existing stressor script instead of adding new ones?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did not want to modify the other scripts because it might break the metal ci. I think this script is the best to replace the other stressor scripts (as it enables a custom load curve). A future PR should be made for modifying validator to have just one goto stressor script.
node = Local( | ||
load_curve=node_config.get("load_curve", default_config["load_curve"]), | ||
iterations=node_config.get("iterations", default_config["iterations"]), | ||
mount_dir=os.path.expanduser(node_config.get("mount_dir", default_config["mount_dir"])) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to mount the directory? Also how are you planning to deploy Kepler on BM?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently, kepler is deployed with docker compose for simplicity. however, this should still work the same even in a kubernetes environment. The mount directory is specifically to get the start and end times of the stressor script as a log file from the container. In a non container, the log file will be saved at the mount directory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we target a single environment for now and focus on expanding it further? Currently, we use docker compose for our development and validations. Let's stick with that for now and afterwards add support for k8s
Wdyt?
process_config = {} | ||
process = LocalProcess( | ||
isolated_cpu=process_config.get("isolated_cpu", default_config["isolated_cpu"]), | ||
load_curve=process_config.get("load_curve", default_config["load_curve"]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the difference in the load curve between the process and the container?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No difference. the logic is the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then why we require separate configs for both?
> "$TIME_INTERVAL_LOG" | ||
|
||
start_time=$(date +%s) | ||
echo "Stress Start Time: $start_time" >> "$TIME_INTERVAL_LOG" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAIK we do something similar in our existing validator source by checking the start and end time of the stressor.sh
script. I think we can do similar in this case as well instead of computing start and end time in the bash script itself
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason this is necessary is primarily for container validation and in future if we attempt pods (which should fall under container still anyway). Container setup requires an image installation (in this case, fedora and stressng need to be installed). If we get start and end time outside we will include the time for installing which will skew the results. The easiest solution I can think of was to actually compute it in the bash script so we do not include the installation time.
Added process, container, node/local validation to the cli. Signed-off-by: Kaiyi <[email protected]>
Added relevant config files for baremetal validation including formatted prom validation metrics and baremetal configuration. Signed-off-by: Kaiyi <[email protected]>
Added most important component rapl metrics to validate on bm Signed-off-by: Kaiyi Liu <[email protected]>
|
||
config: # default settings | ||
isolated_cpu: "15" # Logical processor that is fully isolated from scheduler (ex. isolcpus) | ||
load_curve: "0:15,10:20,25:20,50:20,75:20,100:30,75:20,50:20,25:20,10:20,0:15" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can't we add a new load curve in the existing stressor and not make it part of the config as it overcomplicated the config part How's this different from what default load curve gives us?
This commit includes validation setups for Process and Container RAPL related Baremetal power metrics (node, process, container).