Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(baremetal_validator): Add Validator for Process, Container Metrics #1878

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

KaiyiLiu1234
Copy link
Collaborator

This commit includes validation setups for Process and Container RAPL related Baremetal power metrics (node, process, container).

…etal Metrics

This commit includes validation setups for Process and Container RAPL related power metrics
(node, process, container).

Signed-off-by: Kaiyi <[email protected]>
@KaiyiLiu1234 KaiyiLiu1234 marked this pull request as draft December 3, 2024 00:46
Copy link
Contributor

github-actions bot commented Dec 3, 2024

🤖 SeineSailor

Here's a concise summary of the pull request changes:

Summary: This pull request introduces a baremetal validator for process and container RAPL-related power metrics, adding a new Bash script and several Python classes for stress testing. Key modifications include:

  • New baremetal_validator script and Python classes (ProcessOutput, ContainerOutput, StresserError, Local, Process, and Container) for process and container stress testing.
  • Introduction of new named tuples (BMValidator, Local, LocalProcess, LocalContainer, LocalPrometheus) for configuration data.
  • Changes to the validate_acpi command to accept a duration argument and addition of a new regression command to the bm_validator group.
  • Updates to the bload function to parse and initialize new fields from the YAML configuration file, with an updated configuration file format for baremetal validations.

Impact: These changes may affect the external interface and behavior of the code, particularly in the areas of stress testing and configuration data handling.

Observations/Suggestions:

  • It would be beneficial to include unit tests for the new Python classes and Bash script to ensure their correctness and robustness.
  • Consider adding documentation for the new regression command and its usage.
  • The updated configuration file format may require changes to existing integrations or scripts that rely on the old format; ensure that these changes are properly communicated and addressed.

Overall, this pull request introduces significant new functionality for baremetal validations, but it's essential to thoroughly review and test these changes to ensure they do not introduce unintended consequences or break existing functionality.

Added bload to load configuration classes for process, container,
and node level baremetal validation.

Signed-off-by: Kaiyi <[email protected]>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't we modify the existing stressor script instead of adding new ones?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not want to modify the other scripts because it might break the metal ci. I think this script is the best to replace the other stressor scripts (as it enables a custom load curve). A future PR should be made for modifying validator to have just one goto stressor script.

node = Local(
load_curve=node_config.get("load_curve", default_config["load_curve"]),
iterations=node_config.get("iterations", default_config["iterations"]),
mount_dir=os.path.expanduser(node_config.get("mount_dir", default_config["mount_dir"]))
Copy link
Collaborator

@vprashar2929 vprashar2929 Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to mount the directory? Also how are you planning to deploy Kepler on BM?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently, kepler is deployed with docker compose for simplicity. however, this should still work the same even in a kubernetes environment. The mount directory is specifically to get the start and end times of the stressor script as a log file from the container. In a non container, the log file will be saved at the mount directory.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we target a single environment for now and focus on expanding it further? Currently, we use docker compose for our development and validations. Let's stick with that for now and afterwards add support for k8s
Wdyt?

process_config = {}
process = LocalProcess(
isolated_cpu=process_config.get("isolated_cpu", default_config["isolated_cpu"]),
load_curve=process_config.get("load_curve", default_config["load_curve"]),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference in the load curve between the process and the container?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No difference. the logic is the same

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then why we require separate configs for both?

> "$TIME_INTERVAL_LOG"

start_time=$(date +%s)
echo "Stress Start Time: $start_time" >> "$TIME_INTERVAL_LOG"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIK we do something similar in our existing validator source by checking the start and end time of the stressor.sh script. I think we can do similar in this case as well instead of computing start and end time in the bash script itself

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason this is necessary is primarily for container validation and in future if we attempt pods (which should fall under container still anyway). Container setup requires an image installation (in this case, fedora and stressng need to be installed). If we get start and end time outside we will include the time for installing which will skew the results. The easiest solution I can think of was to actually compute it in the bash script so we do not include the installation time.

KaiyiLiu1234 and others added 3 commits December 9, 2024 22:11
Added process, container, node/local validation to the cli.

Signed-off-by: Kaiyi <[email protected]>
Added relevant config files for baremetal validation including
formatted prom validation metrics and baremetal configuration.

Signed-off-by: Kaiyi <[email protected]>
Added most important component rapl metrics to validate on bm

Signed-off-by: Kaiyi Liu <[email protected]>
@KaiyiLiu1234 KaiyiLiu1234 marked this pull request as ready for review December 16, 2024 04:04

config: # default settings
isolated_cpu: "15" # Logical processor that is fully isolated from scheduler (ex. isolcpus)
load_curve: "0:15,10:20,25:20,50:20,75:20,100:30,75:20,50:20,25:20,10:20,0:15"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can't we add a new load curve in the existing stressor and not make it part of the config as it overcomplicated the config part How's this different from what default load curve gives us?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants