
[testbed] Include CPU and memory limits in benchmark results file #36753

Open
wants to merge 14 commits into main
Conversation

bacherfl
Contributor

Description

This PR adds the configured CPU and memory limits to the benchmark results file, so that these values can also be included in a time-series chart next to the max and average consumption.

For now, this is disabled by default in the PerfTestValidator, so as not to introduce unwanted changes to the charts generated from tests that use this component.
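
For illustration, a test scenario could opt in to the new behavior roughly as follows. This is a minimal sketch: only the PerfTestValidator type and its new IncludeLimitsInReport field come from this PR; the helper name, import path, and surrounding wiring are assumptions.

package example

import (
	"github.com/open-telemetry/opentelemetry-collector-contrib/testbed/testbed"
)

// newLimitAwareValidator returns a validator that also reports the configured
// CPU and memory limits in the benchmark results. Only IncludeLimitsInReport
// is introduced by this PR; the helper itself is hypothetical.
func newLimitAwareValidator() *testbed.PerfTestValidator {
	return &testbed.PerfTestValidator{
		IncludeLimitsInReport: true,
	}
}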

Link to tracking issue

Fixes #36720

@cforce

cforce commented Dec 11, 2024

@bacherfl Where can I find the results for the test execution, including the CPU metrics?

@bacherfl
Contributor Author

@bacherfl Where can I find the results for the test execution, including the CPU metrics?

It seems like the load tests are not executed on PRs, but only on main. But an example of where they would show up is in the overview at the end of the test execution, e.g. here: https://github.com/open-telemetry/opentelemetry-collector-contrib/actions/runs/12293184933/job/34306235893#step:10:171
Also, the testbed generates a benchmarks.json file that contains those values, which looks like the following:

[
  {
    "name": "cpu_percentage_avg",
    "unit": "%",
    "value": 6.7290864162868,
    "extra": "Metric10kDPS/OTLP - Cpu Percentage"
  },
  {
    "name": "cpu_percentage_max",
    "unit": "%",
    "value": 7.626871029271405,
    "extra": "Metric10kDPS/OTLP - Cpu Percentage"
  },
  {
    "name": "cpu_percentage_limit",
    "unit": "%",
    "value": 60,
    "extra": "Metric10kDPS/OTLP - Cpu Percentage"
  },
  {
    "name": "ram_mib_avg",
    "unit": "MiB",
    "value": 69,
    "extra": "Metric10kDPS/OTLP - RAM (MiB)"
  },
  {
    "name": "ram_mib_max",
    "unit": "MiB",
    "value": 98,
    "extra": "Metric10kDPS/OTLP - RAM (MiB)"
  },
  {
    "name": "ram_mib_limit",
    "unit": "MiB",
    "value": 125,
    "extra": "Metric10kDPS/OTLP - RAM (MiB)"
  },
  {
    "name": "dropped_span_count",
    "unit": "spans",
    "value": 0,
    "extra": "Metric10kDPS/OTLP - Dropped Span Count"
  }
]

The entries for cpu_percentage_limit and ram_mib_limit would be added with this PR.

@bacherfl bacherfl marked this pull request as ready for review December 13, 2024 06:05
@bacherfl bacherfl requested a review from a team as a code owner December 13, 2024 06:05
@bacherfl bacherfl requested a review from ChrsMark December 13, 2024 06:05
Contributor

github-actions bot commented Jan 3, 2025

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Jan 3, 2025
@bacherfl bacherfl removed the Stale label Jan 7, 2025
@cforce

cforce commented Jan 13, 2025

The entries for cpu_percentage_limit and ram_mib_limit would be added with this PR.

It would be great if these results, as part of the CI run, were plotted as a time-series graph covering at least 3 releases (several weeks of builds), so it's easy to visualize the trend.

e.g. in Python:

import json
import os
import matplotlib.pyplot as plt
import pandas as pd
from datetime import datetime, timedelta

# Directory containing benchmark JSON files
directory = "benchmarks"

# Initialize an empty list to store data from all files
all_data = []
start_time = datetime.now()

# Read all JSON files in the directory
for idx, filename in enumerate(sorted(os.listdir(directory))):  # sort for a deterministic file order
    if filename.endswith(".json"):  # Only process JSON files
        filepath = os.path.join(directory, filename)
        with open(filepath, "r") as file:
            metrics = json.load(file)
            # Add a timestamp for each file
            timestamps = [start_time + timedelta(minutes=idx * 10 + i) for i in range(len(metrics))]
            for metric, timestamp in zip(metrics, timestamps):
                metric["timestamp"] = timestamp
            all_data.extend(metrics)

# Create a DataFrame from the aggregated data
df = pd.DataFrame(all_data)

# Plotting
plt.figure(figsize=(12, 8))

# Group and plot by metric name so avg, max, and limit values form separate series
for name in df['name'].unique():
    subset = df[df['name'] == name].sort_values('timestamp')
    plt.plot(subset['timestamp'], subset['value'], marker='o', label=name)

# Customize plot
plt.title("Time Series of otel benchmark results")
plt.xlabel("Timestamp")
plt.ylabel("Metric Value")
plt.legend()
plt.grid(True)
plt.xticks(rotation=45)
plt.tight_layout()

# Save plot as an image
output_image = "time_series_graph.png"
plt.savefig(output_image, dpi=300)  # Save with high resolution
plt.close()  # Close the plot to free memory

print(f"Graph has been saved as '{output_image}'.")

Member

@ChrsMark ChrsMark left a comment


The change looks good! However, I don't fully follow what the end goal of this addition is. Could we clarify that?

-type PerfTestValidator struct{}
+type PerfTestValidator struct {
+	// IncludeLimitsInReport determines whether the summary generated by PerformanceResults should include the specified resource limits
+	IncludeLimitsInReport bool
+}
Member


Is there a plan to actually use this at some point for the current tests? Or is it for future ones?

@bacherfl
Contributor Author

The change looks good! However, I don't fully follow what the end goal of this addition is. Could we clarify that?

Sure! In our case at Dynatrace, we are publishing the load test results for our collector distro to a page to keep track of the memory and cpu consumption: https://dynatrace.github.io/dynatrace-otel-collector/benchmarks/loadtests/
Currently, these charts only contain the actual consumption, not the limits. By adding this feature, it becomes easier to see how close the consumption is to the limit, and when we had to increase or decrease the limits, e.g. after a new release.

This might also be useful for the benchmark charts published on the OTel docs page, but to avoid suddenly changing the graphs there by adding those new values, I opted to keep the reporting of those disabled by default via the IncludeLimitsInReport option.

Development

Successfully merging this pull request may close these issues.

[testbed] Add resource limits to benchmark results file
5 participants