Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AWS] Fix ec2 host.cpu.usage + include instance metadata and rate metrics for linked monitoring accounts #35717

Merged
merged 12 commits into from
Jun 19, 2023

Conversation

kaiyan-sheng
Copy link
Contributor

@kaiyan-sheng kaiyan-sheng commented Jun 8, 2023

What does this PR do?

This PR is fixing two bugs:

  1. calculating host.cpu.usage metric for EC2: apparently it is looking for the wrong metric name. It should be aws.ec2.metrics.CPUUtilization.avg.
  2. addHostFields and calculateRate functions are getting skipped when collecting metrics from linked accounts: because when adding metadata, we are making an additional API call to list all EC2 instances in that account and use that list/return value to add metadata. During this process, we only add metadata to instances from the monitoring account because DescribeInstances does not have access to linked source accounts. This case, we should move addHostFields and calculateRate functions to the beginning when adding meta data. This way, both monitoring account and linked source accounts instances will have host fields and rate. The actual metadata can be added later just for the instances in monitoring account.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Related issues

@kaiyan-sheng kaiyan-sheng requested a review from a team as a code owner June 8, 2023 01:48
@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Jun 8, 2023
@mergify
Copy link
Contributor

mergify bot commented Jun 8, 2023

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @kaiyan-sheng? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@elasticmachine
Copy link
Collaborator

elasticmachine commented Jun 8, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-06-15T19:17:46.561+0000

  • Duration: 74 min 10 sec

Test stats 🧪

Test Results
Failed 0
Passed 1490
Skipped 110
Total 1600

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@kaiyan-sheng kaiyan-sheng added Team:Cloud-Monitoring Label for the Cloud Monitoring team backport-v8.8.0 Automated backport with mergify labels Jun 8, 2023
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jun 8, 2023
@kaiyan-sheng kaiyan-sheng added needs_team Indicates that the issue/PR needs a Team:* label backport-v8.7.0 Automated backport with mergify labels Jun 8, 2023
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jun 8, 2023
@tommyers-elastic
Copy link
Contributor

thanks kaiyan! why does moving addHostFields and calculateRate to the top of the loop have an effect on linked account instances? the instances still wont appear in the list returned from getInstancesPerRegion?

@kaiyan-sheng
Copy link
Contributor Author

@tommyers-elastic WDYT about adding the value from dynamic label ${PROP('Period')} into aws.cloudwatch.period field? Is it worth keeping the value? I added it to help calculate the rate metrics for EC2.

@mergify
Copy link
Contributor

mergify bot commented Jun 14, 2023

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b fix_ec2_cpu upstream/fix_ec2_cpu
git merge upstream/main
git push upstream fix_ec2_cpu

}

// add instance ID from dimension value
dimInstanceID, _ := events[eventIdentifier].RootFields.GetValue("aws.dimensions.InstanceId")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does this always exist in the dimensions? should we check the error here just in case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah good point! Will add it!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did some testing on this, when aws.dimensions.InstanceId doesn't exist, dimInstanceID returns as nil. Then events[eventIdentifier].RootFields.Put("cloud.instance.id", dimInstanceID) will just put nil as the value for cloud.instance.id which won't show up in the event.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😬 that wouldn't be ideal - i wonder if we could fall back to the instance ID we from the loop var?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the error check here to make sure only put value to cloud.instance.id when aws.dimensions.InstanceId exists. We can't use instance ID from the inner for loop var because that for loop is going through all the instances metadata from DescribeInstances API. We are trying to use the two for loops to match events with the dimension InstanceId to the same ID from DescribeInstances.

@tommyers-elastic tommyers-elastic self-requested a review June 15, 2023 09:52
@tommyers-elastic
Copy link
Contributor

@kaiyan-sheng if the period isn't used for anything in kibana, i don't think we should add it. we can always add it later if we find a need.

since we now have period and account name as dyamic labels, do we have any idea if this adds any additional overhead to the getmetricdata calls?

@tommyers-elastic tommyers-elastic changed the title [AWS] Fix ec2 host.cpu.usage [AWS] Fix ec2 host.cpu.usage + include instance metadata and rate metrics for linked monitoring accounts Jun 15, 2023
@kaiyan-sheng
Copy link
Contributor Author

kaiyan-sheng commented Jun 15, 2023

since we now have period and account name as dyamic labels, do we have any idea if this adds any additional overhead to the getmetricdata calls?

@tommyers-elastic Good point! I made a bash script and ran the AWS GetMetricData CLI call several times for both with and without getting the dynamic labels. The average time spent on GetMetricData is about the same.

for example some data points look like this:

Average Time without dynamic labels: 1040 ms
Average Time with dynamic label accountID and period: 1033 ms

and

Average Time without dynamic labels: 1036 ms
Average Time with dynamic label accountID and period: 1043 ms

By looking at several sets of output, I think we can say adding the dynamic label doesn't add overhead on the API call.

@kaiyan-sheng kaiyan-sheng merged commit 36aa884 into elastic:main Jun 19, 2023
@kaiyan-sheng kaiyan-sheng deleted the fix_ec2_cpu branch June 19, 2023 14:16
mergify bot pushed a commit that referenced this pull request Jun 19, 2023
…rics for linked monitoring accounts (#35717)

(cherry picked from commit 36aa884)
mergify bot pushed a commit that referenced this pull request Jun 19, 2023
…rics for linked monitoring accounts (#35717)

(cherry picked from commit 36aa884)
kaiyan-sheng added a commit that referenced this pull request Jun 20, 2023
…e metadata and rate metrics for linked monitoring accounts (#35816)

* [AWS] Fix ec2 host.cpu.usage + include instance metadata and rate metrics for linked monitoring accounts (#35717)

(cherry picked from commit 36aa884)
---------

Co-authored-by: kaiyan-sheng <[email protected]>
kaiyan-sheng added a commit that referenced this pull request Jun 20, 2023
…e metadata and rate metrics for linked monitoring accounts (#35817)

* [AWS] Fix ec2 host.cpu.usage + include instance metadata and rate metrics for linked monitoring accounts (#35717)

(cherry picked from commit 36aa884)

Co-authored-by: kaiyan-sheng <[email protected]>
@reakaleek reakaleek mentioned this pull request Jul 19, 2023
6 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v8.7.0 Automated backport with mergify backport-v8.8.0 Automated backport with mergify Team:Cloud-Monitoring Label for the Cloud Monitoring team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants