Add metrics for ec2 api calls made by CNI and expose via prometheus #2142

Merged
merged 1 commit into from
Nov 21, 2022

@jaydeokar jaydeokar commented Nov 15, 2022

What type of PR is this? feature

Which issue does this PR fix: #2035

What does this PR do / Why do we need it:
We currently don't collect metrics on the number of calls made to EC2 APIs. This change adds the ability to track the number of EC2 API calls made by IPAMD. cni-metrics-helper is configured to collect these metrics and push them to CloudWatch.

If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:
N/A

Testing done on this change:
Yes, this has been tested on an EKS 1.23 cluster.

# HELP awscni_ec2api_req_count The number of requests made to EC2 APIs by CNI
# TYPE awscni_ec2api_req_count counter
awscni_ec2api_req_count{fn="AssignPrivateIpAddresses"} 1
awscni_ec2api_req_count{fn="AttachNetworkInterface"} 1
awscni_ec2api_req_count{fn="CreateNetworkInterface"} 1
awscni_ec2api_req_count{fn="DeleteNetworkInterface"} 1
awscni_ec2api_req_count{fn="DescribeInstances"} 1
awscni_ec2api_req_count{fn="DescribeNetworkInterfaces"} 13
awscni_ec2api_req_count{fn="DetachNetworkInterface"} 1
awscni_ec2api_req_count{fn="ModifyNetworkInterfaceAttribute"} 2
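
For readers unfamiliar with the output above, here is a minimal sketch of how a counter labeled by function name can produce lines in that shape. It uses only the Go standard library as a stand-in for a Prometheus counter vector; the type `callCounter` and its methods are hypothetical names for illustration and are not taken from this PR's diff. Only the metric name, help text, and `fn` label come from the output above.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// callCounter is a stdlib-only stand-in for a counter vector with a
// single "fn" label. Hypothetical type, not the code from this PR.
type callCounter struct {
	mu     sync.Mutex
	name   string
	help   string
	counts map[string]uint64
}

func newCallCounter(name, help string) *callCounter {
	return &callCounter{name: name, help: help, counts: make(map[string]uint64)}
}

// Inc records one call to the named EC2 API function.
func (c *callCounter) Inc(fn string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[fn]++
}

// Render returns the counter in the Prometheus text exposition layout,
// one line per label value, sorted by label value. Go's %q quoting is
// used for the label, which is adequate for plain API names.
func (c *callCounter) Render() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	fns := make([]string, 0, len(c.counts))
	for fn := range c.counts {
		fns = append(fns, fn)
	}
	sort.Strings(fns)

	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n", c.name, c.help)
	fmt.Fprintf(&b, "# TYPE %s counter\n", c.name)
	for _, fn := range fns {
		fmt.Fprintf(&b, "%s{fn=%q} %d\n", c.name, fn, c.counts[fn])
	}
	return b.String()
}

func main() {
	reqCount := newCallCounter("awscni_ec2api_req_count",
		"The number of requests made to EC2 APIs by CNI")
	reqCount.Inc("DescribeNetworkInterfaces")
	reqCount.Inc("DescribeNetworkInterfaces")
	reqCount.Inc("CreateNetworkInterface")
	fmt.Print(reqCount.Render())
}
```

Keying the count on the function name keeps a single metric for all EC2 calls, so a new API wrapper only needs a new label value rather than a new metric.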

Automation added to e2e:
N/A

Will this PR introduce any new dependencies?:
N/A

Will this break upgrades or downgrades. Has updating a running cluster been tested?:
No breaking changes. Yes, tested on an EKS 1.23 cluster.

Does this change require updates to the CNI daemonset config files to work?:
Yes, the image tags for the CNI and cni-metrics-helper images were updated.

Does this PR introduce any user-facing change?:
Yes

- CNI metrics helper can now report the number of EC2 API calls and the number of failed calls, and these can be visualized in CloudWatch

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@jaydeokar jaydeokar requested a review from a team as a code owner November 15, 2022 20:59
@jdn5126 jdn5126 self-assigned this Nov 15, 2022
@jdn5126 jdn5126 left a comment

Changes look good to me overall. Two high-level questions:

  1. How did you make sure that you did not miss any EC2 calls? Did you just grep on all of the EC2 wrapper functions?
  2. Do you think it makes sense to also count EC2 errors? I wonder if we should additionally count the errors per API call.

Also, can you run metrics-helper integration tests against these changes for coverage?

@jayanthvn
Good to add some UTs and integration tests around this.

@jdn5126
jdn5126 commented Nov 15, 2022

@jaydeokar
Added a unit test and an integration test for this change. Also added a new metric, ec2ApiErrCount, for tracking the EC2 API failure count.

awscni_ec2api_req_count{fn="AssignPrivateIpAddresses"} 1
awscni_ec2api_req_count{fn="AttachNetworkInterface"} 1
awscni_ec2api_req_count{fn="CreateNetworkInterface"} 1
awscni_ec2api_req_count{fn="DeleteNetworkInterface"} 1
Contributor

CreateTags should be listed here too, shouldn't it?

Contributor Author

It has been added to the tracked calls. It looks like that function had probably not been hit yet when I captured the output. Do we want to explicitly check for that scenario?

Contributor

Nvm, we are incrementing the value.

@jdn5126 jdn5126 merged commit 906b6a4 into aws:master Nov 21, 2022
@jayanthvn jayanthvn added this to the v1.12.1 milestone Nov 23, 2022
jdn5126 added a commit that referenced this pull request Dec 12, 2022
* create publisher with logger (#2119)

* Add missing rules when NodePort support is disabled (#2026)

* Add missing rules when NodePort support is disabled

* the rules that need to be installed for NodePort support and SNAT
  support are very similar. The same traffic mark is needed for both. As
  a result, rules that are currently installed only when NodePort
  support is enabled should also be installed when external SNAT is
  disabled, which is the case by default.
* remove "-m state --state NEW" from a rule in the nat table. This is
  always true for packets that traverse the nat table.
* fix typo in one rule's name (extra whitespace).

Fixes #2025

Co-authored-by: Quan Tian <[email protected]>

Signed-off-by: Antonin Bas <[email protected]>

* Fix typos and unit tests

Signed-off-by: Antonin Bas <[email protected]>

* Minor improvement to code comment

Signed-off-by: Antonin Bas <[email protected]>

* Address review comments

* Delete legacy nat rule
* Fix an unrelated log message

Signed-off-by: Antonin Bas <[email protected]>

Signed-off-by: Antonin Bas <[email protected]>
Co-authored-by: Jayanth Varavani <[email protected]>
Co-authored-by: Sushmitha Ravikumar <[email protected]>

* downgrade test go.mod to align with root go.mod (#2128)

* skip addon installation when addon info is not available (#2131)

* Merging test/Makefile and test/go.mod to the root Makefile and go.mod, adjust the .github/workflows and integration test instructions (#2129)

* update troubleshooting docs for CNI image (#2132)

fix location where make command is run

* fix env name in test script (#2136)

* optionally allow CLUSTER_ENDPOINT to be used rather than the cluster-ip (#2138)

* optionally allow CLUSTER_ENDPOINT to be used rather than the kubernetes cluster ip

* remove check for kube-proxy

* add version to readme

* Add resources config option to cni metrics helper (#2141)

* Add resources config option to cni metrics helper

* Remove default-empty resources block; replace with conditional

* Add metrics for ec2 api calls made by CNI and expose via prometheus (#2142)

Co-authored-by: Jay Deokar <[email protected]>

* increase workflow role duration to 4 hours (#2148)

* Update golang 1.19.2 EKS-D (#2147)

* Update golang

* Move to EKS distro builds

* [HELM]: Move CRD resources to a separate folder as per helm standard (#2144)

Co-authored-by: Jay Deokar <[email protected]>

* VPC-CNI minimal image builds (#2146)

* VPC-CNI minimal image builds

* update dependencies for ginkgo when running integration tests

* address review comments and break up init main function

* review comments for sysctl

* Simplify binary installation, fix review comments

Since init container is required to always run, let binary installation
for external plugins happen in init container. This simplifies the main
container entrypoint and the dockerfile for each image.

* when IPAMD connection fails, try to teardown pod network using prevResult (#2145)

* add env var to enable nftables (#2155)

* fix failing weekly cron tests (#2154)

* Deprecate AWS_VPC_K8S_CNI_CONFIGURE_RPFILTER and remove no-op setter (#2153)

* Deprecate AWS_VPC_K8S_CNI_CONFIGURE_RPFILTER

* update release version comments

Signed-off-by: Antonin Bas <[email protected]>
Co-authored-by: Jeffrey Nelson <[email protected]>
Co-authored-by: Antonin Bas <[email protected]>
Co-authored-by: Jayanth Varavani <[email protected]>
Co-authored-by: Sushmitha Ravikumar <[email protected]>
Co-authored-by: Jerry He <[email protected]>
Co-authored-by: Brandon Wagner <[email protected]>
Co-authored-by: Jonathan Ogilvie <[email protected]>
Co-authored-by: Jay Deokar <[email protected]>
3 participants