Add metrics for ec2 api calls made by CNI and expose via prometheus #2142
Changes look good to me overall. Two high-level questions:
- How did you make sure that you did not miss any EC2 calls? Did you just grep on all of the EC2 wrapper functions?
- Do you think it makes sense to also count EC2 errors? I wonder if we should additionally count the errors per API call.
Also, can you run the metrics-helper integration tests against these changes for coverage?
Good to add some UTs and integration tests around this.
Let's add some unit test cases in https://github.com/aws/amazon-vpc-cni-k8s/blob/master/cmd/cni-metrics-helper/metrics/metrics_test.go and an integration test case in https://github.com/aws/amazon-vpc-cni-k8s/blob/master/test/integration/metrics-helper/metric_helper_test.go . Let's also add a per-API error count like
Added a unit test and an integration test for this change. Also added a new metric for tracking the EC2 API failure count.
awscni_ec2api_req_count{fn="AssignPrivateIpAddresses"} 1
awscni_ec2api_req_count{fn="AttachNetworkInterface"} 1
awscni_ec2api_req_count{fn="CreateNetworkInterface"} 1
awscni_ec2api_req_count{fn="DeleteNetworkInterface"} 1
Shouldn't CreateTags be listed here as well?
It has been added to the tracked calls, but it looks like that function had not been hit yet when I captured the logs. Do we want to explicitly check for that scenario?
Nvm, we are incrementing the value.
* create publisher with logger (#2119)
* Add missing rules when NodePort support is disabled (#2026)
  * Add missing rules when NodePort support is disabled
    * the rules that need to be installed for NodePort support and SNAT support are very similar. The same traffic mark is needed for both. As a result, rules that are currently installed only when NodePort support is enabled should also be installed when external SNAT is disabled, which is the case by default.
    * remove "-m state --state NEW" from a rule in the nat table. This is always true for packets that traverse the nat table.
    * fix typo in one rule's name (extra whitespace).
    Fixes #2025
    Co-authored-by: Quan Tian <[email protected]>
    Signed-off-by: Antonin Bas <[email protected]>
  * Fix typos and unit tests
    Signed-off-by: Antonin Bas <[email protected]>
  * Minor improvement to code comment
    Signed-off-by: Antonin Bas <[email protected]>
  * Address review comments
  * Delete legacy nat rule
  * Fix an unrelated log message
    Signed-off-by: Antonin Bas <[email protected]>
  Signed-off-by: Antonin Bas <[email protected]>
  Co-authored-by: Jayanth Varavani <[email protected]>
  Co-authored-by: Sushmitha Ravikumar <[email protected]>
* downgrade test go.mod to align with root go.mod (#2128)
* skip addon installation when addon info is not available (#2131)
* Merging test/Makefile and test/go.mod to the root Makefil and go.mod, adjust the .github/workflows and integration test instructions (#2129)
* update troubleshooting docs for CNI image (#2132)
  fix location where make command is run
* fix env name in test script (#2136)
* optionally allow CLUSTER_ENDPOINT to be used rather than the cluster-ip (#2138)
  * optionally allow CLUSTER_ENDPOINT to be used rather than the kubernetes cluster ip
  * remove check for kube-proxy
  * add version to readme
* Add resources config option to cni metrics helper (#2141)
  * Add resources config option to cni metrics helper
  * Remove default-empty resources block; replace with conditional
* Add metrics for ec2 api calls made by CNI and expose via prometheus (#2142)
  Co-authored-by: Jay Deokar <[email protected]>
* increase workflow role duration to 4 hours (#2148)
* Update golang 1.19.2 EKS-D (#2147)
  * Update golang
  * Move to EKS distro builds
* [HELM]: Move CRD resources to a separate folder as per helm standard (#2144)
  Co-authored-by: Jay Deokar <[email protected]>
* VPC-CNI minimal image builds (#2146)
  * VPC-CNI minimal image builds
  * update dependencies for ginkgo when running integration tests
  * address review comments and break up init main function
  * review comments for sysctl
  * Simplify binary installation, fix review comments
    Since init container is required to always run, let binary installation for external plugins happen in init container. This simplifies the main container entrypoint and the dockerfile for each image.
* when IPAMD connection fails, try to teardown pod network using prevResult (#2145)
* add env var to enable nftables (#2155)
* fix failing weekly cron tests (#2154)
* Deprecate AWS_VPC_K8S_CNI_CONFIGURE_RPFILTER and remove no-op setter (#2153)
  * Deprecate AWS_VPC_K8S_CNI_CONFIGURE_RPFILTER
  * update release version comments
Signed-off-by: Antonin Bas <[email protected]>
Co-authored-by: Jeffrey Nelson <[email protected]>
Co-authored-by: Antonin Bas <[email protected]>
Co-authored-by: Jayanth Varavani <[email protected]>
Co-authored-by: Sushmitha Ravikumar <[email protected]>
Co-authored-by: Jerry He <[email protected]>
Co-authored-by: Brandon Wagner <[email protected]>
Co-authored-by: Jonathan Ogilvie <[email protected]>
Co-authored-by: Jay Deokar <[email protected]>
What type of PR is this? feature
Which issue does this PR fix: #2035
What does this PR do / Why do we need it:
We don't currently collect metrics about the number of calls made to EC2 APIs. This change adds the capability to track the number of EC2 API calls made by IPAMD. cni-metrics-helper is configured to collect these metrics and push them to CloudWatch.
If an issue # is not available please add repro steps and logs from IPAMD/CNI showing the issue:
N/A
Testing done on this change:
Yes, this has been tested on an EKS 1.23 cluster.
Automation added to e2e:
N/A
Will this PR introduce any new dependencies?:
N/A
Will this break upgrades or downgrades. Has updating a running cluster been tested?:
No breaking changes. Yes, tested on an EKS 1.23 cluster.
Does this change require updates to the CNI daemonset config files to work?:
Yes, updated the image tags for the CNI and cni-metrics-helper images.
Does this PR introduce any user-facing change?:
Yes
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.