From ff9a1de9f6df6086762f02ee12e86740818838d6 Mon Sep 17 00:00:00 2001 From: Maciej Szulik Date: Thu, 9 Jun 2022 19:17:44 +0200 Subject: [PATCH 1/2] cronjob timezone to beta --- keps/prod-readiness/sig-apps/3140.yaml | 2 + .../README.md | 230 +++++------------- .../3140-TimeZone-support-in-CronJob/kep.yaml | 4 +- 3 files changed, 70 insertions(+), 166 deletions(-) diff --git a/keps/prod-readiness/sig-apps/3140.yaml b/keps/prod-readiness/sig-apps/3140.yaml index 7d2c3015ef0..d7d2dab9bb0 100644 --- a/keps/prod-readiness/sig-apps/3140.yaml +++ b/keps/prod-readiness/sig-apps/3140.yaml @@ -1,3 +1,5 @@ kep-number: 3140 alpha: approver: deads2k +beta: + approver: deads2k diff --git a/keps/sig-apps/3140-TimeZone-support-in-CronJob/README.md b/keps/sig-apps/3140-TimeZone-support-in-CronJob/README.md index 73ad19142ee..3aceef200ce 100644 --- a/keps/sig-apps/3140-TimeZone-support-in-CronJob/README.md +++ b/keps/sig-apps/3140-TimeZone-support-in-CronJob/README.md @@ -13,6 +13,10 @@ - [CronJob API](#cronjob-api) - [CronJob controller](#cronjob-controller) - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) - [Graduation Criteria](#graduation-criteria) - [Alpha](#alpha) - [Beta](#beta) @@ -39,17 +43,17 @@ Items marked with (R) are required *prior to targeting to a milestone / release* - [x] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR) - [x] (R) KEP approvers have approved the KEP status as `implementable` - [x] (R) Design details are appropriately documented -- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) - - [ ] e2e Tests for all Beta API Operations (endpoints) - - [ ] (R) Ensure GA e2e tests for meet requirements for [Conformance 
Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) - - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free -- [ ] (R) Graduation criteria is in place - - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) +- [x] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - [x] e2e Tests for all Beta API Operations (endpoints) + - [x] (R) Ensure GA e2e tests for meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - [x] (R) Minimum Two Week Window for GA e2e tests to prove flake free +- [x] (R) Graduation criteria is in place + - [x] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) - [x] (R) Production readiness review completed - [x] (R) Production readiness review approved - [x] "Implementation History" section is up-to-date for milestone -- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] -- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes +- [x] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io] +- [x] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes - ###### How can a rollout or rollback fail? Can it impact already running workloads? 
- - An upgrade flow can be vulnerable to the enable, disable, enable if you have a lease that is acquired by a new kube-controller-manager, then an old kube-controller-manager, then a new kube-controller-manager. ###### What specific metrics should inform a rollback? - +Increased `cronjob_job_creation_skew` which tracks how much a job creation +is delayed compared to requested time slot. ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? - +Upgrade->downgrade->upgrade path was manually tested. No issues were found during tests. ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? - +No. ### Monitoring Requirements - - ###### How can an operator determine if the feature is in use by workloads? - +There's no explicit metric for TimeZone but operator should monitor `cronjob_job_creation_skew`, +ensuring the job creation skew is not increasing. ###### How can someone using this feature know that it is working for their instance? - - -- [ ] Events - - Event Reason: -- [ ] API .status - - Condition name: - - Other field: -- [ ] Other (treat as last resort) - - Details: +- [x] Events + - Event Reason: `UnknownTimeZone` when specified TimeZone is not correct ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? - +99th percentile of cron_job_creation_skew <= 5 seconds per cluster-day. ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? - - - [x] Metrics - Metric name: `cronjob_controller_rate_limiter_use` - Components exposing the metric: `kube-controller-manager` -- [ ] Other (treat as last resort) - - Details: + - Metric name: `cron_job_creation_skew` + - Components exposing the metric: `kube-controller-manager` + ###### Are there any missing metrics that would be useful to have to improve observability of this feature? - +No. 
### Dependencies - - ###### Does this feature depend on any specific services running in the cluster? - +None. ### Scalability - - ###### Will enabling / using this feature result in any new API calls? No new API calls are expected. @@ -455,67 +382,42 @@ We're not using it, yet. ### Troubleshooting - - ###### How does this feature react if the API server and/or etcd is unavailable? ###### What are other known failure modes? - +- [Incorrect TimeZone] + - Detection: `UnknownTimeZone` events being reported for a CronJob. + - Mitigations: Fix the TimeZone or suspend a CronJob. + - Diagnostics: Logs containing `TimeZone` phrase. + - Testing: A set of unit tests is ensuring that invalid TimeZone is properly + handled both in the apiserver and in the controller itself, reporting to + user the problem. + ###### What steps should be taken if SLOs are not being met to determine the problem? +If possible increase the log level for kube-controller-manager and check cronjob's +controller logs looking for warnings and errors which might point where the problem +lies. + ## Implementation History - +- *2022-01-14* - Initial KEP draft +- *2022-06-09* - Updated KEP for beta promotion. ## Drawbacks - +Using TimeZone might be simpler for users working with a cluster in different +TimeZones, but adds additional complexity to the code and to the operator +who will need to re-calculate when an actual CronJob will be creating a Job +when `.spec.timeZone` is set. ## Alternatives Another approach was to specify time zone as an offset to UTC, but using the name instead seems more user friendly. - - ## Infrastructure Needed (Optional) - +None. 
diff --git a/keps/sig-apps/3140-TimeZone-support-in-CronJob/kep.yaml b/keps/sig-apps/3140-TimeZone-support-in-CronJob/kep.yaml
index a5513ebb7b0..0083311b036 100644
--- a/keps/sig-apps/3140-TimeZone-support-in-CronJob/kep.yaml
+++ b/keps/sig-apps/3140-TimeZone-support-in-CronJob/kep.yaml
@@ -18,12 +18,12 @@ see-also:
 replaces:
 
 # The target maturity stage in the current dev cycle for this KEP.
-stage: alpha
+stage: beta
 
 # The most recent milestone for which work toward delivery of this KEP has been
 # done. This can be the current (upcoming) milestone, if it is being actively
 # worked on.
-latest-milestone: "v1.24"
+latest-milestone: "v1.25"
 
 # The milestone at which this feature was, or is targeted to be, at each stage.
 milestone:

From 59553e3750e9034374db076bd59063da2198afdc Mon Sep 17 00:00:00 2001
From: Maciej Szulik
Date: Mon, 13 Jun 2022 13:23:25 +0200
Subject: [PATCH 2/2] Address comments

---
 .../README.md                                 | 32 ++++++++++++++++---
 .../3140-TimeZone-support-in-CronJob/kep.yaml |  2 +-
 2 files changed, 29 insertions(+), 5 deletions(-)

diff --git a/keps/sig-apps/3140-TimeZone-support-in-CronJob/README.md b/keps/sig-apps/3140-TimeZone-support-in-CronJob/README.md
index 3aceef200ce..0fd7ea817ee 100644
--- a/keps/sig-apps/3140-TimeZone-support-in-CronJob/README.md
+++ b/keps/sig-apps/3140-TimeZone-support-in-CronJob/README.md
@@ -328,7 +328,7 @@ ensuring the job creation skew is not increasing.
 
 ###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
 
-99th percentile of cron_job_creation_skew <= 5 seconds per cluster-day.
+99th percentile over day for cron_job_creation_skew is <= 15s
 
 ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
 
@@ -347,7 +347,25 @@ No.
 
 ###### Does this feature depend on any specific services running in the cluster?
 
-None.
+CronJob's TimeZone support relies on external TimeZone package, if one is missing
+golang's internal package will be used, instead.
+
+- kube-controller-manager and kube-apiserver
+  - Usage description:
+    Both kube-controller-manager and kube-apiserver need to have `CronJobTimeZone`
+    feature gate turned for this feature to fully work.
+  - Impact of its outage on the feature:
+    CronJob's TimeZone functionality will not work.
+  - Impact of its degraded performance or high-error rates on the feature:
+    Delays in creating new Jobs.
+
+- TimeZone package
+  - Usage description: CronJob's TimeZone support relies on external TimeZone package,
+    if one is missing golang's internal package will be used, instead.
+  - Impact of its outage on the feature:
+    TimeZone functionality will not work.
+  - Impact of its degraded performance or high-error rates on the feature:
+    Delays in creating new Jobs.
 
 ### Scalability
 
@@ -386,14 +404,20 @@ We're not using it, yet.
 
 ###### What are other known failure modes?
 
-- [Incorrect TimeZone]
+- Incorrect TimeZone
   - Detection: `UnknownTimeZone` events being reported for a CronJob.
   - Mitigations: Fix the TimeZone or suspend a CronJob.
   - Diagnostics: Logs containing `TimeZone` phrase.
   - Testing: A set of unit tests is ensuring that invalid TimeZone is properly
     handled both in the apiserver and in the controller itself, reporting to
     user the problem.
-
+- Job creation problems
+  - Detection: `cron_job_creation_skew` metric is exceeding expected 15s per day.
+  - Mitigations: Disable `CronJobTimeZone` feature gate.
+  - Diagnostics: Check logs from CronJob controller.
+  - Testing: A set of unit tests is ensuring that invalid TimeZone is properly
+    handled both in the apiserver and in the controller itself, reporting to
+    user the problem.
 
 ###### What steps should be taken if SLOs are not being met to determine the problem?
 
diff --git a/keps/sig-apps/3140-TimeZone-support-in-CronJob/kep.yaml b/keps/sig-apps/3140-TimeZone-support-in-CronJob/kep.yaml
index 0083311b036..9db242e20fd 100644
--- a/keps/sig-apps/3140-TimeZone-support-in-CronJob/kep.yaml
+++ b/keps/sig-apps/3140-TimeZone-support-in-CronJob/kep.yaml
@@ -42,4 +42,4 @@ disable-supported: true
 
 # The following PRR answers are required at beta release
 metrics:
-# - my_feature_metric
+  - cron_job_creation_skew
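
Reviewer note on the "TimeZone package" dependency described in the README change: the sketch below shows one way a Go program can validate a time zone name against the system's time zone database while keeping an embedded copy as a fallback. It is an illustration under stated assumptions, not the CronJob controller's code. The helper name `validateTimeZone` is hypothetical, and the blank import of `time/tzdata` is assumed here to play the role of "golang's internal package".

```go
package main

import (
	"fmt"
	"time"

	// Assumption: embedding Go's bundled copy of the IANA database gives
	// time.LoadLocation a fallback when the host has no tzdata installed.
	_ "time/tzdata"
)

// validateTimeZone is a hypothetical helper (not taken from the patch).
// It reports an error when name cannot be resolved to a time zone.
func validateTimeZone(name string) error {
	if _, err := time.LoadLocation(name); err != nil {
		return fmt.Errorf("unknown time zone %q: %w", name, err)
	}
	return nil
}

func main() {
	// The zone names below are sample inputs chosen for illustration.
	for _, tz := range []string{"Europe/Warsaw", "Not/AZone"} {
		if err := validateTimeZone(tz); err != nil {
			fmt.Println("invalid:", err)
			continue
		}
		fmt.Println("valid:", tz)
	}
}
```

A check of this kind corresponds to the "Incorrect TimeZone" failure mode in the patch, where an unresolvable name is reported to the user through an `UnknownTimeZone` event.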