This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Add System Level Timeout #109

Closed · wants to merge 13 commits

Conversation

migueltol22 (Contributor)

TL;DR

Handles system-level timeouts in two cases: PodQueued and PhaseInitializing.

PodQueued applies to interruptible tasks: if a pod is stuck in Pending for longer than maxSystemLevelTimeout, we assume there are not enough resources in the ASG, fail with a system-level error, and retry on the non-interruptible ASG.

PhaseInitializing covers the case where some system-level error leaves the pod stuck initializing: we fail with a system-level error and retry launching the pod.
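To make the two checks concrete, here is a minimal, self-contained Go sketch of the logic described above. The helper systemTimeoutExceeded, the label key, and the 20-minute value are illustrative assumptions, not this PR's exact identifiers.

```go
package main

import (
	"fmt"
	"time"
)

// systemTimeoutExceeded reports whether a pod has been stuck since startTime
// for longer than the configured system-level timeout; 0 disables the check.
func systemTimeoutExceeded(startTime time.Time, maxSystemLevelTimeout time.Duration) bool {
	return maxSystemLevelTimeout > 0 && time.Since(startTime) > maxSystemLevelTimeout
}

func main() {
	labels := map[string]string{"interruptible": "true"} // stand-in for pod labels
	podStart := time.Now().Add(-30 * time.Minute)        // pod has been stuck for 30m
	maxTimeout := 20 * time.Minute                       // stand-in for maxSystemLevelTimeout

	// Case 1 (PodQueued): an interruptible pod stuck in Pending past the timeout
	// is failed with a system-level error and retried on the non-interruptible ASG.
	if labels["interruptible"] == "true" && systemTimeoutExceeded(podStart, maxTimeout) {
		fmt.Println("PodQueued: system-level timeout, retry on non-interruptible ASG")
	}

	// Case 2 (PhaseInitializing): a pod stuck initializing past the timeout is
	// failed with a system-level error and relaunched.
	if systemTimeoutExceeded(podStart, maxTimeout) {
		fmt.Println("PhaseInitializing: system-level timeout, relaunch pod")
	}
}
```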

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

How did you fix the bug, implement the feature, etc.? Link to any design docs.

Tracking Issue

https://github.com/lyft/flyte/issues/

Follow-up issue

flyteorg/flyte#382

codecov-commenter commented Aug 5, 2020

Codecov Report

Merging #109 into master will increase coverage by 2.84%.
The diff coverage is 26.92%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #109      +/-   ##
==========================================
+ Coverage   55.54%   58.39%   +2.84%     
==========================================
  Files          99       99              
  Lines        5914     6225     +311     
==========================================
+ Hits         3285     3635     +350     
+ Misses       2318     2229      -89     
- Partials      311      361      +50     
| Impacted Files | Coverage Δ |
| --- | --- |
| go/tasks/pluginmachinery/flytek8s/config/config.go | 50.00% <ø> (ø) |
| go/tasks/plugins/array/k8s/monitor.go | 57.30% <0.00%> (ø) |
| go/tasks/plugins/k8s/sidecar/sidecar.go | 75.72% <0.00%> (ø) |
| go/tasks/pluginmachinery/flytek8s/pod_helper.go | 60.30% <22.72%> (+0.93%) ⬆️ |
| ...machinery/flytek8s/config/k8spluginconfig_flags.go | 54.83% <100.00%> (+1.50%) ⬆️ |
| go/tasks/plugins/k8s/container/container.go | 80.48% <100.00%> (ø) |
| go/tasks/plugins/k8s/sagemaker/config/config.go | 0.00% <0.00%> (ø) |
| ...asks/pluginmachinery/flytek8s/k8s_resource_adds.go | 95.55% <0.00%> (+0.20%) ⬆️ |
| ...o/tasks/pluginmachinery/ioutils/raw_output_path.go | 82.25% <0.00%> (+2.94%) ⬆️ |
| ... and 6 more | |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 85ef53c...6969027.

go/tasks/pluginmachinery/flytek8s/config/config.go (outdated review thread, resolved)
// If the pod is interruptible and is waiting to be scheduled for an extended amount of time, it is possible there are
// no spot instances available in the AZ. In this case, we timeout with a system level error and will retry on a
// non spot instance AZ.
if val, ok := pod.ObjectMeta.Labels["interruptible"]; ok {
Contributor:

Please move the interruptible string into a const

Contributor Author:

done.

for _, c := range status.Conditions {
	switch c.Type {
	case v1.PodScheduled:
		if c.Status == v1.ConditionFalse {
			// If the pod is interruptible and is waiting to be scheduled for an extended amount of time, it is possible there are
Contributor:

ConditionFalse I think means this condition is no longer the "current" but it'll never be removed from the list of conditions. I do not think we should apply the timeout in this case...

Contributor Author:

Yep, took a look at the docs and it appears to mean that the condition is not applicable. Removed.

Contributor:

So what if we do not have any system level timeouts? Can there be a -1 escape hatch?

@@ -153,6 +180,17 @@ func DemystifyPending(status v1.PodStatus) (pluginsCore.PhaseInfo, error) {
finalMessage := fmt.Sprintf("%s|%s", c.Message, containerStatus.State.Waiting.Message)
switch reason {
case "ErrImagePull", "ContainerCreating", "PodInitializing":
Contributor:

Anand mentioned there are a couple more errors we should add here...
@anandswaminathan can you remind me?

Contributor:

Yes. We can follow up, should be easy to add
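For illustration only (the thread does not name the extra reasons), a follow-up might fold additional transient waiting reasons into the same switch. The two extra reasons below are assumptions, not anything confirmed in this thread:

```go
package main

import "fmt"

// stillInitializing sketches how more waiting reasons could be added to the
// switch quoted above. "ImagePullBackOff" and "CreateContainerError" are
// assumptions; the thread does not say which reasons were meant.
func stillInitializing(reason string) bool {
	switch reason {
	case "ErrImagePull", "ContainerCreating", "PodInitializing",
		"ImagePullBackOff", "CreateContainerError":
		// Still pending for a (possibly transient) system-side reason, so the
		// system-level timeout should continue to apply.
		return true
	}
	return false
}

func main() {
	fmt.Println(stillInitializing("ImagePullBackOff")) // true
}
```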

migueltol22 requested a review from EngHabu on August 6, 2020 at 20:54
// If the pod is interruptible and is waiting to be scheduled for an extended amount of time, it is possible there are
// no spot instances available in the AZ. In this case, we timeout with a system level error and will retry on a
// non spot instance AZ.
if val, ok := pod.ObjectMeta.Labels[Interruptible]; ok {
Contributor:

Should we add a check here for c.Status == v1.ConditionTrue, or is Unschedulable a terminal condition that the pod never exits once entered?

Contributor Author:

I found this comment for Unschedulable

// PodReasonUnschedulable reason in PodScheduled PodCondition means that the scheduler
// can't schedule the pod right now, for example due to insufficient resources in the cluster

However, the status being checked here is the ConditionStatus, and ConditionTrue appears to mean that the resource is in that condition.

// These are valid condition statuses. "ConditionTrue" means a resource is in the condition.
// "ConditionFalse" means a resource is not in the condition. "ConditionUnknown" means kubernetes
// can't decide if a resource is in the condition or not. In the future, we could add other
// intermediate conditions, e.g. ConditionDegraded.
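As a minimal sketch of those ConditionStatus semantics applied to the PodScheduled condition type (assumes k8s.io/api/core/v1; the sample condition values are fabricated for the example):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// describeScheduling illustrates the ConditionStatus semantics quoted above
// for the PodScheduled condition type.
func describeScheduling(c v1.PodCondition) string {
	if c.Type != v1.PodScheduled {
		return "not a scheduling condition"
	}
	switch c.Status {
	case v1.ConditionTrue:
		return "pod is scheduled (the resource is in the condition)"
	case v1.ConditionFalse:
		// Reason is typically v1.PodReasonUnschedulable when the scheduler
		// cannot place the pod, e.g. due to insufficient resources.
		return "pod is not scheduled; reason: " + c.Reason
	default:
		return "kubernetes can't decide (ConditionUnknown)"
	}
}

func main() {
	fmt.Println(describeScheduling(v1.PodCondition{
		Type:   v1.PodScheduled,
		Status: v1.ConditionFalse,
		Reason: v1.PodReasonUnschedulable,
	}))
}
```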

Contributor:

Hm, in that case wouldn't we want the check for "ConditionTrue"?

Contributor Author:

Added the check.

// To help mitigate the pod being stuck in this state we have a system level timeout that will error out
// as a system error and retry launching the pod.
if timeout > 0 && time.Since(status.StartTime.Time) > timeout {
	return pluginsCore.PhaseInfoRetryableFailure(
Contributor:

@anandswaminathan
Contributor:
cc @kumare3 as haytham is out.

// non spot instance AZ.
if val, ok := pod.ObjectMeta.Labels["interruptible"]; ok {
	if val == "true" && k8s.GetConfig().MaxSystemLevelTimeout > 0 && elapsedtime.Minutes() > k8s.GetConfig().MaxSystemLevelTimeout {
		return pluginsCore.PhaseInfoFailed(c.LastTransitionTime.Time, &idlCore.ExecutionError_SYSTEM{
Contributor:

Why not use the system failure method?

Contributor:

Ohh I see, you want to add the time here?

@@ -77,7 +77,7 @@ type K8sPluginConfig struct {
 	// Flyte CoPilot Configuration
 	CoPilot FlyteCoPilotConfig `json:"co-pilot" pflag:",Co-Pilot Configuration"`
 	// Set system level timeout. If timeout reached pod will be relaunched.
-	MaxSystemLevelTimeout config2.Duration `json:"maxSystemLevelTimeout" pflag:"-,Value to be used for system level timeouts in minutes."`
+	MaxSystemLevelTimeout config2.Duration `json:"max-system-level-timeout" pflag:"-,Value to be used for system level timeouts in minutes."`
Contributor:

Can we improve the description? I do not follow the description, actually.
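Picking up that ask, a hedged sketch of a clearer field description follows. config2.Duration is stood in by time.Duration so the example is self-contained, and the wording is a suggestion, not the PR's final text:

```go
package main

import (
	"fmt"
	"time"
)

// K8sPluginConfig sketch, reduced to the field under discussion.
type K8sPluginConfig struct {
	// MaxSystemLevelTimeout bounds how long a pod may remain queued or
	// initializing before the plugin fails it with a system-level error and
	// relaunches it. Zero (or a negative value) disables the timeout.
	MaxSystemLevelTimeout time.Duration `json:"max-system-level-timeout"`
}

func main() {
	cfg := K8sPluginConfig{MaxSystemLevelTimeout: 20 * time.Minute}
	elapsed := 25 * time.Minute // pretend the pod has been stuck this long

	// Durations are compared directly, avoiding unit confusion from Minutes().
	if cfg.MaxSystemLevelTimeout > 0 && elapsed > cfg.MaxSystemLevelTimeout {
		fmt.Println("system-level timeout exceeded; relaunch pod")
	}
}
```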

kumare3 (Contributor) commented Mar 17, 2021

@migueltol22 are we reviving this?
