This repository has been archived by the owner on Oct 9, 2023. It is now read-only.

Add System Level Timeout #109

Closed · wants to merge 13 commits

Conversation

migueltol22 (Contributor)

TL;DR

Handles system-level timeouts in two cases: PodQueued and PhaseInitializing.

PodQueued applies to interruptible tasks: if a pod is stuck in Pending for longer than maxSystemLevelTimeout, we assume there are not enough resources in the ASG, fail with a system-level error, and retry on the non-interruptible ASG.

PhaseInitializing covers the case where some system-level error leaves the pod stuck initializing: we fail with a system-level error and retry launching the pod.
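To make the two checks concrete, here is a minimal, self-contained Go sketch of the logic described above. The helper systemTimeoutExceeded, the label key, and the 20-minute value are illustrative assumptions, not this PR's exact identifiers.

```go
package main

import (
	"fmt"
	"time"
)

// systemTimeoutExceeded reports whether a pod has been stuck since startTime
// for longer than the configured system-level timeout; 0 disables the check.
func systemTimeoutExceeded(startTime time.Time, maxSystemLevelTimeout time.Duration) bool {
	return maxSystemLevelTimeout > 0 && time.Since(startTime) > maxSystemLevelTimeout
}

func main() {
	labels := map[string]string{"interruptible": "true"} // stand-in for pod labels
	podStart := time.Now().Add(-30 * time.Minute)        // pod has been stuck for 30m
	maxTimeout := 20 * time.Minute                       // stand-in for maxSystemLevelTimeout

	// Case 1 (PodQueued): an interruptible pod stuck in Pending past the timeout
	// is failed with a system-level error and retried on the non-interruptible ASG.
	if labels["interruptible"] == "true" && systemTimeoutExceeded(podStart, maxTimeout) {
		fmt.Println("PodQueued: system-level timeout, retry on non-interruptible ASG")
	}

	// Case 2 (PhaseInitializing): a pod stuck initializing past the timeout is
	// failed with a system-level error and relaunched.
	if systemTimeoutExceeded(podStart, maxTimeout) {
		fmt.Println("PhaseInitializing: system-level timeout, relaunch pod")
	}
}
```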

Type

  • Bug Fix
  • Feature
  • Plugin

Are all requirements met?

  • Code completed
  • Smoke tested
  • Unit tests added
  • Code documentation added
  • Any pending items have an associated Issue

Complete description

How did you fix the bug, implement the feature, etc.? Link to any design docs.

Tracking Issue

https://github.com/lyft/flyte/issues/

Follow-up issue

flyteorg/flyte#382

codecov-commenter commented Aug 5, 2020

Codecov Report

Merging #109 into master will increase coverage by 2.84%.
The diff coverage is 26.92%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #109      +/-   ##
==========================================
+ Coverage   55.54%   58.39%   +2.84%     
==========================================
  Files          99       99              
  Lines        5914     6225     +311     
==========================================
+ Hits         3285     3635     +350     
+ Misses       2318     2229      -89     
- Partials      311      361      +50     
| Impacted Files | Coverage Δ |
| --- | --- |
| go/tasks/pluginmachinery/flytek8s/config/config.go | 50.00% <ø> (ø) |
| go/tasks/plugins/array/k8s/monitor.go | 57.30% <0.00%> (ø) |
| go/tasks/plugins/k8s/sidecar/sidecar.go | 75.72% <0.00%> (ø) |
| go/tasks/pluginmachinery/flytek8s/pod_helper.go | 60.30% <22.72%> (+0.93%) ⬆️ |
| ...machinery/flytek8s/config/k8spluginconfig_flags.go | 54.83% <100.00%> (+1.50%) ⬆️ |
| go/tasks/plugins/k8s/container/container.go | 80.48% <100.00%> (ø) |
| go/tasks/plugins/k8s/sagemaker/config/config.go | 0.00% <0.00%> (ø) |
| ...asks/pluginmachinery/flytek8s/k8s_resource_adds.go | 95.55% <0.00%> (+0.20%) ⬆️ |
| ...o/tasks/pluginmachinery/ioutils/raw_output_path.go | 82.25% <0.00%> (+2.94%) ⬆️ |
| ... and 6 more | |

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 85ef53c...6969027.

go/tasks/pluginmachinery/flytek8s/config/config.go (outdated review thread, resolved)
// If the pod is interruptible and is waiting to be scheduled for an extended amount of time, it is possible there are
// no spot instances available in the AZ. In this case, we timeout with a system level error and will retry on a
// non spot instance AZ.
if val, ok := pod.ObjectMeta.Labels["interruptible"]; ok {
Contributor:

Please move the interruptible string into a const

Contributor Author:

done.

for _, c := range status.Conditions {
	switch c.Type {
	case v1.PodScheduled:
		if c.Status == v1.ConditionFalse {
			// If the pod is interruptible and is waiting to be scheduled for an extended amount of time, it is possible there are
Contributor:

ConditionFalse I think means this condition is no longer the "current" but it'll never be removed from the list of conditions. I do not think we should apply the timeout in this case...

Contributor Author:

Yep, took a look at the docs and it appears to mean that the condition is not applicable. Removed.

Contributor:

So what if we do not have any system level timeouts? Can there be a -1 escape hatch?

@@ -153,6 +180,17 @@ func DemystifyPending(status v1.PodStatus) (pluginsCore.PhaseInfo, error) {
finalMessage := fmt.Sprintf("%s|%s", c.Message, containerStatus.State.Waiting.Message)
switch reason {
case "ErrImagePull", "ContainerCreating", "PodInitializing":
Contributor:

Anand mentioned there are a couple more errors we should add here...
@anandswaminathan can you remind me?

Contributor:

Yes. We can follow up, should be easy to add
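For illustration only (the thread does not name the extra reasons), a follow-up might fold additional transient waiting reasons into the same switch. The two extra reasons below are assumptions, not anything confirmed in this thread:

```go
package main

import "fmt"

// stillInitializing sketches how more waiting reasons could be added to the
// switch quoted above. "ImagePullBackOff" and "CreateContainerError" are
// assumptions; the thread does not say which reasons were meant.
func stillInitializing(reason string) bool {
	switch reason {
	case "ErrImagePull", "ContainerCreating", "PodInitializing",
		"ImagePullBackOff", "CreateContainerError":
		// Still pending for a (possibly transient) system-side reason, so the
		// system-level timeout should continue to apply.
		return true
	}
	return false
}

func main() {
	fmt.Println(stillInitializing("ImagePullBackOff")) // true
}
```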

migueltol22 requested a review from EngHabu on August 6, 2020 at 20:54
// If the pod is interruptible and is waiting to be scheduled for an extended amount of time, it is possible there are
// no spot instances available in the AZ. In this case, we timeout with a system level error and will retry on a
// non spot instance AZ.
if val, ok := pod.ObjectMeta.Labels[Interruptible]; ok {
Contributor:

Should we add a check here for c.Status == v1.ConditionTrue, or is Unschedulable a terminal condition that the pod never exits once entered?

Contributor Author:

I found this comment for Unschedulable

// PodReasonUnschedulable reason in PodScheduled PodCondition means that the scheduler
// can't schedule the pod right now, for example due to insufficient resources in the cluster

However, the status being checked here is the ConditionStatus, and ConditionTrue appears to mean that the resource is in that condition.

// These are valid condition statuses. "ConditionTrue" means a resource is in the condition.
// "ConditionFalse" means a resource is not in the condition. "ConditionUnknown" means kubernetes
// can't decide if a resource is in the condition or not. In the future, we could add other
// intermediate conditions, e.g. ConditionDegraded.
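As a minimal sketch of those ConditionStatus semantics applied to the PodScheduled condition type (assumes k8s.io/api/core/v1; the sample condition values are fabricated for the example):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// describeScheduling illustrates the ConditionStatus semantics quoted above
// for the PodScheduled condition type.
func describeScheduling(c v1.PodCondition) string {
	if c.Type != v1.PodScheduled {
		return "not a scheduling condition"
	}
	switch c.Status {
	case v1.ConditionTrue:
		return "pod is scheduled (the resource is in the condition)"
	case v1.ConditionFalse:
		// Reason is typically v1.PodReasonUnschedulable when the scheduler
		// cannot place the pod, e.g. due to insufficient resources.
		return "pod is not scheduled; reason: " + c.Reason
	default:
		return "kubernetes can't decide (ConditionUnknown)"
	}
}

func main() {
	fmt.Println(describeScheduling(v1.PodCondition{
		Type:   v1.PodScheduled,
		Status: v1.ConditionFalse,
		Reason: v1.PodReasonUnschedulable,
	}))
}
```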

Contributor:

Hm, in that case wouldn't we want the check for "ConditionTrue"?

Contributor Author:

Added the check.

// To help mitigate the pod being stuck in this state we have a system level timeout that will error out
// as a system error and retry launching the pod.
if timeout > 0 && time.Since(status.StartTime.Time) > timeout {
	return pluginsCore.PhaseInfoRetryableFailure(
Contributor:

@anandswaminathan
Contributor:
cc @kumare3 as haytham is out.

// non spot instance AZ.
if val, ok := pod.ObjectMeta.Labels["interruptible"]; ok {
	if val == "true" && k8s.GetConfig().MaxSystemLevelTimeout > 0 && elapsedtime.Minutes() > k8s.GetConfig().MaxSystemLevelTimeout {
		return pluginsCore.PhaseInfoFailed(c.LastTransitionTime.Time, &idlCore.ExecutionError_SYSTEM{
Contributor:

Why not use the system failure method?

Contributor:

Ohh I see, you want to add the time here?

@@ -77,7 +77,7 @@ type K8sPluginConfig struct {
 	// Flyte CoPilot Configuration
 	CoPilot FlyteCoPilotConfig `json:"co-pilot" pflag:",Co-Pilot Configuration"`
 	// Set system level timeout. If timeout reached pod will be relaunched.
-	MaxSystemLevelTimeout config2.Duration `json:"maxSystemLevelTimeout" pflag:"-,Value to be used for system level timeouts in minutes."`
+	MaxSystemLevelTimeout config2.Duration `json:"max-system-level-timeout" pflag:"-,Value to be used for system level timeouts in minutes."`
Contributor:

Can we improve the description? I do not follow the description, actually.
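Picking up that ask, a hedged sketch of a clearer field description follows. config2.Duration is stood in by time.Duration so the example is self-contained, and the wording is a suggestion, not the PR's final text:

```go
package main

import (
	"fmt"
	"time"
)

// K8sPluginConfig sketch, reduced to the field under discussion.
type K8sPluginConfig struct {
	// MaxSystemLevelTimeout bounds how long a pod may remain queued or
	// initializing before the plugin fails it with a system-level error and
	// relaunches it. Zero (or a negative value) disables the timeout.
	MaxSystemLevelTimeout time.Duration `json:"max-system-level-timeout"`
}

func main() {
	cfg := K8sPluginConfig{MaxSystemLevelTimeout: 20 * time.Minute}
	elapsed := 25 * time.Minute // pretend the pod has been stuck this long

	// Durations are compared directly, avoiding unit confusion from Minutes().
	if cfg.MaxSystemLevelTimeout > 0 && elapsed > cfg.MaxSystemLevelTimeout {
		fmt.Println("system-level timeout exceeded; relaunch pod")
	}
}
```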

kumare3 (Contributor) commented Mar 17, 2021

@migueltol22 are we reviving this?
