Conversation
Codecov Report
@@ Coverage Diff @@
## master #109 +/- ##
==========================================
+ Coverage 55.54% 58.39% +2.84%
==========================================
Files 99 99
Lines 5914 6225 +311
==========================================
+ Hits 3285 3635 +350
+ Misses 2318 2229 -89
- Partials 311 361 +50
Continue to review full report at Codecov.
// If the pod is interruptible and is waiting to be scheduled for an extended amount of time, it is possible there are
// no spot instances available in the AZ. In this case, we time out with a system-level error and will retry on a
// non-spot-instance AZ.
if val, ok := pod.ObjectMeta.Labels["interruptible"]; ok {
Please move the "interruptible" string into a const.
done.
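For illustration, a minimal sketch of the requested change (the constant name matches the Labels[Interruptible] lookup in the later diff; where the constant actually lives in the package is an assumption):

```go
package main

import "fmt"

// Interruptible is the pod label key used to mark a task as interruptible,
// declared once instead of repeating the string literal at every lookup.
const Interruptible = "interruptible"

func main() {
	labels := map[string]string{Interruptible: "true"}
	// The real code would do this lookup on pod.ObjectMeta.Labels.
	if val, ok := labels[Interruptible]; ok {
		fmt.Println("interruptible =", val)
	}
}
```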
for _, c := range status.Conditions {
	switch c.Type {
	case v1.PodScheduled:
		if c.Status == v1.ConditionFalse {
			// If the pod is interruptible and is waiting to be scheduled for an extended amount of time, it is possible there are
I think ConditionFalse means this condition is no longer the "current" one, but it'll never be removed from the list of conditions. I do not think we should apply the timeout in this case...
Yep, took a look at the docs and it appears to mean that the condition is not applicable. Removed.
So what if we do not have any system-level timeouts? Can there be a -1 escape hatch?
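To make the escape-hatch idea concrete, here is a minimal sketch assuming a non-positive configured value means the system-level timeout is disabled (function and parameter names are illustrative, not the PR's code):

```go
package main

import (
	"fmt"
	"time"
)

// systemTimeoutExceeded treats a zero or negative configured timeout as
// "disabled", so operators can opt out of the system-level timeout entirely.
func systemTimeoutExceeded(configured, elapsed time.Duration) bool {
	if configured <= 0 {
		return false // escape hatch: no system-level timeout
	}
	return elapsed > configured
}

func main() {
	fmt.Println(systemTimeoutExceeded(0, 45*time.Minute))              // false: disabled
	fmt.Println(systemTimeoutExceeded(30*time.Minute, 45*time.Minute)) // true: timed out
}
```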
@@ -153,6 +180,17 @@ func DemystifyPending(status v1.PodStatus) (pluginsCore.PhaseInfo, error) {
finalMessage := fmt.Sprintf("%s|%s", c.Message, containerStatus.State.Waiting.Message)
switch reason {
case "ErrImagePull", "ContainerCreating", "PodInitializing":
Anand mentioned there are a couple more errors we should add here...
@anandswaminathan can you remind me?
Yes. We can follow up; it should be easy to add.
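The additional reasons were not named in this thread, so the following only illustrates the shape of the change; the case values beyond the original three are examples of Kubernetes container waiting reasons, not necessarily the ones Anand had in mind:

```go
package main

import "fmt"

// pendingWaitingReason reports whether a container waiting reason should still
// be treated as a pending/transient state. Illustrative only; the reasons
// beyond the original three are example Kubernetes waiting reasons.
func pendingWaitingReason(reason string) bool {
	switch reason {
	case "ErrImagePull", "ContainerCreating", "PodInitializing",
		"ImagePullBackOff", "CreateContainerConfigError":
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(pendingWaitingReason("ImagePullBackOff")) // true
	fmt.Println(pendingWaitingReason("OOMKilled"))        // false
}
```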
Co-authored-by: Haytham AbuelFutuh <[email protected]>
// If the pod is interruptible and is waiting to be scheduled for an extended amount of time, it is possible there are
// no spot instances available in the AZ. In this case, we time out with a system-level error and will retry on a
// non-spot-instance AZ.
if val, ok := pod.ObjectMeta.Labels[Interruptible]; ok {
Should we add a check here for c.Status == v1.ConditionTrue, or is Unschedulable a terminal condition that, once entered, the pod will never exit?
I found this comment for Unschedulable:
// PodReasonUnschedulable reason in PodScheduled PodCondition means that the scheduler
// can't schedule the pod right now, for example due to insufficient resources in the cluster
However, the status that is being checked is the ConditionStatus, and if it is ConditionTrue that appears to mean the resource is in that condition:
// These are valid condition statuses. "ConditionTrue" means a resource is in the condition.
// "ConditionFalse" means a resource is not in the condition. "ConditionUnknown" means kubernetes
// can't decide if a resource is in the condition or not. In the future, we could add other
// intermediate conditions, e.g. ConditionDegraded.
Hm, in that case wouldn't we want to check for ConditionTrue?
Added the check.
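For reference: in upstream core/v1, "Unschedulable" appears as the Reason on the PodScheduled condition, and that condition's Status is ConditionFalse while the scheduler cannot place the pod. A minimal, illustrative helper for detecting that state (not the PR's final code):

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// podUnschedulable reports whether the pod currently cannot be scheduled: the
// PodScheduled condition has Status ConditionFalse and Reason "Unschedulable"
// (v1.PodReasonUnschedulable). Illustrative helper only.
func podUnschedulable(status v1.PodStatus) bool {
	for _, c := range status.Conditions {
		if c.Type == v1.PodScheduled &&
			c.Status == v1.ConditionFalse &&
			c.Reason == v1.PodReasonUnschedulable {
			return true
		}
	}
	return false
}

func main() {
	status := v1.PodStatus{
		Conditions: []v1.PodCondition{{
			Type:   v1.PodScheduled,
			Status: v1.ConditionFalse,
			Reason: v1.PodReasonUnschedulable,
		}},
	}
	fmt.Println(podUnschedulable(status)) // true
}
```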
// To help mitigate the pod being stuck in this state we have a system-level timeout that will error out
// as a system error and retry launching the pod.
if timeout > 0 && time.Since(status.StartTime.Time) > timeout {
	return pluginsCore.PhaseInfoRetryableFailure(
cc @kumare3 as Haytham is out.
// non-spot-instance AZ.
if val, ok := pod.ObjectMeta.Labels["interruptible"]; ok {
	if val == "true" && k8s.GetConfig().MaxSystemLevelTimeout > 0 && k8s.GetConfig().MaxSystemLevelTimeout > elapsedtime.Minutes() {
		return pluginsCore.PhaseInfoFailed(c.LastTransitionTime.Time, &idlCore.ExecutionError_SYSTEM{
Why not use the system failure method?
// non-spot-instance AZ.
if val, ok := pod.ObjectMeta.Labels["interruptible"]; ok {
	if val == "true" && k8s.GetConfig().MaxSystemLevelTimeout > 0 && k8s.GetConfig().MaxSystemLevelTimeout > elapsedtime.Minutes() {
		return pluginsCore.PhaseInfoFailed(c.LastTransitionTime.Time, &idlCore.ExecutionError_SYSTEM{
Ohh, I see. You want to add the time here?
@@ -77,7 +77,7 @@ type K8sPluginConfig struct {
	// Flyte CoPilot Configuration
	CoPilot FlyteCoPilotConfig `json:"co-pilot" pflag:",Co-Pilot Configuration"`
	// Set system level timeout. If timeout reached pod will be relaunched.
-	MaxSystemLevelTimeout config2.Duration `json:"maxSystemLevelTimeout" pflag:"-,Value to be used for system level timeouts in minutes."`
+	MaxSystemLevelTimeout config2.Duration `json:"max-system-level-timeout" pflag:"-,Value to be used for system level timeouts in minutes."`
Can we improve the description? I do not actually follow it.
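As one possible clarification of the description, a sketch only: the real field lives in K8sPluginConfig and uses config2.Duration, with time.Duration standing in here so the example compiles on its own.

```go
package main

import (
	"fmt"
	"time"
)

// pluginConfig sketches a clearer doc comment for the field discussed above.
type pluginConfig struct {
	// MaxSystemLevelTimeout is how long a pod may remain pending (for example,
	// while waiting for spot capacity) before it is failed with a system-level
	// error and relaunched, typically on a non-interruptible node group.
	// A zero value disables the timeout.
	MaxSystemLevelTimeout time.Duration `json:"max-system-level-timeout"`
}

func main() {
	cfg := pluginConfig{MaxSystemLevelTimeout: 30 * time.Minute}
	fmt.Println(cfg.MaxSystemLevelTimeout) // 30m0s
}
```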
@migueltol22 are we reviving this?
TL;DR
Handles system-level timeouts in two cases: PodQueued and PhaseInitializing.
PodQueued applies to interruptible tasks: if a pod is stuck in Pending for longer than maxSystemLevelTimeout, we can assume there are not enough resources in the ASG, so we error out with a system-level error and retry on the non-interruptible ASG.
PhaseInitializing covers the case where some sort of system-level error leaves the pod stuck initializing; we error out with a system-level error and retry launching the pod.
Tracking Issue
https://github.com/lyft/flyte/issues/
Follow-up issue
flyteorg/flyte#382