client/driver: docker progress detection and monitoring #4192
Conversation
client/driver/docker.go
Outdated
@@ -1546,7 +1550,8 @@ func (d *DockerDriver) pullImage(driverConfig *DockerDriverConfig, client *docke
	d.emitEvent("Downloading image %s:%s", repo, tag)
	coordinator, callerID := d.getDockerCoordinator(client)
	return coordinator.PullImage(driverConfig.ImageName, authOptions, callerID)
	return coordinator.PullImage(driverConfig.ImageName, authOptions, time.Duration(driverConfig.ImagePullTimeout)*time.Second, callerID, d.emitEvent)
The casting in the return args is ugly and hard to read, making a note to self to fix this.
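One way to tidy that cast is to do the conversion once into a local before the call (a sketch against the diff above; the pullTimeout local is hypothetical):

	// Convert the configured seconds to a time.Duration up front so the
	// call site stays readable.
	pullTimeout := time.Duration(driverConfig.ImagePullTimeout) * time.Second

	d.emitEvent("Downloading image %s:%s", repo, tag)
	coordinator, callerID := d.getDockerCoordinator(client)
	return coordinator.PullImage(driverConfig.ImageName, authOptions, pullTimeout, callerID, d.emitEvent)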
client/driver/docker.go
Outdated
@@ -234,6 +234,7 @@ type DockerDriverConfig struct {
	ReadonlyRootfs       bool  `mapstructure:"readonly_rootfs"`        // Mount the container’s root filesystem as read only
	AdvertiseIPv6Address bool  `mapstructure:"advertise_ipv6_address"` // Flag to use the GlobalIPv6Address from the container as the detected IP
	CPUHardLimit         bool  `mapstructure:"cpu_hard_limit"`         // Enforce CPU hard limit.
	ImagePullTimeout     int64 `mapstructure:"image_pull_timeout"`     // Timeout on the image pull after which the pull is cancelled
Might be nice to allow specifying this as a duration. You'd have to add an ImagePullTimeoutRaw string field and parse it into time.Duration in DockerDriverConfig.Validate, so it might not be worth it... Either way.
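A minimal sketch of what that could look like in Validate, assuming a hypothetical ImagePullTimeoutRaw string field and storing the parsed value as a time.Duration:

	// In DockerDriverConfig.Validate: accept values like "90s" or "5m".
	if c.ImagePullTimeoutRaw != "" {
		d, err := time.ParseDuration(c.ImagePullTimeoutRaw)
		if err != nil {
			return fmt.Errorf("invalid image_pull_timeout %q: %v", c.ImagePullTimeoutRaw, err)
		}
		c.ImagePullTimeout = d
	}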
client/driver/docker_coordinator.go
Outdated
	if pullTimeout > 0 {
		ctx, cancel = context.WithDeadline(context.Background(), time.Now().Add(pullTimeout))
	}
	defer cancel()
Put this above the if block so it cancels the first context and propagates to the child deadline context (if one exists). I don't think leaking a cancellable context will cause a memory leak, but better to be careful than rely on internals.
client/driver/docker_progress.go
Outdated
func newImageProgressManager(
	image string, cancel context.CancelFunc,
	inactivityFunc, reporter progressReporterFunc) *imageProgressManager {
	return &imageProgressManager{
put an empty line between multiline func definitions and the first line of their body
client/driver/docker_coordinator.go
Outdated
func (d *dockerCoordinator) handlePullProgressReport(image, msg string, timestamp, pullStart time.Time) {
	d.logger.Printf("[DEBUG] driver.docker: image %s pull progress: %s", image, msg)

	if timestamp.Sub(pullStart) > 2*time.Minute {
Put 2*time.Minute in a package-level const with documentation.
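For example, something along these lines (the name matches the constant that shows up later in the PR; the doc comment is illustrative):

	// dockerPullProgressEmitInterval is how long an image pull must run
	// before progress is emitted as a task event in addition to debug logs.
	const dockerPullProgressEmitInterval = 2 * time.Minute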
client/driver/docker_progress.go
Outdated
	}
}

type imageProgress struct {
We try to comment internal structs as if they were exported.
client/driver/docker_progress.go
Outdated
	return pm
}

func (pm *imageProgressManager) withReportInterval(t time.Duration) *imageProgressManager {
I don't see where this or the above method are used. Since they're not safe for concurrent use with start(), let's not leave them lying around if they're not needed.
client/driver/docker_progress.go
Outdated
	for {
		line, err := pm.buf.ReadBytes('\n')
		if err == io.EOF {
			pm.buf.Write(line)
Maybe a comment here like:
// Partial write of a line; break until a full line is written
			return n, err
		}

		if msg.Error != nil {
A comment as to what these errors mean would be useful.
✔️
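Pulling both suggestions together, the Write method could read roughly like this (a sketch; pullMessage and imageProgress.set are stand-ins for the PR's actual types, and the assumed imports are bytes, encoding/json, fmt, and io):

	// pullMessage is a stand-in for the JSON progress messages the Docker
	// engine streams during a pull, e.g.
	// {"status":"Downloading","progressDetail":{...},"id":"c73ab1c6897b"}.
	type pullMessage struct {
		Status string `json:"status"`
		ID     string `json:"id"`
		Error  *struct {
			Message string `json:"message"`
		} `json:"errorDetail"`
	}

	func (pm *imageProgressManager) Write(p []byte) (n int, err error) {
		n, err = pm.buf.Write(p)
		if err != nil {
			return n, err
		}

		for {
			line, err := pm.buf.ReadBytes('\n')
			if err == io.EOF {
				// Partial write of a line; push the bytes back and wait
				// until a full line has been written before decoding.
				pm.buf.Write(line)
				return n, nil
			}
			if err != nil {
				return n, err
			}

			var msg pullMessage
			if err := json.Unmarshal(line, &msg); err != nil {
				return n, err
			}
			if msg.Error != nil {
				// The daemon reported a pull error (e.g. manifest not
				// found, unauthorized); surface it to the caller.
				return n, fmt.Errorf("error pulling image: %s", msg.Error.Message)
			}
			pm.imageProgress.set(&msg)
		}
	}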
{"status":"Pulling fs layer","progressDetail":{},"id":"c73ab1c6897b"} | ||
{"status":"Pulling fs layer","progressDetail":{},"id":"1ab373b3deae"} | ||
`)) | ||
assert.NoError(t, err) |
We generally use testify/require instead of testify/assert, as assert doesn't fail immediately. It's no big deal, but in cases like this where you're checking for errors, the very next line is likely to panic and make for confusing test output if there is an error.
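The difference in practice (the follow-up lines are hypothetical code that would panic or fail noisily if the pull state were never populated):

	// assert records the failure but keeps executing, so a later
	// dereference can panic and bury the real error message.
	assert.NoError(t, err)

	// require stops the test immediately on failure, so anything after
	// this line can rely on err being nil.
	require.NoError(t, err)
	msg, _ := pm.imageProgress.get()
	require.Contains(t, msg, "Pulled")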
client/driver/docker_progress.go
Outdated
type progressReporterFunc func(image string, msg string, timestamp time.Time, pullStart time.Time)

type imageProgressManager struct {
	*imageProgress
Not sure if I'm a fan of this embedding. It causes imageProgressManager to have RWMutex's methods despite those methods only guarding fields on imageProgress. Perhaps just give it a named field to make it clear, for uses in imageProgressManager, which struct "owns" which fields.
Not a big deal. I do like the overall approach of bundling up the fields that need to be guarded by a mutex into a little oneshot struct.
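A minimal sketch of the named-field alternative (field names are illustrative; only the shape matters):

	// imageProgress bundles the fields that its mutex guards.
	type imageProgress struct {
		sync.RWMutex
		pullStart   time.Time
		lastMessage string
		timestamp   time.Time
	}

	// imageProgressManager owns the ticker/goroutine machinery. Callers go
	// through the named progress field, which makes it obvious which struct
	// "owns" (and locks) which fields.
	type imageProgressManager struct {
		progress         *imageProgress // named, not embedded
		image            string
		cancel           context.CancelFunc
		buf              bytes.Buffer
		reportInterval   time.Duration
		activityDeadline time.Duration
	}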
client/driver/docker_coordinator.go
Outdated
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if pullTimeout > 0 {
		ctx, cancel = context.WithDeadline(context.Background(), time.Now().Add(pullTimeout))
I tricked you into introducing a bug. This creates a new context from Background(), not ctx, so we're back to leaking a context again 😬
I think just change Background() -> ctx here.
Doh!
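For reference, the corrected shape looks roughly like this (a sketch; variable names follow the snippet above):

	// Create a cancellable parent first and defer its cancel; cancelling
	// the parent also releases any context derived from it.
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	if pullTimeout > 0 {
		// Derive the deadline from ctx, not context.Background(), so the
		// deferred cancel above applies to it as well.
		ctx, cancel = context.WithDeadline(ctx, time.Now().Add(pullTimeout))
		defer cancel()
	}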
client/driver/docker.go
Outdated
@@ -303,6 +305,12 @@ func (c *DockerDriverConfig) Validate() error {
		return err
	}
	c.Ulimit = ulimit
	if len(c.ImagePullTimeoutRaw) > 0 {
Can we introduce a min timeout, something like 1m?
	// Parse the repo and tag
	repo, tag := docker.ParseRepositoryTag(image)
	if tag == "" {
		tag = "latest"
	}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
This wouldn't cancel the context created below
@schmichael convinced me that canceling a parent context will cancel any child contexts.
Fixed this when I removed the pull deadline.
client/driver/docker_coordinator.go
Outdated
	d.logger.Printf("[DEBUG] driver.docker: image %s pull progress: %s", image, msg)

	if timestamp.Sub(pullStart) > dockerPullProgressEmitInterval {
		d.emitEvent(image, "Docker image %s pull progress: %s", image, msg)
This will emit an event every 10 seconds after the initial 2 minutes? I think we only want to emit at most once every two minutes.
client/driver/docker_coordinator.go
Outdated
}

func (d *dockerCoordinator) handlePullInactivity(image, msg string, timestamp, pullStart time.Time) {
	d.logger.Printf("[ERR] driver.docker: image %s pull aborted due to inactivity, last message recevieved at [%s]: %s", image, timestamp.String(), msg)
Should this not emit an event?
Emit and log or just emit?
client/driver/docker_progress.go
Outdated
		select {
		case <-ticker.C:
			msg, timestamp := pm.imageProgress.get()
			if time.Now().Sub(timestamp) > pm.activityDeadline {
Do we want it to be based purely on the time? What if set was called without any additional bytes pulled?
I think that's ok. If set is called from Write then a full JSON message was sent and decoded. This includes extracting, verifying signatures, etc., which is still activity.
client/driver/docker_progress.go
Outdated
// start intiates the ticker to trigger the inactivity and reporter handlers
func (pm *imageProgressManager) start() {
	pm.imageProgress.pullStart = time.Now()
	go func() {
Any thoughts on having newImageProgressManager call go pm.start() and then the interface for callers just becomes new and stop?
The only possible hiccup I see is that there are some fields that we don't set in newImageProgressManager. They are currently set to defaults, e.g. reportInterval and activityDeadline, but could be overridden after the new call and before start. If we ever want to set these I suppose we just add the args to the new call.
✔️ Added a call to pm.start() in the newImageProgressManager method.
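Roughly, the constructor then becomes the whole caller-facing surface (a sketch; the default constants and field names are stand-ins for the PR's actual ones):

	func newImageProgressManager(image string, cancel context.CancelFunc,
		inactivityFunc, reporter progressReporterFunc) *imageProgressManager {

		pm := &imageProgressManager{
			image:            image,
			cancel:           cancel,
			inactivityFunc:   inactivityFunc,
			reporter:         reporter,
			reportInterval:   defaultImageProgressReportInterval,
			activityDeadline: defaultImageProgressActivityDeadline,
			imageProgress:    &imageProgress{timestamp: time.Now()},
		}

		// Start the reporting goroutine here so callers only ever need
		// new + stop.
		pm.start()

		return pm
	}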
Force-pushed from b6d2b75 to 9539687 (Compare)
@schmichael could you give this one more look when you have a moment?
Force-pushed from 9539687 to a58a914 (Compare)
client/driver/docker_coordinator.go
Outdated
@@ -98,6 +104,12 @@ type dockerCoordinator struct {
	// only have one request be sent to Docker
	pullFutures map[string]*pullFuture

	// pullLoggers is used to track the LogEventFn for each alloc pulling an image
Worth adding a comment here about why it needs a slice of pull loggers if the map is keyed by docker image id.
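For instance (the comment wording is a suggestion; the lock field is hypothetical):

	// pullLoggers is used to track the LogEventFn for each alloc pulling an
	// image. It is keyed by image ID, and because several allocs may pull
	// the same image concurrently, each key holds a slice of loggers rather
	// than a single one.
	pullLoggers map[string][]LogEventFn

	// pullLoggerLock guards pullLoggers.
	pullLoggerLock sync.RWMutex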
client/driver/docker_coordinator.go
Outdated
func (d *dockerCoordinator) handlePullInactivity(image, msg string, timestamp, pullStart time.Time, interval int64) {
	d.logger.Printf("[ERR] driver.docker: image %s pull aborted due to inactivity, last message recevieved at [%s]: %s", image, timestamp.String(), msg)

remove empty line
client/driver/docker_coordinator.go
Outdated
func (d *dockerCoordinator) handlePullProgressReport(image, msg string, timestamp, pullStart time.Time, interval int64) {
	d.logger.Printf("[DEBUG] driver.docker: image %s pull progress: %s", image, msg)

	if interval%int64(dockerPullProgressEmitInterval.Seconds()/dockerImageProgressReportInterval.Seconds()) == 0 {
minor - any reason not to handle the "should emit" logic in the caller? This method doesn't need to know what the interval value is or do the math with the two progress report intervals; the caller can just tell it whether it needs to emit the event in addition to logging.
Good question. I think I tried to keep the imageProgressManager scoped as small as possible to start, so it currently doesn't make a distinction about whether to emit or not, just that it needs to fire a progressReporterFunc when it detects inactivity and at each reporting interval. I don't love the way this handlePullProgressReport turned out though.
I'll go back and play with it for a few minutes and see what moving the "should emit" logic into the manager looks like.
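One way the time-based throttle could look once the "should emit" decision lives with the coordinator (a sketch; lastEmit is a hypothetical field, and emitting at most once per interval is the point):

	func (d *dockerCoordinator) handlePullProgressReport(image, msg string, pullStart time.Time) {
		d.logger.Printf("[DEBUG] driver.docker: image %s pull progress: %s", image, msg)

		now := time.Now()
		// Only emit a task event once the pull has been running longer than
		// the emit interval, and then at most once per interval.
		if now.Sub(pullStart) > dockerPullProgressEmitInterval &&
			now.Sub(d.lastEmit) > dockerPullProgressEmitInterval {
			d.emitEvent(image, "Docker image %s pull progress: %s", image, msg)
			d.lastEmit = now
		}
	}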
client/driver/docker_progress.go
Outdated
}

// set takes a status message received from the docker engine api during an image
// pull and updates the status of the coorisponding layer
typo - corresponding
case "Extracting": | ||
return layerProgressStatusExtracting | ||
case "Pull complete": | ||
return layerProgressStatusComplete |
This one appears to be defined but not used, is that intentional?
Some statuses aren't referenced directly but are used in the comparison on line 104. I could condense them all down into the same int, but I figured it would be best to enumerate all of them.
case "Pulling fs layer": | ||
return layerProgressStatusStarting | ||
case "Waiting": | ||
return layerProgressStatusWaiting |
This one is also unused. Did we want to check if the status is layerProgressWaiting in the get() method below and return a nicer string message for it?
I went ahead and enumerated each status even though there may be a couple that don't get used. I didn't want to make the status string too long; you can already derive the number of layers waiting (total - pulled - pulling).
If I add waiting it could look something like:
2018/04/24 12:47:12.076066 [DEBUG] driver.docker: image docker.elastic.co/elasticsearch/elasticsearch:6.2.4 pull progress: Pulled 1/6 (106.7 MiB/227.3 MiB) pulling 3 waiting on 2 - est 11.2s remaining
Ok I added the waiting status to the msg and it now looks like:
2018/05/03 15:18:43.299862 [DEBUG] driver.docker: image docker.elastic.co/elasticsearch/elasticsearch:6.2.4 pull progress: Pulled 1/6 (57.65 MiB/227.3 MiB) layers: 2 waiting/3 pulling - est 29.3s remaining
2018/05/03 15:18:53.299760 [DEBUG] driver.docker: image docker.elastic.co/elasticsearch/elasticsearch:6.2.4 pull progress: Pulled 1/6 (115.2 MiB/227.3 MiB) layers: 2 waiting/3 pulling - est 19.3s remaining
2018/05/03 15:19:03.299919 [DEBUG] driver.docker: image docker.elastic.co/elasticsearch/elasticsearch:6.2.4 pull progress: Pulled 4/6 (173 MiB/227.3 MiB) layers: 0 waiting/2 pulling - est 9.3s remaining
2018/05/03 15:19:13.299778 [DEBUG] driver.docker: image docker.elastic.co/elasticsearch/elasticsearch:6.2.4 pull progress: Pulled 5/6 (223.3 MiB/227.3 MiB) layers: 0 waiting/1 pulling - est 0.5s remaining
2018/05/03 15:19:16.757129 [DEBUG] driver.docker: docker pull docker.elastic.co/elasticsearch/elasticsearch:6.2.4 succeeded
client/driver/docker_progress.go
Outdated
	return b
}

// progressReporterFunc defines the method for handeling inactivity and report
handling
…d to unexpected timeouts
If two jobs are pulling the same image simultaneously, whichever starts the pull first will set the pull timeout. This can lead to a poor UX where the first job requested a short timeout while the second job requested a longer timeout, causing the pull to potentially time out much sooner than expected by the second job.
Force-pushed from bef2845 to d442a44 (Compare)
This came out great! The reporting intervals became quite clean! Good job!
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
This PR aims to add the following enhancements to the docker driver:
- Pull timeout, configured in the jobspec as image_pull_timeout, which will abort the pull if it does not complete in the configured duration (Removed from this PR, see b6d2b75)

Example log: