Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

client/driver: docker progress detection and monitoring #4192

Merged
merged 7 commits into from
May 7, 2018

Conversation

nickethier
Copy link
Member

@nickethier nickethier commented Apr 20, 2018

This PR aims to add the following enhancements to the docker driver:

  • Inactivity timeout which aborts a pull of no progress has been made during a defined period of time
  • Pull timeout, configured in the jobspec as image_pull_timeout which will abort the pull if it does not complete in the configured duration Removed from this PR see b6d2b75
  • Image pull progress logging to client stdout
    • Multiplexed to all allocs pulling the same image.
  • Image pull progress logged to alloc if the pull lasts longer than 2 mins (and subsequent msgs ever 2 mins there after)

Example log:

2018/04/24 12:47:12.076066 [DEBUG] driver.docker: image docker.elastic.co/elasticsearch/elasticsearch:6.2.4 pull progress: Pulled 1/6 (106.7 MiB/227.3 MiB) pulling 3 layers - est 11.2s remaining
2018/04/24 12:47:22.076012 [DEBUG] driver.docker: image docker.elastic.co/elasticsearch/elasticsearch:6.2.4 pull progress: Pulled 5/6 (191.4 MiB/227.3 MiB) pulling 1 layers - est 3.6s remaining
2018/04/24 12:47:32.076002 [DEBUG] driver.docker: image docker.elastic.co/elasticsearch/elasticsearch:6.2.4 pull progress: Pulled 6/6 (227.3 MiB/227.3 MiB) pulling 0 layers - est 0.1s remaining
2018/04/24 12:47:33.394116 [DEBUG] driver.docker: docker pull docker.elastic.co/elasticsearch/elasticsearch:6.2.4 succeeded

@nickethier nickethier changed the title [WIP] client/driver: docker progress detection and monitoring client/driver: docker progress detection and monitoring Apr 23, 2018
@@ -1546,7 +1550,8 @@ func (d *DockerDriver) pullImage(driverConfig *DockerDriverConfig, client *docke

d.emitEvent("Downloading image %s:%s", repo, tag)
coordinator, callerID := d.getDockerCoordinator(client)
return coordinator.PullImage(driverConfig.ImageName, authOptions, callerID)

return coordinator.PullImage(driverConfig.ImageName, authOptions, time.Duration(driverConfig.ImagePullTimeout)*time.Second, callerID, d.emitEvent)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The casting in the return args is ugly and hard to read, making a note to self to fix this.

@@ -234,6 +234,7 @@ type DockerDriverConfig struct {
ReadonlyRootfs bool `mapstructure:"readonly_rootfs"` // Mount the container’s root filesystem as read only
AdvertiseIPv6Address bool `mapstructure:"advertise_ipv6_address"` // Flag to use the GlobalIPv6Address from the container as the detected IP
CPUHardLimit bool `mapstructure:"cpu_hard_limit"` // Enforce CPU hard limit.
ImagePullTimeout int64 `mapstructure:"image_pull_timeout"` // Timeout on the image pull after which the pull is cancelled
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be nice to allow specifying this as a duration. You'd have to add a ImagePullTimeoutRaw string field and parse it into time.Duration in DockerDriverConfig.Validate, so might not be worth it... Either way.

if pullTimeout > 0 {
ctx, cancel = context.WithDeadline(context.Background(), time.Now().Add(pullTimeout))
}
defer cancel()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put this above the if block so it cancels the first context and propagates to the child deadline context (if one exists). I don't think leaking a cancellable context will cause a memory leak, but better to be careful than rely on internals.

func newImageProgressManager(
image string, cancel context.CancelFunc,
inactivityFunc, reporter progressReporterFunc) *imageProgressManager {
return &imageProgressManager{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put an empty line between multiline func definitions and the first line of their body

func (d *dockerCoordinator) handlePullProgressReport(image, msg string, timestamp, pullStart time.Time) {
d.logger.Printf("[DEBUG] driver.docker: image %s pull progress: %s", image, msg)

if timestamp.Sub(pullStart) > 2*time.Minute {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put 2*minute in a package level const with documentation.

}
}

type imageProgress struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We try to comment internal structs as if they were exported.

return pm
}

func (pm *imageProgressManager) withReportInterval(t time.Duration) *imageProgressManager {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see where this or the above method are used. Since they're not safe for concurrent use with start() let's not leave them lying around if they're not needed.

for {
line, err := pm.buf.ReadBytes('\n')
if err == io.EOF {
pm.buf.Write(line)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a comment here like:

// Partial write of a line; break until a full line is written

return n, err
}

if msg.Error != nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment as to what these errors mean would be useful.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✔️

{"status":"Pulling fs layer","progressDetail":{},"id":"c73ab1c6897b"}
{"status":"Pulling fs layer","progressDetail":{},"id":"1ab373b3deae"}
`))
assert.NoError(t, err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We generally use testify/require instead of testify/assert as assert doesn't fail immediately. It's no big deal but in cases like this where you're checking for errors, the very next line is likely to panic and make for confusing test output if there is an error.

type progressReporterFunc func(image string, msg string, timestamp time.Time, pullStart time.Time)

type imageProgressManager struct {
*imageProgress
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I'm a fan of this embedding. It causes imageProgressManager to have RWMutex's methods despite those methods only guarding fields on imageProgress. Perhaps just give it a named field to make it clear for uses in imageProgressManager which structs "owns" which fields.

Not a big deal. I do like the overall approach of bundling up the fields that need guarded by a mutex with a little oneshot struct.

ctx, cancel := context.WithCancel(context.Background())
defer cancel()
if pullTimeout > 0 {
ctx, cancel = context.WithDeadline(context.Background(), time.Now().Add(pullTimeout))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tricked you into introducing a bug. This creates a new context from Background(), not ctx, so we're back to leaking a context again 😬

I think just change Background() -> ctx here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doh!

@@ -303,6 +305,12 @@ func (c *DockerDriverConfig) Validate() error {
return err
}
c.Ulimit = ulimit
if len(c.ImagePullTimeoutRaw) > 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we introduce a min timeout, something like 1m

// Parse the repo and tag
repo, tag := docker.ParseRepositoryTag(image)
if tag == "" {
tag = "latest"
}
ctx, cancel := context.WithCancel(context.Background())
defer cancel()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This wouldn't cancel the context created below

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@schmichael convinced me that canceling a parent context will cancel any child contexts.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this when I removed the pull deadline

d.logger.Printf("[DEBUG] driver.docker: image %s pull progress: %s", image, msg)

if timestamp.Sub(pullStart) > dockerPullProgressEmitInterval {
d.emitEvent(image, "Docker image %s pull progress: %s", image, msg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will emit an event every 10 seconds after the initial 2 minutes? I think we only want to emit at max once every two minutes

}

func (d *dockerCoordinator) handlePullInactivity(image, msg string, timestamp, pullStart time.Time) {
d.logger.Printf("[ERR] driver.docker: image %s pull aborted due to inactivity, last message recevieved at [%s]: %s", image, timestamp.String(), msg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this not emit an event?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Emit and log or just emit?

select {
case <-ticker.C:
msg, timestamp := pm.imageProgress.get()
if time.Now().Sub(timestamp) > pm.activityDeadline {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want it to be based purely on the time? What if set was called without any additional bytes pulled?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think thats ok. If set is called from Write then a full JSON message was sent and decoded. This includes extracting, verifying signatures etc which is still activity.

// start intiates the ticker to trigger the inactivity and reporter handlers
func (pm *imageProgressManager) start() {
pm.imageProgress.pullStart = time.Now()
go func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any thoughts on having newImageProgressManager call go pm.start() and then the interface for callers just becomes new and stop?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only possible hiccup I see is that there are some fields that we don't set in the newImageProgressManager. They are currently set to default, ex: reportInterval, activityDeadline, but could be overridden after the new call and before start. If we ever want to set these I suppose we just add the args to the new call.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✔️ Added a call to pm.start() in the newImageProgressManager method.

@nickethier nickethier force-pushed the f-docker-progress-detection branch from b6d2b75 to 9539687 Compare April 26, 2018 00:40
@nickethier
Copy link
Member Author

@schmichael could you give this one more look when you have a moment

@nickethier nickethier force-pushed the f-docker-progress-detection branch from 9539687 to a58a914 Compare May 1, 2018 03:16
@@ -98,6 +104,12 @@ type dockerCoordinator struct {
// only have one request be sent to Docker
pullFutures map[string]*pullFuture

// pullLoggers is used to track the LogEventFn for each alloc pulling an image
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth adding a comment here about why it needs a slice of pull loggers if the map is keyed by docker image id.


func (d *dockerCoordinator) handlePullInactivity(image, msg string, timestamp, pullStart time.Time, interval int64) {
d.logger.Printf("[ERR] driver.docker: image %s pull aborted due to inactivity, last message recevieved at [%s]: %s", image, timestamp.String(), msg)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove empty line

func (d *dockerCoordinator) handlePullProgressReport(image, msg string, timestamp, pullStart time.Time, interval int64) {
d.logger.Printf("[DEBUG] driver.docker: image %s pull progress: %s", image, msg)

if interval%int64(dockerPullProgressEmitInterval.Seconds()/dockerImageProgressReportInterval.Seconds()) == 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor - any reason not to handle the "should emit" logic in the caller? This method doesn't need to know what the interval value is or do the math with the two progress report intervals, the caller can just tell it whether it needs to emit the event in addition to logging.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. I think I tried to keep the imageProgressManager scoped as small as possible to start. So it currently doesn't make a distinction of to emit or not, just that it needs to fire a progressReporterFunc when it detects inactivity and at each reporting interval. I don't love the way this handlePullProgressReport turned out though.

I'll go back and play with it for a few mins and see what moving the "should emit" logic into the manager looks like.

}

// set takes a status message received from the docker engine api during an image
// pull and updates the status of the coorisponding layer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo - corresponding

case "Extracting":
return layerProgressStatusExtracting
case "Pull complete":
return layerProgressStatusComplete
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one appears to be defined but not used, is that intentional?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some statuses aren't referenced directly but are used in the comparison on line 104. I could condense them all down into the same int, but I figured it would be best to enumerate all of them.

case "Pulling fs layer":
return layerProgressStatusStarting
case "Waiting":
return layerProgressStatusWaiting
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is also unused. Did we want to check if the status is layerProgressWaiting in the get() method below and return a nicer string message for it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead an enumerated each status even though there may be a couple that don't get used. I didn't want to make the status string too long, you can already derive the number of layers waiting (total - pulled - pulling).

If I add waiting it could look something like:

2018/04/24 12:47:12.076066 [DEBUG] driver.docker: image docker.elastic.co/elasticsearch/elasticsearch:6.2.4 pull progress: Pulled 1/6 (106.7 MiB/227.3 MiB) pulling 3 waiting on 2 - est 11.2s remaining

Copy link
Member Author

@nickethier nickethier May 3, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I added the waiting status to the msg and it now looks like:

2018/05/03 15:18:43.299862 [DEBUG] driver.docker: image docker.elastic.co/elasticsearch/elasticsearch:6.2.4 pull progress: Pulled 1/6 (57.65 MiB/227.3 MiB) layers: 2 waiting/3 pulling - est 29.3s remaining
2018/05/03 15:18:53.299760 [DEBUG] driver.docker: image docker.elastic.co/elasticsearch/elasticsearch:6.2.4 pull progress: Pulled 1/6 (115.2 MiB/227.3 MiB) layers: 2 waiting/3 pulling - est 19.3s remaining
2018/05/03 15:19:03.299919 [DEBUG] driver.docker: image docker.elastic.co/elasticsearch/elasticsearch:6.2.4 pull progress: Pulled 4/6 (173 MiB/227.3 MiB) layers: 0 waiting/2 pulling - est 9.3s remaining
2018/05/03 15:19:13.299778 [DEBUG] driver.docker: image docker.elastic.co/elasticsearch/elasticsearch:6.2.4 pull progress: Pulled 5/6 (223.3 MiB/227.3 MiB) layers: 0 waiting/1 pulling - est 0.5s remaining
2018/05/03 15:19:16.757129 [DEBUG] driver.docker: docker pull docker.elastic.co/elasticsearch/elasticsearch:6.2.4 succeeded

return b
}

// progressReporterFunc defines the method for handeling inactivity and report
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handling

nickethier added 7 commits May 7, 2018 12:17
…d to unexpected timeouts

If two jobs are pulling the same image simultaneously, which ever starts the pull first will set the pull timeout.
This can lead to a poor UX where the first job requested a short timeout while the second job requested a longer timeout
causing the pull to potentially timeout much sooner than expected by the second job.
@nickethier nickethier force-pushed the f-docker-progress-detection branch from bef2845 to d442a44 Compare May 7, 2018 16:21
@nickethier nickethier merged commit 006c6f3 into master May 7, 2018
@nickethier nickethier deleted the f-docker-progress-detection branch May 7, 2018 20:15
Copy link
Contributor

@dadgar dadgar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This came out great! The reporting intervals became quite clean! Good job!

@github-actions
Copy link

github-actions bot commented Mar 5, 2023

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 5, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants