
Add daemon to pull kernel images to each node #641

Merged - 2 commits merged into jupyter-server:master on May 7, 2019

Conversation

@kevin-bates (Member)

This adds support for a Kernel Image Puller image, meant to be run in a DaemonSet, that will periodically gather the set of configured kernel image names from Enterprise Gateway and pull each image to the node on which it is running. Since it's configured as a DaemonSet, it will run on each node of the cluster (and start on newly added nodes). This idea was discussed in #638.

The configurable items are all established via environment variables, since that is the natural way to parameterize container configurations. These include:

  • KIP_INTERVAL: the number of seconds to wait between pull iterations. Default is 300 (every 5 minutes).
  • KIP_NUM_PULLERS: the number of puller threads active at any given time, operating from a shared queue. Default is 2.
  • KIP_NUM_RETRIES: the number of retries to perform when a pull fails; image-not-found errors are the exception and are not retried. Default is 3.
  • KIP_GATEWAY_HOST: the host (and path) that precedes the kernelspecs endpoint, used to build the fetch URL shown below.

import requests

def fetch_kernelspecs(gateway_host):
    # Ask Enterprise Gateway for the current set of configured kernelspecs.
    end_point = '{}/api/kernelspecs'.format(gateway_host)
    logger.info("Fetching kernelspecs from '{}' ...".format(end_point))
    response = requests.get(end_point)
    return response.json()
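
For a sense of how those settings fit together, here is a minimal sketch of the puller loop (pull_image and image_names are illustrative placeholders, not the actual KIP source):

import os
import queue
import threading
import time

interval = int(os.getenv('KIP_INTERVAL', '300'))
num_pullers = int(os.getenv('KIP_NUM_PULLERS', '2'))
num_retries = int(os.getenv('KIP_NUM_RETRIES', '3'))
gateway_host = os.getenv('KIP_GATEWAY_HOST', 'http://localhost:8888')

name_queue = queue.Queue()

def puller():
    # Pull image names from the shared queue, retrying transient failures.
    while True:
        image_name = name_queue.get()
        for attempt in range(1, num_retries + 1):
            if pull_image(image_name):  # hypothetical docker-client wrapper
                break
        name_queue.task_done()

for _ in range(num_pullers):
    threading.Thread(target=puller, daemon=True).start()

while True:
    for image_name in image_names(fetch_kernelspecs(gateway_host)):  # image_names() extracts names from the specs
        name_queue.put(image_name)
    name_queue.join()     # wait for this iteration's pulls to complete
    time.sleep(interval)  # KIP_INTERVAL seconds between iterations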
A contributor left a review comment:

When the server is unhealthy, this will raise a ValueError (from parsing a non-JSON body). To make the response error clear, I think it is better to raise an explicit error such as HTTPError:

if not response.ok:
    raise requests.exceptions.HTTPError('Server response: {}'.format(response.status_code))
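
(For what it's worth, requests also provides response.raise_for_status(), which raises an HTTPError for 4xx/5xx responses and would accomplish much the same thing.)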

@kevin-bates (Member Author)

I thought of a couple other things in my sleep.

  1. The behavior here is essentially an Always pull policy, since the pull is attempted on each iteration. Since the default Kubernetes pull policy is IfNotPresent, I think we should remember the images that have been pulled (a simple set) and add a parameter KIP_PULL_POLICY that, when set to IfNotPresent, only pulls images it hasn't already pulled (see the sketch after this list).
  2. We should still try to come up with a solution based on notifications rather than polling from each node, since I'm a little concerned about scalability (100 nodes all retrieving kernelspecs every N seconds). That said, I'm not sure how to detect new images without some kind of minimal polling or a radical change (like detecting new kernelspecs). I suppose we could move the polling into EG (hooked into the tornado loop, similar to how idle watching works), where it gathers the image names, determines which haven't been seen before, and triggers some kind of event for new entries. The DaemonSet would then merely monitor that event, which triggers its current behavior. However, I think this approach would be limited to an IfNotPresent policy. Is that good enough, though? Yes, Always pods will incur slowdowns, but only when the actual image has changed. OTOH, admins could adopt behaviors like "hold off using a given kernel until the cluster has had a chance to pull".
  3. Truncate the pull timing values to 3 decimal places (ms).
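
A rough sketch of the bookkeeping described in item 1 (maybe_pull and pull_image are illustrative, not actual KIP code):

import os

pull_policy = os.getenv('KIP_PULL_POLICY', 'IfNotPresent')
pulled_images = set()

def maybe_pull(image_name):
    # Skip images we've already pulled unless the policy is 'Always'.
    if pull_policy == 'IfNotPresent' and image_name in pulled_images:
        return
    if pull_image(image_name):  # hypothetical docker-client wrapper
        pulled_images.add(image_name)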

@esevan (Contributor) commented Apr 17, 2019

I suppose we could move the polling into EG

Hmm, this is just my opinion: I think it makes sense to handle "detecting kernelspec changes" in EG (event-driven), while "detecting image changes" (periodic) and "detecting new nodes" (event-driven) belong in KIP. I'm still not sure an "image change" event can be detected without some periodic watching.

"hold off using a given kernel until the cluster has had a chance to pull"

Isn't this good for avoiding the "timeout" issue? It looks like a "rolling upgrade" / "zero-downtime" upgrade pattern.

@witten (Contributor) commented Apr 17, 2019

Enterprise Gateway could publish its kernelspec list to a ConfigMap whenever it changes, and the image puller could then run kubectl get configmap <name> --watch (or the client-library equivalent), hopefully sidestepping the scalability concern. Let K8s do the notifications for you! That, of course, only helps with the "detecting kernelspec change" case.
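
For illustration, the watch side of that idea with the official kubernetes Python client might look something like this (the namespace, ConfigMap name, and handle_kernelspecs handler are hypothetical):

from kubernetes import client, config, watch

config.load_incluster_config()  # KIP runs inside the cluster
v1 = client.CoreV1Api()

w = watch.Watch()
for event in w.stream(v1.list_namespaced_config_map,
                      namespace='enterprise-gateway',
                      field_selector='metadata.name=kernelspecs'):
    # Each ADDED/MODIFIED event delivers the full ConfigMap; react to changes here.
    handle_kernelspecs(event['object'].data)  # hypothetical handler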

@kevin-bates (Member Author)

Right, this would move the polling to one spot instead of N. It would also force us to an 'IfNotPresent' model - which, IMO, is sufficient, since tag names should vary in the real world.

There's some heavy lifting here in that we'd need to override the KernelSpecManager, have it perform k8s operations (only when configured for k8s), then modify KIP to watch the map. Since there's a chance this area could change depending on the jupyter server project, I'm inclined to keep things simple for now and see how it goes. Once the server direction is set, or we find scalability is a concern in this area, we can revisit this.

Does that sound ok to others?

@witten (Contributor) commented Apr 17, 2019

Seems like a reasonable place to start!

@kevin-bates kevin-bates force-pushed the kernel-image-puller branch from f83069b to 50f98c6 Compare April 17, 2019 23:13
@kevin-bates (Member Author)

Updated to include helm support. I decided not to parameterize the service name, since that would imply doing the same for nearly all other objects. Also, I have not parameterized the complete set of KIP parameters - only those I suspect might be adjusted more frequently.

Makefiles, documentation and release scripts have also been updated.

What is NOT updated at this time is the docker-compose file, since I'm not sure how this works there. I suspect it's a matter of defining a global service. I'm hoping @mattjtodd can lend his docker expertise for this.

I'm going to remove the WIP label. If we find more is needed, or we feel like including Docker in this merge, we can do that.

Your continued reviews are appreciated!

@kevin-bates kevin-bates changed the title [WIP] Add daemon to pull kernel images to each node Add daemon to pull kernel images to each node Apr 17, 2019
@kevin-bates kevin-bates force-pushed the kernel-image-puller branch 2 times, most recently from 5939c8e to 0c2575b Compare April 18, 2019 17:51
Kernel Image Puller (KIP) periodically polls Enterprise Gateway to
fetch the current set of configured kernelspecs, locating the kernel
image names from each.  It then pulls those images to the node on which
it is running.  Depending on its configured pull policy (KIP_PULL_POLICY),
it will either always attempt to pull each image (Always) or only pull
if it hasn't pulled that image before (IfNotPresent).

On Kubernetes, KIP is configured as a DaemonSet.  On Docker Swarm, it
will likely be configured as a global service (TBD).

Fixes jupyter-server#638
@kevin-bates kevin-bates mentioned this pull request Apr 23, 2019
@esevan (Contributor) left a comment

I want to use KIP in staging only. Is it feasible to make KIP optionally enabled?

@kevin-bates (Member Author)

Yes. I'll add a flag for helm tomorrow. I guess the daemonset will need to be commented out in the eg yaml file.

@kevin-bates kevin-bates force-pushed the kernel-image-puller branch from 0c2575b to 5a71ac3 Compare April 24, 2019 16:06
@kevin-bates (Member Author)

@esevan - I updated with the default option being true. This means the eg yaml can leave the daemonset configured since that's the default.

@esevan (Contributor) left a comment

This is a terrific feature. LGTM. @kevin-bates - the commit message indicates this fixes #638. I agree, but there are some other topics in that issue. How should we track those as not yet fixed? Do you agree we should leave them open?

  1. Event-driven image update.
  2. Integration with kernel spec manager
  3. ...

@kevin-bates (Member Author)

@esevan - thank you for the review and comments.

While I definitely think there's room for improvement here, I believe we should treat those as future enhancements and go ahead and merge this PR (assuming others are okay with it).

I've been thinking about these issues quite a bit - mostly about how KIP could see a kernelspec before users do - and have thought of a couple of approaches. Both would require adding, at minimum, a kernelspec handler and, possibly, a kernelspec manager class.

  1. Hidden Kernelspecs. The idea here would be to introduce a field in the metadata: stanza like hidden: True. If hidden, the handler would filter the specs it returns - removing those marked as hidden. We would then need to add a query parameter of some sort, like /api/kernelspecs?include_hidden=True, that KIP (and perhaps devOps) would use (see the sketch following this list). Once the hidden kernelspec has been validated - both in its content and operation, and in KIP's pulling of the image - the field would be removed or set to false, making the kernelspec available to end users.
  2. Authorized Users. We currently support the notion of authorized_users (and unauthorized_users) lists on a per-kernelspec basis, but these lists are only enforced at kernel startup, using the value of KERNEL_USERNAME in the JSON body. As a result, users who are not permitted to use certain kernelspecs will still see those kernelspecs as options - only to fail on attempted use. It would be nice to include this enforcement when fetching the kernelspecs, so users only see what they can use. Solving this introduces a number of things.
    1. All clients would need to include something akin to a query parameter like ?kernel_username=alice. We'd tackle the Notebook/Lab cases via NB2KG. Failure to add this parameter would mean they could not see any kernelspecs with a non-empty users list. If the list didn't exist in the kernelspec, the spec would still be returned.
    2. KIP would need to use some kind of meta-user or token that indicates it's special/privileged, OR its username would need to be added to all kernelspecs (which seems like a non-starter) - so preferably something like the former.
    3. These fields are currently buried in metadata.process_proxy.config and should probably be moved to metadata, since they really aren't process-proxy specific. (They existed before we moved process_proxy to metadata, so they needed a 'container' at the time.) They should be moved irrespective of any other decisions, IMO.
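
To make the Hidden Kernelspecs filtering in item 1 concrete, a rough sketch (the hidden field and include_hidden parameter are the proposed names, not an existing API):

def filter_kernelspecs(kernelspecs, include_hidden=False):
    # Drop specs whose metadata marks them hidden, unless ?include_hidden=True was passed.
    if include_hidden:
        return kernelspecs
    return {name: spec for name, spec in kernelspecs.items()
            if not spec.get('metadata', {}).get('hidden', False)}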

Even if we wanted to address authorized_users for kernelspecs, we'd need to resolve how KIP could access specs regardless.

So, I guess I'm leaning toward Hidden Kernelspecs, although both would be nice. (I think we need a way to indicate that a certain request (e.g., from KIP) is privileged anyhow - ideas welcome!)

The other area I've been looking into is dynamic configuration in general. Via a debugger, I found that if I load the configuration after startup (and after making a change), I see the updated value reflected in the member (traitlet) variable. This implies we could have a PeriodicCallback that simply refreshes the configuration (although this is another case where 'watching' would be better). So things like remote-hosts, authorized-users, or whitelists could, in theory, be updated without having to restart EG.
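
A minimal sketch of that refresh using Tornado (the refresh body is hypothetical; EG's actual config loading may differ):

from tornado.ioloop import PeriodicCallback

def refresh_config():
    # Hypothetical: re-read EG's config file so updated traitlet values take effect.
    app.load_config_file()

# Refresh every 60 seconds (PeriodicCallback takes milliseconds); requires a running IOLoop.
PeriodicCallback(refresh_config, 60 * 1000).start()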

Sorry for the ramble. I've been thinking about this stuff literally for the past couple of days, and it's good to talk about it. Great therapy. 😄

@esevan (Contributor) commented Apr 24, 2019

@kevin-bates What a deep dive into the issue! Thanks for sharing your ideas. Moving your comment to the issue would be very useful for sharing with others!

@kevin-bates (Member Author)

I added a couple related issues last week. Are others okay to have this merged?

@akchinSTC (Collaborator) left a comment

Installed on a K8s cluster via Helm. LGTM. Thanks, Kevin.

@akchinSTC akchinSTC merged commit e9d720b into jupyter-server:master May 7, 2019
@kevin-bates kevin-bates deleted the kernel-image-puller branch June 19, 2019 16:55