
Add daemon to pull kernel images to each node #641

Merged - 2 commits merged into jupyter-server:master on May 7, 2019

Conversation

@kevin-bates (Member)

This adds support for a Kernel Image Puller image, meant to be run in a DaemonSet, that will periodically gather the set of configured kernel image names from Enterprise Gateway and pull each image to the node on which it is running. Since it's configured as a DaemonSet, it will run on each node of the cluster (and start on newly added nodes). This idea was discussed in #638.

The configurable items are all established via environment variables, since that is the natural way to parameterize container configurations. These include:

  • KIP_INTERVAL: the number of seconds to wait between pull iterations. Default is 300 (every 5 minutes).
  • KIP_NUM_PULLERS: the number of puller threads active at any given time, operating from a shared queue. Default is 2.
  • KIP_NUM_RETRIES: the number of retries to perform when a pull fails; image-not-found errors are the exception and are not retried. Default is 3.
  • KIP_GATEWAY_HOST: the host (and path) that precedes the kernelspecs endpoint, used to build the fetch URL shown below.

import requests

def fetch_kernelspecs(gateway_host):
    # Ask Enterprise Gateway for the current set of configured kernelspecs.
    end_point = '{}/api/kernelspecs'.format(gateway_host)
    logger.info("Fetching kernelspecs from '{}' ...".format(end_point))
    response = requests.get(end_point)
    return response.json()
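
For a sense of how those settings fit together, here is a minimal sketch of the puller loop (pull_image and image_names are illustrative placeholders, not the actual KIP source):

import os
import queue
import threading
import time

interval = int(os.getenv('KIP_INTERVAL', '300'))
num_pullers = int(os.getenv('KIP_NUM_PULLERS', '2'))
num_retries = int(os.getenv('KIP_NUM_RETRIES', '3'))
gateway_host = os.getenv('KIP_GATEWAY_HOST', 'http://localhost:8888')

name_queue = queue.Queue()

def puller():
    # Pull image names from the shared queue, retrying transient failures.
    while True:
        image_name = name_queue.get()
        for attempt in range(1, num_retries + 1):
            if pull_image(image_name):  # hypothetical docker-client wrapper
                break
        name_queue.task_done()

for _ in range(num_pullers):
    threading.Thread(target=puller, daemon=True).start()

while True:
    for image_name in image_names(fetch_kernelspecs(gateway_host)):  # image_names() extracts names from the specs
        name_queue.put(image_name)
    name_queue.join()     # wait for this iteration's pulls to complete
    time.sleep(interval)  # KIP_INTERVAL seconds between iterations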
A contributor left a review comment:

When the server is unhealthy, this will raise a ValueError (from parsing a non-JSON body). To make the response error clear, I think it is better to raise an explicit error such as HTTPError:

if not response.ok:
    raise requests.exceptions.HTTPError('Server response: {}'.format(response.status_code))
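
(For what it's worth, requests also provides response.raise_for_status(), which raises an HTTPError for 4xx/5xx responses and would accomplish much the same thing.)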

@kevin-bates (Member Author)

I thought of a couple other things in my sleep.

  1. The behavior here is essentially an Always pull policy, since the pull is attempted on each iteration. Since the default Kubernetes pull policy is IfNotPresent, I think we should remember the images that have been pulled (a simple set) and add a parameter KIP_PULL_POLICY that, when set to IfNotPresent, only pulls images it hasn't already pulled (see the sketch after this list).
  2. We should still try to come up with a solution based on notifications rather than polling from each node, since I'm a little concerned about scalability (100 nodes all retrieving kernelspecs every N seconds). That said, I'm not sure how to detect new images without some kind of minimal polling or a radical change (like detecting new kernelspecs). I suppose we could move the polling into EG (hooked into the tornado loop, similar to how idle watching works), where it gathers the image names, determines which haven't been seen before, and triggers some kind of event for new entries. The DaemonSet would then merely monitor that event, which triggers its current behavior. However, I think this approach would be limited to an IfNotPresent policy. Is that good enough, though? Yes, Always pods will incur slowdowns, but only when the actual image has changed. OTOH, admins could adopt behaviors like "hold off using a given kernel until the cluster has had a chance to pull".
  3. Truncate the pull timing values to 3 decimal places (ms).
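
A rough sketch of the bookkeeping described in item 1 (maybe_pull and pull_image are illustrative, not actual KIP code):

import os

pull_policy = os.getenv('KIP_PULL_POLICY', 'IfNotPresent')
pulled_images = set()

def maybe_pull(image_name):
    # Skip images we've already pulled unless the policy is 'Always'.
    if pull_policy == 'IfNotPresent' and image_name in pulled_images:
        return
    if pull_image(image_name):  # hypothetical docker-client wrapper
        pulled_images.add(image_name)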

@esevan (Contributor) commented Apr 17, 2019

I suppose we could move the polling into EG

Hmm, this is just my opinion: I think it makes sense to handle "detecting kernelspec changes" in EG (event-driven), while "detecting image changes" (periodic) and "detecting new nodes" (event-driven) belong in KIP. I'm still not sure an "image change" event can be detected without some periodic watching.

"hold off using a given kernel until the cluster has had a chance to pull"

Isn't this good for avoiding the "timeout" issue? It looks like a "rolling upgrade" / "zero-downtime" upgrade pattern.

@witten (Contributor) commented Apr 17, 2019

Enterprise Gateway could publish its kernelspec list to a ConfigMap whenever it changes, and the image puller could then run kubectl get configmap <name> --watch (or the client-library equivalent), hopefully sidestepping the scalability concern. Let K8s do the notifications for you! That, of course, only helps with the "detecting kernelspec change" case.
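
For illustration, the watch side of that idea with the official kubernetes Python client might look something like this (the namespace, ConfigMap name, and handle_kernelspecs handler are hypothetical):

from kubernetes import client, config, watch

config.load_incluster_config()  # KIP runs inside the cluster
v1 = client.CoreV1Api()

w = watch.Watch()
for event in w.stream(v1.list_namespaced_config_map,
                      namespace='enterprise-gateway',
                      field_selector='metadata.name=kernelspecs'):
    # Each ADDED/MODIFIED event delivers the full ConfigMap; react to changes here.
    handle_kernelspecs(event['object'].data)  # hypothetical handler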

@kevin-bates (Member Author)

Right, this would move the polling to one spot instead of N. It would also force us to an 'IfNotPresent' model - which, IMO, is sufficient, since tag names should vary in the real world.

There's some heavy lifting here in that we'd need to override the KernelSpecManager, have it perform k8s operations (only when configured for k8s), then modify KIP to watch the map. Since there's a chance this area could change depending on the jupyter server project, I'm inclined to keep things simple for now and see how it goes. Once the server direction is set, or we find scalability is a concern in this area, we can revisit this.

Does that sound ok to others?

@witten (Contributor) commented Apr 17, 2019

Seems like a reasonable place to start!

@kevin-bates kevin-bates force-pushed the kernel-image-puller branch from f83069b to 50f98c6 Compare April 17, 2019 23:13
@kevin-bates (Member Author)

Updated to include helm support. I decided not to parameterize the service name, since that would imply doing the same for nearly all other objects. Also, I have not parameterized the complete set of KIP parameters - only those I suspect might be adjusted more frequently.

Makefiles, documentation and release scripts have also been updated.

What is NOT updated at this time is the docker-compose file, since I'm not sure how this works there. I suspect it's a matter of defining a global service. I'm hoping @mattjtodd can lend his docker expertise for this.

I'm going to remove the WIP label. If we find more is needed, or we feel like including Docker in this merge, we can do that.

Your continued reviews are appreciated!

@kevin-bates kevin-bates changed the title [WIP] Add daemon to pull kernel images to each node Add daemon to pull kernel images to each node Apr 17, 2019
@kevin-bates kevin-bates force-pushed the kernel-image-puller branch 2 times, most recently from 5939c8e to 0c2575b Compare April 18, 2019 17:51
Kernel Image Puller (KIP) periodically polls Enterprise Gateway to
fetch the current set of configured kernelspecs, locating the kernel
image names from each.  It then pulls those images to the node on which
it is running.  Depending on its configured pull policy (KIP_PULL_POLICY),
it will either always attempt to pull each image (Always) or only pull
if it hasn't pulled that image before (IfNotPresent).

On Kubernetes, KIP is configured as a DaemonSet.  On Docker Swarm, it
will likely be configured as a global service (TBD).

Fixes jupyter-server#638
@kevin-bates kevin-bates mentioned this pull request Apr 23, 2019
@esevan (Contributor) left a comment

I want to use KIP in staging only. Is it feasible to make KIP optionally enabled?

@kevin-bates (Member Author)

Yes. I'll add a flag for helm tomorrow. I guess the daemonset will need to be commented out in the eg yaml file.

@kevin-bates kevin-bates force-pushed the kernel-image-puller branch from 0c2575b to 5a71ac3 Compare April 24, 2019 16:06
@kevin-bates (Member Author)

@esevan - I updated with the default option being true. This means the eg yaml can leave the daemonset configured since that's the default.

@esevan (Contributor) left a comment

This is a terrific feature. LGTM. @kevin-bates - the commit message indicates this fixes #638. I agree, but there are some other topics in that issue. How should we track those as not yet fixed? Do you agree we should leave them open?

  1. Event-driven image update.
  2. Integration with kernel spec manager
  3. ...

@kevin-bates (Member Author)

@esevan - thank you for the review and comments.

While I definitely think there's room for improvement here, I believe we should treat those as future enhancements and go ahead and merge this PR (assuming others are okay with it).

I've been thinking about these issues quite a bit - mostly about how KIP could see a kernelspec before users do - and have thought of a couple of approaches. Both would require adding, at minimum, a kernelspec handler and, possibly, a kernelspec manager class.

  1. Hidden Kernelspecs. The idea here would be to introduce a field in the metadata: stanza like hidden: True. If hidden, the handler would filter the specs it returns - removing those marked as hidden. We would then need to add a query parameter of some sort, like /api/kernelspecs?include_hidden=True, that KIP (and perhaps devOps) would use (see the sketch following this list). Once the hidden kernelspec has been validated - both in its content and operation, and in KIP's pulling of the image - the field would be removed or set to false, making the kernelspec available to end users.
  2. Authorized Users. We currently support the notion of authorized_users (and unauthorized_users) lists on a per-kernelspec basis, but these lists are only enforced at kernel startup, using the value of KERNEL_USERNAME in the JSON body. As a result, users who are not permitted to use certain kernelspecs will still see those kernelspecs as options - only to fail on attempted use. It would be nice to include this enforcement when fetching the kernelspecs, so users only see what they can use. Solving this introduces a number of things.
    1. All clients would need to include something akin to a query parameter like ?kernel_username=alice. We'd tackle the Notebook/Lab cases via NB2KG. Failure to add this parameter would mean they could not see any kernelspecs with a non-empty users list. If the list didn't exist in the kernelspec, the spec would still be returned.
    2. KIP would need to use some kind of meta-user or token that indicates it's special/privileged, OR its username would need to be added to all kernelspecs (which seems like a non-starter) - so preferably something like the former.
    3. These fields are currently buried in metadata.process_proxy.config and should probably be moved to metadata, since they really aren't process-proxy specific. (They existed before we moved process_proxy to metadata, so they needed a 'container' at the time.) They should be moved irrespective of any other decisions, IMO.
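
To make the Hidden Kernelspecs filtering in item 1 concrete, a rough sketch (the hidden field and include_hidden parameter are the proposed names, not an existing API):

def filter_kernelspecs(kernelspecs, include_hidden=False):
    # Drop specs whose metadata marks them hidden, unless ?include_hidden=True was passed.
    if include_hidden:
        return kernelspecs
    return {name: spec for name, spec in kernelspecs.items()
            if not spec.get('metadata', {}).get('hidden', False)}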

Even if we wanted to address authorized_users for kernelspecs, we'd need to resolve how KIP could access specs regardless.

So, I guess I'm leaning toward Hidden Kernelspecs, although both would be nice. (I think we need a way to indicate that a certain request (e.g., from KIP) is privileged anyhow - ideas welcome!)

The other area I've been looking into is dynamic configuration in general. Via a debugger, I found that if I load the configuration after startup (and after making a change), I see the updated value reflected in the member (traitlet) variable. This implies we could have a PeriodicCallback that simply refreshes the configuration (although this is another case where 'watching' would be better). So things like remote-hosts, authorized-users, or whitelists could, in theory, be updated without having to restart EG.
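
A minimal sketch of that refresh using Tornado (the refresh body is hypothetical; EG's actual config loading may differ):

from tornado.ioloop import PeriodicCallback

def refresh_config():
    # Hypothetical: re-read EG's config file so updated traitlet values take effect.
    app.load_config_file()

# Refresh every 60 seconds (PeriodicCallback takes milliseconds); requires a running IOLoop.
PeriodicCallback(refresh_config, 60 * 1000).start()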

Sorry for the ramble. I've been thinking about this stuff literally for the past couple of days, and it's good to talk about it. Great therapy. 😄

@esevan (Contributor) commented Apr 24, 2019

@kevin-bates What a deep dive into the issue! Thanks for sharing your ideas. Moving your comment to the issue would be very useful for sharing with others!

@kevin-bates (Member Author)

I added a couple related issues last week. Are others okay to have this merged?

@akchinSTC (Collaborator) left a comment

Installed on a K8s cluster via Helm. LGTM. Thanks, Kevin.

@akchinSTC akchinSTC merged commit e9d720b into jupyter-server:master May 7, 2019
@kevin-bates kevin-bates deleted the kernel-image-puller branch June 19, 2019 16:55