-
Notifications
You must be signed in to change notification settings - Fork 222
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add daemon to pull kernel images to each node #641
Add daemon to pull kernel images to each node #641
Conversation
end_point = '{}/api/kernelspecs'.format(gateway_host) | ||
logger.info("Fetching kernelspecs from '{}' ...".format(end_point)) | ||
response = requests.get(end_point) | ||
return response.json() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When server is unhealthy, this would raise ValueError
. To clarify response error, I think it is better to raise clear error such as HTTPError
if not r.ok:
raise requests.exceptions.HTTPError('Server reponse: {}'.format(r.status_code))
I thought of a couple other things in my sleep.
|
Hmm, this is just my opinion -> I think it makes sense that "Detecting kernelspec change" in EG (event), "Detecting image change" (periodic) and "Detecting new node" (event) in KIP. I'm still not sure "image change" event can be detected unless watching it periodically.
Isn't this good to avoid "timeout" issue? Looks like "rolling upgrade" and "zero-downtime" upgrade. |
Enterprise Gateway could publish its kernelspec list to a ConfigMap whenever it changes, and then the image puller could |
Right, this would move the polling to one spot instead of N. This would force us to a 'IfNotPresent' model - which, IMO, is sufficient since tag names should vary in the real world. There's some heavy lifting here in that we'd need to override the KernelSpecManager, have it perform k8s ops (only when configured for k8s), then modify KIP to watch the map. Since there's a chance this area could change depending on the jupyter server project, I'm inclined to leave things simple for now and see how it goes. Once the server direction is set or we find scalability is a concern in this area, we can check back on this. Does that sound ok to others? |
Seems like a reasonable place to start! |
f83069b
to
50f98c6
Compare
Updated to include helm presence. I decided not to parameterize the service name since that would imply the same happen for nearly all other objects. Also, I have not parameterized the complete set of KIP parameters - only those I suspect might be adjusted more frequently. Makefiles, documentation and release scripts have also been updated. What is NOT updated at this time is the docker-compose file since I'm not sure how this works. I suspect it's a matter of defining a global service. I'm hoping @mattjtodd lend his docker expertise for this. I'm going to remove WIP. If we find more is needed or feel like including Docker in this merge, we can do that. Your continued reviews are appreciated! |
5939c8e
to
0c2575b
Compare
Kernel Image Puller (KIP) periodically polls Enterprise Gateway to fetch the current set of configured kernelspecs, locating the kernel image names from each. It then pulls those image to the node on which it is running. Depending on its configured pull policy (KIP_PULL_POLICY), it will either always attempt to pull each image (Always) or only pull if it hasn't pulled that image before (IfNotPresent). On Kubernetes, KIP is configured as a DaemonSet. On Docker Swarm, it will likely be configured as a global service (TBD). Fixes jupyter-server#638
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to use KIP
in stg only. Is it feasible to make KIP
optionally turned on?
Yes. I'll add a flag for helm tomorrow. I guess the daemonset will need to be commented out in the eg yaml file. |
0c2575b
to
5a71ac3
Compare
@esevan - I updated with the default option being |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a terrific feature. LGTM. @kevin-bates The commit message indicates this fixes #638. I agree, but there're some other topics in the issue though. How do we leave those issues as not fixed? Do you agree we have to leave them?
- Event-driven image update.
- Integration with kernel spec manager
- ...
@esevan - thank you for the review and comments. While I definitely think there's room for improvement here, I believe we should treat those as future enhancements and go ahead and merge this PR (assuming others are okay with it). I've been thinking about these issues quite a bit - mostly in how can KIP see a kernelspec before the users - and have thought of a couple approaches. Both of which would require adding at a minimum a kernelspec handler and, possibly, a kernelspec manager class.
Even if we wanted to address authorized_users for kernelspecs, we'd need to resolve how KIP could access specs regardless. So, I guess I'm leaning toward Hidden Kernelspecs although both would be nice (I think we need a way to say a certain request (e.g., from KIP) is privileged anyhow. ideas welcome!) The other area I've been looking into is dynamic configuration in general. Via a debugger I found that if I load the configuration after startup (and after making a change), I see the updated value reflected in the member (traitlet) variable. This implies that you could have a PeriodicCallback that simply refreshes the configuration (although this is another case where 'watching' is better). So things like remote-hosts, authorized-users, or whitelists could (in theory) be updated without having to restart EG. Sorry for the ramble. I've been thinking about this stuff literally for the past couple days and its good to talk about it. Great therapy. 😄 |
@kevin-bates So deep diving into the issue! Thanks for sharing your idea. Moving your comment to the issue will be very useful to share with others! |
I added a couple related issues last week. Are others okay to have this merged? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Installed on K8s cluster via Helm. LGTM. Thanks Kevin
This adds support for a Kernel Image Puller image, meant to be run in a DaemonSet, that will periodically gather the set of configured kernel image names from Enterprise Gateway and pull the image that image to the node on which it is running. Since it's configured as a DaemonSet, it will be running on each node (as well as started on newly added nodes) of the cluster. This idea was discussed in #638.
The configurable items are all established via environment variables since that is the natural way to use parameters in container configurations. These include:
KIP_INTERVAL
: the number of seconds at which it waits before performing the operation again. Default is every 5 minutes (300
)KIP_NUM_PULLERS
: the number of threads active at any given time operating from a shared queue. The default number of puller threads is 2 but may vary.KIP_NUM_RETRIES
: the number retries to perform if issues are encountered. The exception to this are image not found exceptions. Default is 3.KIP_GATEWAY_HOST
: the host (and path) that precedes the kernelspecs endpoint.