Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Resource Drivers section to extending Kubernetes documentation #43742

Closed
wants to merge 7 commits into from

Conversation

byako
Copy link
Contributor

@byako byako commented Oct 30, 2023

Dynamic Resource Allocation (DRA) has been available since 1.26, this adds Resource Driver concept and glossary items to help improving DRA awareness and hopefully increase the number of resource drivers out there, which would bring more feedback about the framework and help developing it further.

Fixes #38557

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. language/en Issues or PRs related to English language labels Oct 30, 2023
@k8s-ci-robot k8s-ci-robot added the sig/docs Categorizes an issue or PR as relevant to SIG Docs. label Oct 30, 2023
@netlify
Copy link

netlify bot commented Oct 30, 2023

Pull request preview available for checking

Built without sensitive environment variables

Name Link
🔨 Latest commit 043439a
🔍 Latest deploy log https://app.netlify.com/sites/kubernetes-io-main-staging/deploys/65ce141c0e66d20007706cb8
😎 Deploy Preview https://deploy-preview-43742--kubernetes-io-main-staging.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

Copy link
Member

@windsonsea windsonsea left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits

Copy link
Contributor

@elezar elezar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial comments.

@sftim
Copy link
Contributor

sftim commented Oct 30, 2023

/sig node

@k8s-ci-robot k8s-ci-robot added the sig/node Categorizes an issue or PR as relevant to SIG Node. label Oct 30, 2023
Copy link
Contributor

@sftim sftim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

Please match the style guide's advice on how to write API kinds and related names.

@byako byako requested a review from elezar October 31, 2023 11:49
@byako byako changed the title Add Resource Driver concept Add Resource Drivers section to extending Kubernetes documentation Oct 31, 2023
Copy link
Contributor

@sftim sftim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

The proposed concept page has a lot of reference details, such as gRPC interfaces. These don't really belong in a page that has the main purpose of explaining what resource drivers are.

I'd omit the gRPC blocks and the surrounding text. Consider whether https://kubernetes.io/docs/reference/node/ is a better home for those details.

Alternatively, you can also write a one-off blog article with some details. Blog articles don't need to be complete and maintaining them after publication is optional. We can then revisit these docs for beta.

@byako
Copy link
Contributor Author

byako commented Oct 31, 2023

Thanks

The proposed concept page has a lot of reference details, such as gRPC interfaces. These don't really belong in a page that has the main purpose of explaining what resource drivers are.

I'd omit the gRPC blocks and the surrounding text. Consider whether https://kubernetes.io/docs/reference/node/ is a better home for those details.

Alternatively, you can also write a one-off blog article with some details. Blog articles don't need to be complete and maintaining them after publication is optional. We can then revisit these docs for beta.

Apologies for the confusion, I've renamed the PR right as you were reviewing it. The main purpose doc in this PR is not just introduction to resource drivers, but extending the Kubernetes by building a resource driver, this is very much analogous to the existing device plugins page here: https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/.

@sftim
Copy link
Contributor

sftim commented Oct 31, 2023

… The main purpose doc in this PR is not just introduction to resource drivers, but extending the Kubernetes by building a resource driver, this is very much analogous to the existing device plugins page here: https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/.

The existing page about device plugins is OK but there's lots of room to improve it. Essentially, watch out for using early / “good enough” docs as a guide to how new docs should come in.

For a brand new page, we'd like to see more structure. Concept pages should explain what things are; other details belong in task, tutorial or reference pages. And we use hyperlinks to connect the different pieces. This gatekeeping for new content helps drive the overall documentation quality in the direction we'd like to see.

@sftim
Copy link
Contributor

sftim commented Oct 31, 2023

/sig architecture

for the question about whether resource driver names ending with actual DNS names are an expectation or a hint.

@k8s-ci-robot k8s-ci-robot added the sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. label Oct 31, 2023
Copy link

@eero-t eero-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added few possible text improvement suggestions. Feel free to just ignore/resolve ones you disagree with.

Copy link

linux-foundation-easycla bot commented Nov 2, 2023

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Nov 2, 2023
@byako byako force-pushed the resource-drivers-concept branch from 2060275 to 3957dfb Compare January 26, 2024 08:27
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign katcosgrove for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@byako byako force-pushed the resource-drivers-concept branch from 3957dfb to 794c7b9 Compare January 26, 2024 08:28
@sftim
Copy link
Contributor

sftim commented Feb 9, 2024

/hold cancel

Squash has happened.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 9, 2024
@byako
Copy link
Contributor Author

byako commented Feb 9, 2024

@windsonsea , @divya-mohan0209 , I fixed all review comments and technical review from sig-node was done by @pohly , could you please have a look and LGTM this if you have no further comments, this has been stagnating for a while now.

---

<!-- overview -->
{{< feature-state for_k8s_version="v1.26" state="alpha" >}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the specified feature state (Kubernetes v1.26 [alpha]) accurate as mentioned here? I noticed on another page (Dynamic Resource Allocation page), which is also referenced in the content below, that the state is indicated as (Kubernetes v1.27 [alpha]).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The feature was in fact added in 1.26:

$ git log v1.26.0 | grep -B4 "kube features: add DynamicResourceAllocation"
commit 155d49813f131a2beb6aeef71270f94b6f5fcf18
Author: Patrick Ohly <[email protected]>
Date:   Tue Mar 22 16:53:58 2022 +0100

    kube features: add DynamicResourceAllocation

But, 1.27 had backwards-incompatible API change, I guess that was the reason to name it as 1.27 in the other page. I'll make it 1.27 here too for consistency.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Force-pushed squashed commmits.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{{< feature-state for_k8s_version="v1.26" state="alpha" >}} sounds correct. I'm inferring that the first alpha release was v1.26, even if there have been changes since.

https://kubernetes.io//docs/concepts/scheduling-eviction/dynamic-resource-allocation/ is IMO incorrect.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file name _intex.md appears to be incorrect. It would be more appropriate as _index.md to ensure the proper organization of content under the "helper-libraries" section.

@byako byako force-pushed the resource-drivers-concept branch from e08daf0 to 77421cd Compare February 11, 2024 09:23
Copy link
Contributor

@sftim sftim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. It's good to see more detail.


I'm not keen on the suggested glossary entries.

I would drop all of these and add one for “Dynamic Resource Allocation”. I might keep “Resource Driver” but the others are not really right for our docs.

We have an ambition to add fact sheets for API kinds; if / when implemented, those would cover the API kinds for DRA.

@@ -0,0 +1,19 @@
---
title: Resource Claim Parameters
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following our style guide, this would be “ResourceClaimParameters” (no spaces). Same for other API kinds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree about spelling this in text somewhere, but looking at the rest of glossary items, they seem to be spelled with space in title, and with hyphen in id, I just tried to keep it consistent.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • No spaces in titles
  • ID can be either kebab-case or (preferred) matching the API kind

See #43742 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated glossary items.

content/en/docs/reference/glossary/resource-claim.md Outdated Show resolved Hide resolved
@@ -0,0 +1,19 @@
---
title: Resource Claim Parameters
id: resource-claim-parameters
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd use ResourceClaimParameters as the id as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be inconsistent with the rest of glossary items.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm saying, as an approver and tech lead for SIG Docs, that what I'm recommending is our ideal practice. We welcome help to bring other glossary items in line with our style guide, but we are not prioritizing that work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change might be causing the build failures. After looking at the {{< glossary_definition >}} (applicable to glossary_tooltip as well) shortcode file, it appears that the shortcode relies on matching the term_id passed with glossary entry filenames to retrieve the description.

With this recent change, the id in the file now reads as "ResourceClaimParameters" while the file name is "resource-claim-parameters.md". Consequently, the shortcode is unable to locate the file to display the necessary data for term_id="ResourceClaimParameters", leading to the build failure. To address this, I suggest temporarily reverting to the existing conventions. However, in the long run, we could potentially enhance the shortcode logic to match the term_id with the id within the glossary file content instead of relying solely on filename.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of reverting the id field, I can just rename the files to match the .md, right ? I'll give it a shot.

Comment on lines 54 to 80
## Implementing controller

Communication between the scheduler and a resource driver is done through
{{< glossary_tooltip term_id="pod-scheduling-context" text="PodSchedulingContext">}} API object. It is
recommended to use [controller helper library](/docs/reference/helper-libraries/dra-driver-controller/),
it implements all needed operations related to PodSchedulingContext, they are common for all resource
drivers.

When using the DRA helper code, resource driver controller has to implement a
[driver interface](https://pkg.go.dev/k8s.io/[email protected]/controller#Driver),
which then can be simply used with
[controller helper instance](https://pkg.go.dev/k8s.io/[email protected]/controller#New).

See the [example resource driver controller](https://github.com/kubernetes-sigs/dra-example-driver/blob/151b7c8da2e620c47a3591e1a937f8d0297b0c25/cmd/dra-example-controller/main.go#L208) for details.

A resource driver controller's main responsibility is to allocate and deallocate resources for
{{< glossary_tooltip term_id="resource-claim" text="ResourceClaims">}}.

There are two modes of allocation the ResourceClaim can have:

- `WaitForFirstConsumer` (default), which you could think of as meaning _delayed_.
In this mode the cluster only requests resource(s)
for ResourceClaim when a Pod that needs it is being scheduled.
- `Immediate`: the resource has to be allocated to the ResourceClaim as soon as possible, and
retained until the ResourceClaim is deleted.

If `allocationMode` is not set explicitly, the mode is `WaitForFirstConsumer`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd move this level of detail into a separate page within https://kubernetes.io/docs/reference/node/, and write a shorter text about why PodSchedulingContext and ResourceClaim are used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I agree that reference spec of the ResourceClaim object is better placed in reference/node, this is supposed to be an article about extending Kubernetes, how to write a resource driver. Implementing a controller part of a resource driver seems to me very much suitable here. We don't bring in the whole info about the ResourceClaim or PodSchedulingContext in here, this is in fact a short summary, only what's important to keep in mind when writing a resource driver.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What we're hoping to see in this page is an explanation of the concept of a resource driver. We would place a tutorial about implementing one somewhere else.

There's a lot of detail here which feels like it could be too much someone who first of all wants to learn what DRA is. We need to serve that audience before we serve would-be implementers.

What might work instead: an evergreen blog article with a guide to how to implement a DRA controller.


We can help reshape this PR to match the approach we usually take, as a project, to documenting Kubernetes behavior. I do think this needs those changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, so the "extending Kubernetes" is not for writing extensions, but about customizing the instance of Kubernetes, like installation of device plugins and CRDs, and in our case resource drivers. got it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minimized the chapter.

@byako
Copy link
Contributor Author

byako commented Feb 12, 2024

Thanks for the PR. It's good to see more detail.

I'm not keen on the suggested glossary entries.

I would drop all of these and add one for “Dynamic Resource Allocation”. I might keep “Resource Driver” but the others are not really right for our docs.

We have an ambition to add fact sheets for API kinds; if / when implemented, those would cover the API kinds for DRA.

I'm happy to drop the glossary items when #38682 is implemented.

Co-authored-by: Michael <[email protected]>
Co-authored-by: Evan Lezar <[email protected]>
Co-authored-by: Tim Bannister <[email protected]>
Co-authored-by: Eero Tamminen <[email protected]>
Co-authored-by: Patrick Ohly <[email protected]>
Co-authored-by: Dipesh Rawat <[email protected]>
@byako byako force-pushed the resource-drivers-concept branch from f2d6d04 to b3c0599 Compare February 14, 2024 09:50
@byako
Copy link
Contributor Author

byako commented Feb 14, 2024

Renaming glossary files worked, I squashed commits and force-pushed.

@@ -0,0 +1,8 @@
---
title: Helper Libraries
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't find a suitable place that exists, so I created one. It's supposed to contain the prototype reference that the resource driver needs to implement, if I'm not mistaken, it was your idea to move this to the reference section from the main document. Where do you think the helper libraries for writing a resource driver are more suitable?

content/en/docs/reference/glossary/ResourceClass.md Outdated Show resolved Hide resolved
content/en/docs/reference/glossary/ResourceClass.md Outdated Show resolved Hide resolved
In both modes the allocation is preceded by getting parameters objects for ResourceClaims and
ResourceClasses to ensure the resource driver is able to get these objects and understand them.

## Sharing resources
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd move this section much earlier in the page.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

date: 2023-10-16
full_link: /docs/concepts/extend-kubernetes/compute-storage-net/resource-drivers/
short_description: >
Software extensions to let Pods access devices that need vendor-specific initialization or setup
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't sound quite right. You might use a resource driver just for the allocation aspect, without any vendor specific setup.

For example, using DRA to assign a tape robot to a Pod that is running backup software, and exposing a vendor neutral protocol for tape management (eg: iSCSI).

@sftim
Copy link
Contributor

sftim commented Feb 14, 2024

@byako, would you be willing to write a blog article about implementing a resource driver? That would help readers, I think.

@divya-mohan0209
Copy link
Contributor

@sftim: I'm curious whether your suggestion was supplementary to this PR or whether you're recommending this be moved to a blog?

@sftim
Copy link
Contributor

sftim commented Apr 1, 2024

@sftim: I'm curious whether your suggestion was supplementary to this PR or whether you're recommending this be moved to a blog?

We should definitely add something to the Extending Kubernetes pages. However, I think a blog article is the right home for many of the points that @byako wants to make.

Here's why: drivers change. vendors change. the way people manage nodes and their extensions will change.
A blog article will comfortably capture good practice from 2024. Next year the story may vey well be different, and the concept guide will need some updating. DRA in particular is likely to change a lot between now and Kubernetes 1.32

The more that can go into an of-its-time blog article, the better. Not only that, @byako, it lets you tell a story - and I get the sense that you'd like to take readers on a journey with what you're writing.

@byako
Copy link
Contributor Author

byako commented Apr 15, 2024

I've discussed the sensibility of proceeding with this PR with our tech leads responsible for this feature, and the consensus is that we put this on hold because 1.30 and 1.31 bring so much difference to this feature. We'll be back with new PR based on this one later in 2024, close to 1.31 release.

@byako byako closed this Apr 15, 2024
@sftim
Copy link
Contributor

sftim commented Apr 15, 2024

A note about why I was keen on this: it's handy if, when people look at the v1.29 documentation (because that's what they're running), they see details that are right for that version. Updates for v1.30 and v1.31 won't help those readers, in fact they'll make the docs even less relevant for that particular need.

However, if we don't have capacity for this update, so be it. In the long term, we have an answer: v1.29 will eventually not be supported.

@byako
Copy link
Contributor Author

byako commented Apr 16, 2024

I understand, and believe me I wanted to see this merged as well given how much time both - the community and myself put into this. However, you're right - there is less capacity at the moment. There might still some time for a blog article in the summer, with updated feature functionality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. language/en Issues or PRs related to English language sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/docs Categorizes an issue or PR as relevant to SIG Docs. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
Development

Successfully merging this pull request may close these issues.

dynamic resource allocation: "extending Kubernetes" documentation