-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-4815: DRA Partitionable Devices #4874
KEP-4815: DRA Partitionable Devices #4874
Conversation
/cc |
c9c762f
to
a845fd3
Compare
I haven't added any text yet (i plan to do that over the next few days), but I wanted to at least get the proposed API changes along with an example pushed so that those that are familiar with what we want to do here can start taking a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took a quick look over this. If I understand correctly, you are allowing the creation of "abstract" devices which are then used as sort of "mixins" in the concrete devices via the "extends" attribute. This allows us to capture various aspects of the general shape of devices, and mix-and-match those aspects in the concrete devices.
The first ~1150 lines of the yaml capture those abstract devices, and the rest capture the many concrete instances across 4 GPUs. Do you have a "back-of-the-envelope" size calculation for the YAML similar to what we did here?
I don't love the "abstract" device thing. It's confusing, IMO. I would prefer a separate data type for the "shapes" rather than re-using "device". What would be the implications of that?
a845fd3
to
9adb6cd
Compare
Yes. That's the idea.
It's actually 8 GPUs.
The file size of the example I included here is 48K. I wil
It shouldn't matter since it is a requirement that all devices included in "extends" be abstract. I actually find it fairly intuitive to categorize them as devices (but mark them as abstract) since they can (and by definition do) include all of the same fields as a device. That said, I'd also be fine moving them under |
a514f0b
to
7ec6e85
Compare
Based on @johnbelamaric's comments I have updated this to include the explicit notion of a Text coming throughout the day and over the weekend... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the "mixin" a lot more, it's easier to understand the relationships between things. I do wonder though if this is just a new DeviceType rather than an extension of BasicDevice.
- name: mig-1g.5gb | ||
- name: memory-slices-0 | ||
consumesCapacityFrom: | ||
- name: gpu-0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You already said that you want to change it, so perhaps this will become clear soon, but let me call it out explicitly here: it's not obvious to me what this gpu-0-mig-1g.5gb-0
device consumes from gpu-0
. Presumably any capacity defined for the device (whether it's in the device itself or in a mixin) gets consumed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, as it's defined right now, all capacity must be consumed from somewhere if this field is populated. There is no way to list capacities particular to this device that aren't consumed from elsewhere.
That said, there is (technically) nothing preventing you from listing yourself as a source here. That way you would pull all resources available on GPU 0 from there first, falling back to yourself for those that are only specific to you.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That said, there is (technically) nothing preventing you from listing yourself as a source here.
Hmm, isn't consumption said to be recursive? Referencing yourself would lead to infinite recursion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which might be an important point: allowing a driver to publish something that can lead to infinite recursion should be prevented by validation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is related to @thockin's comment here:
https://github.com/kubernetes/enhancements/pull/4874/files#r1795977481
Includes[]DeviceMixinRef | ||
|
||
// ConsumesCapacityFrom defines the set of devices where any capacity | ||
// consumed by this device should be pulled from. This applies recursively. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This "pulled from" implies that there must be a device that can be pulled from. That device itself gets listed and thus can get allocated.
Suppose we have a flat list of identical devices which have some conflicting capacity requirements. We would need to define a fictional additional device that provides this scarce capacity. To Kubernetes, that device looks like something that it can allocate and a request for a "real" device will match it (same driver name, for example).
This is not insurmountable: the fictional device could have "fictional: true" attribute and the DeviceClass selector could exclude devices where that value is true. I just wanted to call this out. Might be worth adding to the KEP as an example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is one limitation of this approach (which I thought of but didn't call out explicitly here).
My original proposal (from many months ago) allowed for a separate place to list "shared" capacities that weren't attached to any particular device. Individual devices could then pull rom this list of shared capacities rather than pulling from another device explicitly.
So in the case of GPUs with MIG, we would have one shared capacity per full GPU from which both full GPUs and MIG devices would pull.
@thockin wasn't a fan of this approach though as he thought it seemed too abstract to have a "bag" of resources just declared in a resource slice but not explicitly tied to any particular physical device. At this point, I'm honestly OK with either approach, it's more about what is the common case, and I think needing to pull from some abstract source is actually the outlier, so it's fine to optimize for the more intuitive case.
Regarding having a fictional: true
attribute. The first version of this KEP had something similar with an explicit field of abstract: true
(not an attribute but an actual top-level field in the Device
struct). Having an attribute is just as good (maybe better) though, so long as your DeviceClass
includes the ability to filter out these types of devices from actually being allocated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A smaller API is better, so I prefer the approach with the attribute. We just need to explain it because it wouldn't be obvious to driver authors how to do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is related to @thockin's comment here:
#4874 (comment)
7ec6e85
to
841efd5
Compare
a60a3e2
to
b9f4313
Compare
0d59a82
to
4c41571
Compare
45cd82a
to
8a6096e
Compare
/approve |
16dc163
to
9ed882e
Compare
/approve |
9ed882e
to
6ed63f0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/approve
from SIG Scheduling
My only concern is from the observability PoV. The question "why did my pod not schedule" is now significantly harder to answer.
Some information about the usage of a ResourceSlice could be helpful, akin to the output of kubectl describe node
:
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 284m (30%) 2045m (217%)
memory 539316480 (18%) 4007175680 (136%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alculquicondor, johnbelamaric, klueska, mrunalp The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Maybe some summary with capacity per device and consumed / available will be useful 👍 |
Agreed, though that is out of scope for this KEP |
From what I understand, this information would be available via a ResourceClaim. I don't think we expect users to look into ResourceSlices for scheduling decisions. |
Not necessarily. It could be left for beta, though. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly just some nits, almost done.
// | ||
// +optional | ||
// +oneOf=deviceMixinType | ||
Composite *CompositeDeviceMixin `json:"composite,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What other mixins do we expect?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure -- @johnbelamaric had the same question. I'm fine removing this level of indirection if that is the consensus.
It just seemed symmetrical with how we treated devices, where a CompositeDevice
can only pull mixins of type CompositeDeviceMixin
. In the future, AdvancedDevice
would only be allowed to pull mixins of type AdvancedDeviceMixin
(which may actually be identical to CompositeDeviceMixin
, but doesn't have to be).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true. Let's keep it. As it's not really a user-facing API (not written by a user, only read for debugging) the extra indirection doesn't hurt that much.
/lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am also skeptical that we need to extend here, we can look at it in code review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not opposed to making it flat. It makes it asymmetrical with what we did for devices, but maybe that's OK.
Signed-off-by: Kevin Klues <[email protected]>
6ed63f0
to
53e1584
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
John pointed out that some of my comments were made elsewhere, so I apologize - the thread is super long and I thought it made sense to send these sooner rather than later.
I don't hate this. I actualy find the ideas rather elegant, but there are some things that nag at me, like some semi-unstated property that this asserts which may not be true, but I can't actively disprove.
It assumes that every resource is allocated hierarchically and that every level of that hierarchy is itself allocatable. Again, I can't find an example where it's not true, but it feels like an example MUST exist.
I am OK to pursue this KEP, so the fact that it merged already is OK.
matchAttribute: "gpu.nvidia.com/parentUUID" | ||
``` | ||
|
||
This would result in the following set of non-overlapping partitions: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This case is interesting - it highlights that either users need to be careful to specify things or the specific-device selection logic needs to be smart about seemingly-equivalent choices.
The table below works, but it could just as easily have chosen:
+------------+------------+------------+------------+------------+------------+------------+------------+
| MemSlice 0 | MemSlice 1 | MemSlice 2 | MemSlice 3 | MemSlice 4 | Memslice 5 | MemSlice 6 | MemSlice 7 |
+------------+------------+------------+------------+------------+------------+------------+------------+
| 3g.20gb | X |
+------------+------------+------------+------------+------------+------------+------------+------------+
| X | X | 2g.10gb | X |
+------------+------------+------------+------------+------------+------------+------------+------------+
| X | X | X | X | X | X | 1g.5gb | XXX | <--- Bad! The last 1g.5gb can't be allocated
+------------+------------+------------+------------+------------+------------+------------+------------+
Right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The scheduler might attempt to do this, but in doing so it would not be able to allocate all requests. I would then need to backtrack and try a different combination. Specific logic for doing this with MIG devices can be seen here: https://github.com/NVIDIA/mig-parted/blob/main/pkg/mig/config/config.go#L125. We would need to generalize this for the abstractions being introduced here.
// | ||
// +optional | ||
// +oneOf=deviceMixinType | ||
Composite *CompositeDeviceMixin `json:"composite,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am also skeptical that we need to extend here, we can look at it in code review
// | ||
// +optional | ||
// +oneOf=deviceType | ||
Composite *CompositeDevice `json:"composite,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For later: I am not in love with "compsite" here - it sounds like "compound" which we used (informally) to describe a different idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah. I originally called it PartitionableDevice
but that put the emphasis on the the consumeCapacityFrom
field and ignored the new mixin
field.
I'd love it if we could just add both of these new fields to the existing BasicDevice
type (since they are strictly extensions), but as @pohly keeps pointing out (because I keep forgetting) we can't extend the API in this way because it breaks downgrades. An old scheduler will not recognize the new fields and attempt to allocate a device based solely on the fields that it DOES know about. For mixins / attributes this is not too bad because selection would simply fail if the scheduler was not able to see the additional attributes that a mixin provides. Not honoring consumesCapacityFrom
properly would be unacceptable though.
// | ||
// The propertes of each included mixin are applied to this device in | ||
// order. Conflicting properties from multiple mixins are taken from the | ||
// last mixin listed that contains them. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we validate that they don't collide? Or is there a useful reason to allow it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we restrict mixins to th scope of a ResourceSlice
rather than the scope of an entire pool
(which may span multiple resource slices, they yes, I think we can do this validation. In conversations with @pohly I think we want to make this restriction anyway to make validation easier in general. It means that some mixins may be duplicated across multiple resource slices, but I think that is probably OK for the sake of validation.
// The maximum number of attributes and capacities combined is 32. | ||
// | ||
// +optional | ||
Capacity map[QualifiedName]DeviceCapacity `json:"capacity,omitempty"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to put the ConsumesCapacityFrom
inside DeviceCapacity? or in a parallel map?
quantity: 40Gi | ||
devices: | ||
- name: gpu-0 | ||
composite: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we don't really even need the composite indirection? What if we just extended basic
? As we move beyond alpha, we can consider whether we even REALLY need basic
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this goes back to the comment above...
I'd love it if we could just add both of these new fields to the existing BasicDevice type (since they are strictly extensions), but as @pohly keeps pointing out (because I keep forgetting) we can't extend the API in this way because it breaks downgrades. An old scheduler will not recognize the new fields and attempt to allocate a device based solely on the fields that it DOES know about. For mixins / attributes this is not too bad because selection would simply fail if the scheduler was not able to see the additional attributes that a mixin provides. Not honoring consumesCapacityFrom properly would be unacceptable though.
string: GPU-662077db-fa3f-0d8f-9502-21ab0ef058a2 | ||
``` | ||
|
||
As you can see, three mixins are created called "system-attributes", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At some point we talked about per-slice attribs thsat would be "inherited". This is similar but more explicit (both good and bad aspects). If we know EVERY device will have common things, my name them in each Device? It's sort of noisy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be good to see an example where mixins were not identical in every device.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the real-world MIG use case (later in the KEP), only one mixin (with 3 attributes) is shared by all of the devices in the slice. The vast majority of mixins are only referenced by a subset of devices.
In this example, five devices are defined: a full GPU called "gpu-0" and four | ||
partitions of "gpu-0" called "gpu-0-partion-0", "gpu-0-partion-1", | ||
"gpu-0-partion-2", and "gpu-0-partion-3" respectively. The full GPU has 40Gi of | ||
memory available, and each of the partitions pull 10Gi from it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we plan to validate that the sum of 1st level children is less than the parent? E.g. Could I define gpu0 with 40Gi and 4 children each with 20Gi ? That could be an error (e.g. cut-paste) or it could be that they have some resources "dedicated" to them that are not in the parent or it could some sort of fungible pool?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the children declare a capacity "foobar" which is not named in the parent? Is that an error?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a need for top-level resources from which children can consume but are not themselves allocatable? e.g. This card has 8 frobnicators and each child device needs 1 or 2, but there's no allocatable device with 8?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remember -- no one is defining these resource slices by hand (the driver publishes these), so the likelihood of a cut-paste-error is minimal. I don't think its worth trying to define semantics that would enable us to recognize that "gpu0 with 40Gi and 4 children each with 20Gi" is an invalid configuration. Driver implementors should be responsible for this themselves.
What if the children declare a capacity "foobar" which is not named in the parent? Is that an error?
This would be hard to validate if the parent and child are in different resource slices. I think we can (and should) do our best to validate it in cases where they are in the same slice, but not error out if they are not.
Is there a need for top-level resources from which children can consume but are not themselves allocatable? e.g. This card has 8 frobnicators and each child device needs 1 or 2, but there's no allocatable device with 8?
Yes, there are use cases for this. In fact, this is the exact use case @byako pointed out for SR-IOV VFs. You can achieve this in the current design by defining your DeviceClass
such that the top level device is never selectable, but its a bit back-handed since this is not encoded anywhere in the ResourceSlice
API itself.
We toyed with the idea of adding an extra field to a device to accommodate this case (e.g. abstract: true
or allocatable: false
), but never made it part of the design. I think the conclusion was that we could rely on DeviceClasses
for now and think more deeply about this later.
Happy to continue discussing during implementation though.
quantity: "1" | ||
memory: | ||
quantity: 40Gi | ||
memorySlice0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't love this aspect. At some point we talked about "named mutexes" or something, so this would looke more like:
attributes:
...
capacity:
memory:
quantity: 40Gi
consumes:
- memorySlice1
I guess it's somewhat minor -- it has to be listed SOMEWHERE. It just feels
weird to be in capacity, as if someone would want to write a claim against
it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed -- but this is where we are at the moment with the design.
I had also talked about introducing an intRange
abstraction at one point so that one could directly write:
consumes:
memorySlices: "0-3"
but that had it's own problems.
string: NVIDIA A100-SXM4-40GB | ||
type: | ||
string: mig | ||
- name: memory-slices-0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The next ~100 lines of YAML really suggest to me that we're doing somethign wrong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of this would be unnecessary if we had an intRange
or a set
abstraction to work with. It only looks as nasty as it does because we need some way of encoding an explicit "set" of resources, but all we have at our disposable is the ability to define a "quantity" of resources.
For example, this mixin:
- name: memory-slices-0-3
composite:
capacity:
memorySlice0:
quantity: "1"
memorySlice1:
quantity: "1"
memorySlice2:
quantity: "1"
memorySlice3:
quantity: "1"
would turn into a specific capacity
entry on the device itself of something like:
capacity:
memorySlices: [0,1,2,3] // if we supported sets
or
capacity:
memorySlices: "0-3" // if we supported intRanges
There would be no need for this long list of mixins defining valid memorySlice combinations at all.
Given that this is what we have to work with for now, I think it's acceptable, but we should look into writing a future KEP that introduces set
and/or intRange
as a usable abstraction here.
Thanks Tim. @klueska, @catblade I read Tim's input as saying he is OK with abstract pools :). And figuring out how to define local capacities above and beyond pooled capacities. One other thing he and I discussed is the idea of adding back in one level of inheritance. I didn't like the original proposal because the complex tree was hard to parse in your head. But we probably could allow the user to define the mixins, and ALSO then to use those mixins to define shapes. Both of those are abstract, just like today. A shape is just a named collection of mixins. But it's not a tree. Then, an individual device has a shape, plus locally defined attributes and capacities (no mixins directly in the devices, I think?). Our device list then becomes more compact, you don't have to repeat the mixins over and over. But it's just one level deep, not N levels deep. WDYT? |
I would want to see what exactly that means before signing off on it :)
What I meant was more like this:
|
I think a "shape" concept is useful, it avoids the repeated includes for every device of the same shape. |
Too many levels of indirection will make things harder, not easier. I am somewhat skeptical that we need need both shapes and mixins |
I'm thinking of this in terms of set notation. Trying to sort out what that looks like. IE item A is part of set C, but set C also includes D, E, and F. I want to be able to ask for a C, but I also want to be able to ask for an A if I absolutely need an A. I'll look at this later next week and see if I can make sense of it. What number of levels of abstraction can I get? |
In my experience there are very few attributes that actually get referenced by all devices, so I' don't think introducing a "top-level" attributes concept is worth it. The other bullet points are already supported in the current design. |
One-line PR description: Adding new KEP for Partitionable Devices in DRA
Issue link: DRA: Add support for partitionable devices #4815
Other comments: