[RFC][TIR] Layout transformations on buffer access #39
Conversation
Thanks @Lunderberg for the RFC. Logical-physical mapping is definitely an important feature. I also implemented something similar for warp memory to support tensor core instructions on GPU; I'm happy to collaborate more to get a unified design.
Thank you for the comments, @vinx13.
Never mind, I'm convinced after going through your point below about CUDA's permuted layout.
Thank you for the example; this sort of example was what I was hoping to get in order to determine whether the representation of reorders/splits is sufficiently flexible. If I'm reading the link correctly, I agree that the physical layout in shared memory cannot be expressed in the proposed notation. If the global memory is 2-d memory … I am convinced that, while the earlier notation may be useful as a more readable input in some cases, it doesn't cover all use cases. At most, it could be an alternative form of input, but the function definition mapping from logical indices to physical indices should be the primary representation. In terms of how to represent this in the TIR graph, I'll have to think a bit on it. I'm picturing either …
Thank you for mentioning these. I intend to use these utilities for the implementation, and will add them to the detail section.
One way to represent the layout mapping in TIR is to introduce different storage scopes and have a registry of pre-defined layout mappings (for example, we already did a similar thing for …).
True, and that would allow both for user-defined mappings and for specifying standard layouts. I have a bit of concern with using the … I really like that idea of building the TIR representation based on the provided function. Like with … Edit: The signature could also be made more similar to …
How will vectorization work? If there is a …
While in principle a vectorized load/store could fall back to a non-vectorized load/store, this would result in ignoring the … This restriction could then be loosened by allowing schedules to be expressed in terms of the physical layout, rather than the logical layout. I haven't thought through the full implications of this, which would likely involve either introducing an inlined Stage whose axes are the physical axes, or exposing the physical axes as a separate parameter within the same stage.
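A hedged sketch of scheduling over the physical axes, assuming the `transform_layout` primitive proposed in this RFC (the placeholder shapes and the NCHW-to-NCHW4c mapping are illustrative only). Vectorizing the innermost returned axis is well-defined because it is contiguous in the physical layout, whereas the innermost logical axis may not be:

```python
import tvm
from tvm import te

A = te.placeholder((1, 64, 16, 16), name="A")
B = te.compute(A.shape, lambda n, c, h, w: A[n, c, h, w] + 1.0, name="B")
s = te.create_schedule(B.op)

# NCHW -> NCHW4c.  The returned axes describe the transformed layout,
# so the innermost returned axis (c % 4) is contiguous in memory and
# therefore safe to vectorize.
n, co, h, w, ci = s[B].transform_layout(lambda n, c, h, w: [n, c // 4, h, w, c % 4])
s[B].vectorize(ci)
```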
Following a video chat with @csullivan, documenting some of the key points of the conversation.
Following discussion with @tqchen, this RFC has had significant updates made. The major change is that instead of extending the capabilities of …
rfcs/0039-buffer-physical-layout.md (Outdated)
> - If a series of nested loops contains a `cache_read` or `cache_write` stage, can these be recognized and reordered?
>
> […]
>
> - Option 3: Expose the `reorder_split` definition to be used as part …
could you also explain how it works?
In TE, the transformed axes would be defined in the return value from `s[A].transform_layout(...)`, so that the remainder of the TE schedule can refer to `.op.axis` of the return value. (Updating the wording in the RFC shortly.)

In TensorIR, I think it would be done similarly, with `AA = T.transform(A, lambda ...)` returning the transformed buffer. The loops in the schedule could then be written over `T.grid(*AA.shape)`. I'm not as familiar with `tvm.script.tir`, so does that sound reasonable?
sounds reasonable, this will allow scheduling on the transformed axes
I'd suggest adding the …
Sounds good, and I've added a description of it, and a possible data structure for it.
Thanks for updating the RFC. Here are some follow-up thoughts: Usage of …
The new schedule primitive …
That's correct, the … For the axes returned, I'd want to add a third option.
Then my thoughts on each option are below, with T2 as my preferred option.
When the transformation is performed, would that include fusing axes together? If it does, then the … I like the idea of having the transformation be done eagerly, to avoid needing to pass through additional information, but I'm not quite seeing how it could be done. (I had been trying an initial implementation that way on the TE side.)
Thanks for adding the discussion points. I understand the difficulty of implementing it as an eager transform in TE, mainly because most other schedule primitives are not done eagerly as they are in TIR. So adding a rewrite pass for …
When adding …
My thoughts on TIR and …
I had initially placed … On further thought, this potential future extension would still be possible with the definition in the … Having a single global definition of the buffer transformations would also make the order clearer in cases of multiple transformations. A transformation may be more easily expressed as multiple sequential transformations, such as an initial transformation of a matrix into tiles for transfer to a GPU, then another transformation for ordering the elements into groups of size …
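As a hedged illustration of the sequential-transformation point above (the tile size of 4 and group size of 2 are arbitrary stand-ins, since the original sizes were elided), one layout change can be written as two index-mapping functions applied in order:

```python
# First transformation: break a [H, W] matrix into 4x4 tiles.
def tile_4x4(i, j):
    return [i // 4, j // 4, i % 4, j % 4]

# Second transformation, applied to the tiled layout: order the
# elements of each tile into groups of size 2.
def group_of_2(io, jo, ii, ji):
    return [io, jo, ii, ji // 2, ji % 2]
```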
That would minimize the number of changes to the TIR, though it would increase the number of distinct changes that are made while lowering. I had been thinking of flattening to the underlying physical layout as a special case of a transformation, one which happens to define a row-major traversal. Good point, though, on how the current flattening depends on parameters outside of the buffer itself. As such, the transformation representing the flattening couldn't be generated initially. The concepts could be combined at some point in the future, if the flattening is made to depend only on the buffer, but that should be a separate change. Between those two points, I think that would be that …
Following a video chat discussion with @vinx13 , we touched on a number of points, summarized below. Also, we are adding @vinx13 as a co-author on this RFC.
@Lunderberg did a pass through this, a couple comments
> doesn't hold in all cases. For example, texture memory on a GPU requires two indices to access.
>
> […]
>
> In addition, computations that are semantically identical (e.g. 2-d …
sure, but this is kind of the core piece of TVM today. could you elaborate why this impacts this RFC in particular? (i suspect it has something to do with moving the logical shape information, but want to clarify)
That part was intended to lead into the "including the option to present multi-dimensional indices to the low-level code generators" phrase a few paragraphs down. The first two paragraphs in the Motivation section describe portions of the TVM codebase that could be improved, and the third indicates how these improvements would be made.

I'll add a clarifying sentence, "These are currently handled on a case-by-case basis, such as using `tvm::tir::builtin::texture2d_store` in a `CallNode`.", so that the first paragraph connects better with the third.
> - After applying the transformations, the …
will it then be impossible to trace this backwards in a TIR program lowered through the compiler? how can we debug?
Debugging would be done by tracing changes forward during the compilation process, such as with `tvm.ir.instrument` to view the before and after changes made by each lowering pass. I think this is the preferred direction for debugging, as it also handles existing cases, such as `Prefetch` nodes that are lowered to for-loops that implement the prefetching.
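A minimal sketch of that forward-tracing approach, using the pass-instrument API available in recent TVM versions (the instrument below simply prints the module around every pass; in practice you would filter by pass name):

```python
import tvm

@tvm.instrument.pass_instrument
class TraceLowering:
    """Print the IRModule before and after every lowering pass."""

    def run_before_pass(self, mod, info):
        print(f"=== before {info.name} ===\n{mod}")

    def run_after_pass(self, mod, info):
        print(f"=== after {info.name} ===\n{mod}")

with tvm.transform.PassContext(instruments=[TraceLowering()]):
    pass  # call tvm.lower(...) / tvm.build(...) here
```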
My concern with leaving the transformation attribute defined after usage is that the validity of an IRModule, or even what computation it will produce, then depends on where that IRModule occurs in the lowering process. Examples of each are below.
- Ambiguously valid: Consider a transformation from NCHW to NCHWc layout, `lambda n,c,h,w: [n, c//4, h, w, c%4]`. This transformation is only valid to apply on a rank-4 tensor, and produces a rank-5 tensor. If the annotation is kept in the TIR graph following the transformation, then an invalid graph (incorrectly attempting to apply the transform on a rank-5 tensor) and a valid graph (already having applied the transform onto a rank-4 tensor, producing a rank-5 tensor) would have the same representation.

- Ambiguous computation: Consider a transformation from NCHW to NHWC layout, `lambda n,c,h,w: [n,h,w,c]`, and the following pseudo-code compute definition.

  ```python
  buf.shape = [A, B, C, D]
  buf.layout_transforms = [lambda n, c, h, w: [n, h, w, c]]
  for a in range(A):
      for b in range(B):
          acc = 0.0
          for c in range(C):
              for d in range(D):
                  acc += buf[a, b, c, d]
          for c in range(C):
              for d in range(D):
                  buf[a, b, c, d] = buf[a, b, c, d] / acc
  ```

  If the `layout_transforms` annotation is left in the TIR graph after being applied, then the above compute definition could be performing normalization along the height/width dimensions if the transformation hasn't yet been applied, or along the width/channel dimensions if the transformation has been applied.
rfcs/0039-buffer-physical-layout.md (Outdated)
> the `"buffer_axis_separators"` attribute of a primfunc. All buffers whose `BufferNode::data` is a key in this map should be flattened to an output buffer of dimension `separators[buf->data].size()+1`. All other buffers should be …
is the +1 related to the buffer_axis_separator? i'm not quite grasping the value of this map.
This map is what determines the rank of buffers, and the number of indices, that are presented to the low-level code generator. Rather than always flattening all input buffer axes into a single output buffer axis, the axis separators define groups of input buffer axes, and each group is flattened into an output buffer axis.

The `+1` is because having `N` dividers between items in a list produces `N+1` groups of items.
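A hedged sketch of that grouping rule (not the TVM implementation; the function name and list-of-ints representation are illustrative only):

```python
def flatten_shape(shape, separators):
    """Group `shape` at each separator and flatten each group.

    N separators yield N+1 groups, hence the flattened rank of
    len(separators) + 1.
    """
    bounds = [0, *separators, len(shape)]
    out = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        extent = 1
        for dim in shape[start:end]:
            extent *= dim
        out.append(extent)
    return out

# One separator between axes 1 and 2 flattens a rank-4 buffer to the
# rank-2 form needed for, e.g., 2-d texture memory.
assert flatten_shape([16, 4, 32, 32], separators=[2]) == [64, 1024]
```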
rfcs/0039-buffer-physical-layout.md (Outdated)
```python
# After flattening to 1-d
x = Var(name="x")
with Allocate(x, shape=[3*2]):
    val = BufferLoad(x, [15*2 + 10])
```
one thing i've always wished was a bit easier to decipher is these address computations: not because you can't decipher them, but because you're almost always interested in something else, and having to push that thing onto the "stack" while you decipher the address computation is distracting. shall we introduce a type to represent these which could output additional debug info in codegen?
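A hedged sketch of the kind of helper being suggested (the class name and fields are hypothetical, reusing the `15*2 + 10` example above): an index object that keeps the logical indices and strides it was built from, so debug output can show both forms.

```python
from dataclasses import dataclass

@dataclass
class FlattenedIndex:
    """A flat index that remembers its logical origin, for debug output."""
    logical_indices: tuple  # e.g. (15, 10)
    strides: tuple          # e.g. (2, 1) for a row-major buffer with row length 2

    @property
    def flat(self) -> int:
        return sum(i * s for i, s in zip(self.logical_indices, self.strides))

    def __str__(self) -> str:
        terms = " + ".join(f"{i}*{s}" for i, s in zip(self.logical_indices, self.strides))
        return f"{terms} (= {self.flat}, from logical {self.logical_indices})"

print(FlattenedIndex((15, 10), (2, 1)))  # 15*2 + 10*1 (= 40, from logical (15, 10))
```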
> # Drawbacks
> [drawbacks]: #drawbacks
>
> This change may make it more difficult to reason about the memory …
this is my main concern with this proposal. we should be careful that we make sure we keep the same level of clarity around which memory access pattern a schedule will trigger. while i think that efforts such as AutoScheduler will already make this more challenging than it is today (and therefore, this concern isn't confined to this proposal), we should make sure we evaluate this each time we complicate things.
Agreed entirely. My goal has been to have the default behavior, both for TE and for TIR, remain the same. The differences in memory access patterns would only occur when explicitly opted into, with `Schedule.transform_layout` in TE, or with the annotations in TIR.
> - If a series of nested loops contains a `cache_read` or `cache_write` stage, can these be recognized and reordered?
>
> […]
>
> - Option 3: Expose the transformed axes to be used as part of a …
does this mean you'd write the schedule in terms of the physical layout? if so then many of my previous concerns about debuggability go away.
It does, yes. The `transform_layout` method will return an object representing the transformed tensor. Later calls, such as `s.vectorize` or `s.reorder`, will accept the axes of the transformed tensor.
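A brief hedged usage sketch of that flow (toy shapes; assuming `transform_layout` returns the transformed axes, as in the examples added later in this thread):

```python
import tvm
from tvm import te

A = te.placeholder((128, 128), name="A")
B = te.compute(A.shape, lambda i, j: A[i, j] * 2.0, name="B")
s = te.create_schedule(B.op)

# Transpose the layout of B; later schedule calls such as reorder
# accept the returned transformed axes directly.
jo, io = s[B].transform_layout(lambda i, j: [j, i])
s[B].reorder(jo, io)
```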
The API side of …
Since we decided against RFC#0042, the reference is no longer needed here.
Also, removed implementation details that didn't match the implementation. A new stage wasn't necessary, and the loop iterators could instead be updated in the existing stage, similar to `fuse()` and `split()`.
Sounds good! I've updated with examples of scheduling with the returned new axes, which work in the implementation posted in PR#9727.
@Lunderberg Hi, I am much interested in … We have implemented some TensorIR primitives to serve similar purposes, in the form below, to mutate the …

```python
s.replace_buffer_obj(block, write_buffer_idx, ...)  # trailing args: a set of rewrite callbacks
```

Since generally all buffer accesses are multi-dimensional in the TensorIR schedule phase, the implementation is a bit easier than in TE (just something like a pass to substitute the buffer object, which does not require full-pipeline modifications), if no extra representative variables are introduced. Is the …

```python
s.transform_layout(block, buffer_idx, remap_func)
```

Another form we use is just a duality of the loop transformation primitives, where representative variables for a buffer, or for an axis of a buffer, are tracked in the TensorIR schedule states:

```python
n, c, h, w = s.get_buffer_axes(block, buffer_idx)
c_outer, c_inner = s.split_for_buffer(c, factor=8)
s.reorder_for_buffer(n, c_outer, h, w, c_inner)

# ... for `transform_layout`, would it look like this?
buffer_rv = s.get_buffer(some_identifier)
new_buffer_rv = s.transform_layout(buffer_rv, remap_func)
```

Is it possible to provide both an integrated … Very glad to know your opinions! :)
Initial example used a buffer shape of `[2,3]`, which was smaller than the indices used in the example.
@wrongtest I'm working on the TensorIR side and have a draft version of …
> The `te.compute` function can be used to define an updated layout. However, this introduces a new tensor that must be inlined to avoid additional memory allocation, and cannot be used for input parameters.
Reading this, I realized that your proposal might solve the issue of N-d input `dense` / `batch_matmul` support discussed in apache/tvm#8412 and https://discuss.tvm.apache.org/t/relay-nn-does-relay-nn-dense-supports-multi-dimensional-input/10343/7. The natural approach to enable N-d support is to wrap the existing 2-d compute / schedule definitions with `reshape`, and eliminate the explicit `reshape` via `compute_inline`.

So we would have `topi.reshape (N-d to 2-d) -> 2-d dense -> topi.reshape (2-d to N-d)`. While the first `reshape` can be trivially eliminated, the second one cannot (TensorIR has `reverse_compute_inline`, which may work for the second `reshape`). Currently, the `topi.reshape (N-d to 2-d) -> 2-d dense -> topi.reshape (2-d to N-d)` pipeline does not work because TE bound inference fails to infer a precise bound for the 2-d write buffer of the `dense` schedule. I don't fully understand the reason, but the failure seems to stem from the fact that the output buffer is N-d while the local write buffer is 2-d.

Since the underlying physical memory layout doesn't change before and after `reshape` (it just collapses multiple "batch" dimensions into one), libraries like cuDNN and MKL, which operate on pointers, support N-d `dense` out of the box. So introducing explicit reshapes and eliminating them after the fact doesn't feel right. I've been bothered by this problem for some time, but `transform_layout` sounds like a perfect solution. Do you think `transform_layout` will enable supporting N-d `dense` and `batch_matmul` using existing 2-d schedules?

Note that the exact same problem arises in im2col-based convolution: we need to "reshape" the 2-d GEMM output of shape (N * H * W, C) to NHWC layout. Since the underlying physical memory layout doesn't change, an explicit reshape should never be necessary. Overcoming this problem is a prerequisite for implementing tensorcore-based conv2d kernels based on the implicit GEMM algorithm.
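As a hedged illustration of why this is a pure index remapping (the function name is hypothetical, and H and W are assumed to be known constants), the mapping from the 2-d GEMM output back to NHWC can be written as a `transform_layout`-style index function rather than a reshape:

```python
# Map an index (row, c) of the (N*H*W, C) GEMM output to the
# corresponding (n, h, w, c) index of the NHWC tensor.  No data
# movement is implied; both sides address the same memory.
def gemm_out_to_nhwc(row, c, H=14, W=14):
    n = row // (H * W)
    h = (row % (H * W)) // W
    w = row % W
    return [n, h, w, c]

assert gemm_out_to_nhwc(14 * 14 + 2 * 14 + 3, 5) == [1, 2, 3, 5]
```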
I really like the application, and I think that it would be able to apply in this case with relatively small changes. I had mainly been considering cases where the computation being performed is semantically identical, but with a different underlying data layout (e.g. conv2d_nhwc and conv2d_hwcn). This is the reverse case, where the computation being performed is semantically different (e.g. "matrix multiply using indices 3/4 of a 4-d tensor" as opposed to "matrix multiply using indices 2/3 of a 3-d tensor"), but has an identical underlying data layout, and should share the same schedules.

Implementing it would have two steps. The first would allow `topi.nn.batch_matmul` to produce the different compute definitions for each tensor dimension, and the second would schedule any of those different compute definitions with the same shared function.
1. Modifying the compute definition to accept N-d tensors.
   - Extracting the tensor shapes as `*other_dim, N, M = tensor.shape` instead of `batch, N, M = tensor.shape`.
   - Defining the compute definition as `lambda *indices: A[(*indices[:-2], indices[-2], k)] * B[(*indices[:-2], k, indices[-1])]` instead of `lambda batch, i, j: A[batch, i, k] * B[batch, k, j]`. (I made an improved parameter inspection for `.transform_layout` that, if extracted out to be used by `te.compute`, could improve the readability to `lambda *other_indices, i, j: A[(*other_indices, i, k)] * B[(*other_indices, k, j)]`.)

2. Modifying the schedule definition to accept N-d tensors.
   - Defining the layout transformation function:

     ```python
     def flatten_all_but_last_two(*other_indices, i, j):
         batch = 0
         for dim, index in zip(C.shape, other_indices):
             batch = dim * batch + index
         return [batch, i, j]
     ```

   - Extracting the output tensor axes as `b, y, x = s[C].transform_layout(flatten_all_but_last_two)` instead of `b, y, x = s[C].op.axis`.
That said, as I was thinking through it, I don't think it strictly requires the `transform_layout` support in order to be implemented. Since there's no change to the underlying data layout, generalizing the schedule could be done by changing the loop iteration order, which can already be done with `Stage.fuse`.
- Modifying the schedule definition to accept N-d tensors.
  - Extracting the output tensor axes as `*other_axes, y, x = s[C].op.axis` instead of `b, y, x = s[C].op.axis`.
  - Fusing the batch axes as `b = s[C].fuse(*other_axes)`, as in the runnable sketch below.
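A minimal runnable sketch of that fuse-based generalization, using the existing TE API (the shapes and the toy compute are illustrative only):

```python
import tvm
from tvm import te

# Two leading "batch" dimensions ahead of the 2-d core computation.
A = te.placeholder((2, 3, 8, 8), name="A")
C = te.compute(A.shape, lambda *i: A(*i) * 2.0, name="C")
s = te.create_schedule(C.op)

# Collapse all leading axes into a single batch axis, then schedule
# exactly as the existing fixed-rank schedule would.
*other_axes, y, x = s[C].op.axis
b = s[C].fuse(*other_axes)
s[C].reorder(b, y, x)
print(tvm.lower(s, [A, C], simple_mode=True))
```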
> That said, as I was thinking through it, I don't think it strictly requires the `transform_layout` support in order to be implemented.
Thanks, I've been thinking exclusively about using the 2-d compute / schedules as they are with `reshape + inline`, so this simple solution didn't really occur to me. Partly because I was (1) lazy and (2) not sure if we can express "N-d indexing" with TE (like the `A[(*indices[:-2], indices[-2], k)]` you showed above). I did try the `b = s[C].fuse(*other_axes)` thing because that's required anyway for the reshape-based path as well, and that's where I hit the issue of imprecise bounds inference (second paragraph in my post above).
I was making a wrong deduction that removing the need for explicit `reshape + inline` would let me work around the limitation of TE bounds inference, and so `transform_layout` would "solve" this long-standing issue of mine. But I realized that these are distinct problems and that rewriting the compute and schedule for N-d wouldn't address the underlying problem. Thanks for helping me clear my thoughts!
> all indices as named arguments, or an arbitrary number of indices.
>
> […]
>
> - What convention should be used for buffer indexing?
Thanks for the update; I agree with the discussion here. It would be great to also mention the actions needed for the existing passes / codegen, or the plan for future work.
Thank you all for the discussion @Lunderberg @vinx13 @areusch!
This RFC introduces a hard boundary between the "logical layout" of a mathematical tensor and the "physical layout" of a buffer in memory, along with a specification for defining the conversion between the two. Following discussion, this RFC has been heavily modified to support more flexible layout transformation annotations, not just a single transformation from logical layout to physical layout.
Rendered markdown link