
[RFC][TIR] Layout transformations on buffer access #39

Merged · 20 commits · Feb 17, 2022

Conversation

@Lunderberg (Contributor) commented Oct 5, 2021

This RFC introduces a hard boundary between the “logical layout” of a mathematical tensor and the “physical layout” of a buffer in memory, along with a specification for defining the conversion between the two.

Following discussion, this RFC has been heavily modified to support more flexible layout transformation annotations, not just a single transformation from logical layout to physical layout.

Rendered markdown link

@tmoreau89

CC @kparzysz-quic @manupa-arm

@tmoreau89

@vinx13 @junrushao1994

@vinx13 (Member) commented Oct 6, 2021

Thanks @Lunderberg for the RFC. Logical-physical mapping is definitely an important feature. I also implemented something similar for warp memory to support tensor core instructions on GPUs; I'm happy to collaborate further to converge on a unified design.
Some preliminary comments:
The current representation of the logical-physical layout mapping is to use an array of axis/factor pairs to define how the logical axes are split/reordered/fused to form the physical axes. This works for packed layouts like NCHW4c, but we might need to think about whether this is a generic way to represent the mapping. For example, another way is to use a mapping function: (n, c, h, w) -> (n, tir.floordiv(c, 4), h, w, tir.floormod(c, 4)). This would allow arbitrary mapping (we can add more restrictions like requiring affine mapping though, to make analysis easier). A possible use case of a more complex mapping is the permuted layout for shared memory on CUDA.
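
As a concrete illustration of the mapping-function alternative, a minimal sketch in plain Python (integer arithmetic stands in for tir.floordiv/tir.floormod; this is not a proposed API):

    # NCHW -> NCHW4c written as a mapping from logical indices to physical indices.
    nchw4c = lambda n, c, h, w: [n, c // 4, h, w, c % 4]

    # Logical element (n=0, c=5, h=2, w=3) lands at physical (0, 1, 2, 3, 1).
    assert nchw4c(0, 5, 2, 3) == [0, 1, 2, 3, 1]
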
Also, there is related affine analysis infrastructure available; it would be great if we could reuse it for loop analysis and rewriting.

@Lunderberg (Contributor, Author)

Thank you for the comments, @vinx13.

For example, another way is to use a mapping function: (n, c, h, w) -> (n, tir.floordiv(c, 4), h, w, tir.floormod(c, 4)). This would allow arbitrary mapping (we can add more restrictions like requiring affine mapping though, to make analysis easier).

Having an arbitrary mapping function was something that I considered as an alternative, and would like to use as part of the internal representation. However, for the user-facing API, I prefer having a representation that cannot express an illegal internal state, rather than relying on validation. The arbitrary mapping function could have more than one logical index map to the same physical index (e.g. (i,j) -> (i+j,)), which would need to be caught after the fact.

Never mind, I'm convinced after going through your point below about CUDA's permuted layout.

A possible use case of a more complex mapping is the permuted layout for shared memory on CUDA.

Thank you for the example, and this sort of example was what I was hoping to get in order to determine whether the representation of reorders/splits is sufficiently flexible. If I'm reading the link correctly, I agree that the physical layout in shared memory cannot be expressed in the proposed notation. If the global memory is 2-d memory (i,j), the closest representation of the physical layout in shared memory would be ((i,2), (j,4), (j,8), SEP, (i,4)). This would maintain the 128-bit vectors (j,8) and the transpose (because (j,8) appears in the first physical coordinate), but wouldn't handle the XOR condition needed for conflict-free access. It also couldn't specify that the (i,2) in the first physical coordinate would need to take precedence over the (i,4) in the second physical coordinate. In the functional notation, this would be (i,j) -> (floormod(i,2), (floormod(row,8)) ^ (floordiv(j,8)), floormod(j,8), SEP, row), where row = floordiv(i, 2).
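
For reference, the functional-notation mapping above written as a plain-Python sketch (illustrative only; SEP in the original notation marks the boundary between the two physical coordinates):

    def permuted_shared_layout(i, j):
        # row = floordiv(i, 2); the XOR term is what cannot be expressed with
        # reorder/split/fuse alone, which is why a general mapping function is needed.
        row = i // 2
        first_coord = [i % 2, (row % 8) ^ (j // 8), j % 8]  # before SEP
        second_coord = [row]                                # after SEP
        return first_coord + second_coord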

I am convinced: while the earlier notation may be useful as a more readable input in some cases, it doesn't cover all use cases. At most, it could be an alternative form of input, but the function definition mapping from logical indices to physical indices should be the primary representation.

In terms of how to represent this in the TIR graph, I'll have to think a bit on it. I'm picturing either BufferNode holding Array<Var> index_vars and Array<IterSumExpr> physical_layout, or making a TIR variant of tvm::relay::FunctionNode.

Also, there is related affine analysis infrastructure available; it would be great if we could reuse it for loop analysis and rewriting.

Thank you for mentioning these. I intend to use these utilities for the implementation, and will add them to the detail section.

@vinx13 (Member) commented Oct 6, 2021

One way to represent the layout mapping in TIR is to introduce different storage scopes and have a registry of pre-defined layout mappings (for example, we already did something similar for wmma fragments, a special data structure for tensor core input). The downside is that the TIR itself doesn't contain the mapping; the meaning of the TIR depends on the contents of the registry. If we can have a few pre-defined mappings that are general enough to be used across different operators, this might be fine. A possible way to embed the mapping inside TIR is to build the PrimExpr from the mapping function, similar to what we did for te.compute.

@Lunderberg (Contributor, Author) commented Oct 6, 2021

True, and that would allow both for user-defined mappings, and for specifying standard layouts. I have a bit of concern with using the scope parameter to also describe layouts, since in my mind "scope" refers to where the entire buffer is located, while "layout" refers to how individual elements are arranged within the buffer. Two buffers that have the same layout and different scopes could be memcpy-ed between the two scopes without additional rearrangement, and that would be obscured if the scope parameter is used to define both the scope and the layout.

I really like that idea of building the TIR representation based on the provided function. Like with te.compute, that gives a convenient way to allow the library to handle the var/primexpr building, so long as the user provides the function to compute the physical indices.

Edit: The signature could also be made more similar to te.compute by having the index variables be passed in individually (e.g. lambda n,h,w,c: [n,c,h,w]) rather than as a single list. This would avoid having unnecessarily different interfaces between the two.
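
A minimal sketch of how the library could build index variables from such a lambda, in the spirit of te.compute (the helper name build_index_map is hypothetical, not part of this proposal):

    import inspect

    import tvm

    def build_index_map(func, dtype="int32"):
        """Create one TIR Var per lambda parameter and apply the mapping to them."""
        names = list(inspect.signature(func).parameters)
        index_vars = [tvm.tir.Var(name, dtype) for name in names]
        physical_indices = func(*index_vars)
        return index_vars, physical_indices

    # e.g. NHWC logical indices mapped to an NCHW physical layout
    logical_vars, physical_exprs = build_index_map(lambda n, h, w, c: [n, c, h, w])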

@kparzysz-quic

How will vectorization work? If there is a vectorize directive spanning a logical extent, will the vectorization pass create multidimensional ramps? How will vector loads and stores be represented?

@Lunderberg (Contributor, Author)

How will vectorization work? If there is a vectorize directive spanning a logical extent, will the vectorization pass create multidimensional ramps? How will vector loads and stores be represented?

While in principle a vectorized load/store could fall back to a non-vectorized load/store, this would result in ignoring the vectorize directive entirely, which would be rather unexpected from the user's perspective. Therefore, my plan for the initial implementation is to require the elements to be contiguous in both the logical and physical layouts if they are vectorized. In both the physical and logical layouts, the vector load/store would be represented using a RampNode with stride==1.
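
For illustration, a sketch of that internal representation (constructed directly here only for exposition; in practice the RampNode would be inserted by tir.transform.VectorizeLoop rather than written by hand):

    import tvm

    # A contiguous 4-lane vector load starting at logical index i*4,
    # represented with a stride-1 Ramp.
    buf = tvm.tir.decl_buffer((128,), "float32", name="A")
    i = tvm.tir.Var("i", "int32")
    vector_load = tvm.tir.BufferLoad(buf, [tvm.tir.Ramp(i * 4, 1, 4)])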

This restriction could then be loosened by allowing schedules to be expressed in terms of the physical layout, rather than the logical layout. I haven't thought through the full implications of this, which would likely involve either introducing an inlined Stage whose axes are the physical axes, or by exposing the physical axes as a separate parameter within the same stage.

@Lunderberg (Contributor, Author)

Following a video chat with @csullivan, documenting some of the key points of the conversation.

  • Setting the physical layout in a TE-based schedule has two roles. One is to rewrite the buffer itself, and the other is to define the order of iteration when writing to the buffer. In the latter use case, the schedule should have access to the physical axes for use in the schedule.
  • Setting the physical layout is roughly equivalent to using an inlined te.compute to define a re-shaped tensor. However, using te.compute requires the compute definition to change, whereas set_physical_layout is entirely contained in the schedule.
  • For a given tensor, either the physical axes or the logical axes may be used in scheduling, but not both. We were unable to think of use cases where you would want to change the physical layout but still maintain the loop order of the logical layout. One idea that came out of this conversation was to have set_physical_layout act similarly to cache_read and cache_write, which introduce a new stage.
  • If a physical layout is defined for a cache stage, then the generated code should perform the layout transformation as part of generating that cache.

@Lunderberg (Contributor, Author)

Following discussion with @tqchen , this RFC has had significant updates made. The major change is that instead of extending the capabilities of Store and Load nodes to support N-d indices, they would instead be removed in favor of keeping BufferStore and BufferLoad nodes throughout the lowering process.

@Lunderberg changed the title from "[RFC][TIR] Separate physical and logical layout of buffers" to "[RFC][TIR] Layout transformations on buffer access" on Oct 28, 2021
- If a series of nested loops contains a `cache_read` or
`cache_write` stage, can these be recognized and reordered?

- Option 3: Expose the `reorder_split` definition to be used as part
@vinx13 (Member) commented Oct 29, 2021


could you also explain how it works?

@Lunderberg (Contributor, Author):

In TE, the transformed axes would be defined in the return value from s[A].transform_layout(...), so that the remainder of the TE schedule can refer to .op.axis of the return value. (Updating the wording in the RFC shortly.)

In TensorIR, I think it would be done similarly, with AA = T.transform(A, lambda ...) returning the transformed buffer. The loops in the schedule could then be written over T.grid(*AA.shape). I'm not as familiar with tvm.script.tir, so does that sound reasonable?

(Member):

sounds reasonable, this will allow scheduling on the transformed axes

@vinx13 (Member) commented Oct 29, 2021

I'd suggest adding the BufferTransform data structure here, which will be very helpful to other readers.

@Lunderberg (Contributor, Author)

I'd suggest adding the BufferTransform data structure here, which will be very helpful to other readers.

Sounds good, and I've added a description of it, and a possible data structure for it.

@vinx13 (Member) commented Nov 2, 2021

Thanks for updating the RFC. Here are some follow-up thoughts:

Usage of te.AXIS_SEPARATOR: it seems this is only used on the API side but not in BufferTransform; it would be good to get some clarification. Also, I see some tradeoffs here that are worth discussing:

  • T0: using te.AXIS_SEPARATOR to separate groups of axes, e.g. lambda m,n,p,q: [m, n, te.AXIS_SEPARATOR, p, q]. The benefit is that writing m, n directly implies fusion of m and n.
  • T1: each element in the array returned by the lambda is an axis. The benefit is that this looks simpler, but when fusion is needed (e.g. m, n in the above example), we need to manually write down the affine mapping (m * length_of_n + n) and make sure it is correct.

The new schedule primitive buffer_transform:
From TIR's perspective, the same primitive can be introduced to TIR as well. It will be an eager transformation (without needing to create a BufferTransform node) that rewrites both the producer and consumer immediately upon calling it. This is required if we want to further schedule the transformed block. I can follow up with an implementation in TIR later.

@Lunderberg (Contributor, Author)

Usage of te.AXIS_SEPARATOR: it seems this is only used on the API side but not in BufferTransform; it would be good to get some clarification.

That's correct, the te.AXIS_SEPARATOR only appears in the API for the TE schedules, and not in the TIR graph generated from the TE schedule. I've updated the RFC with a paragraph on this distinction.

For the axes returned, I'd want to add a third option.

  • T2: Using te.AXIS_SEPARATOR to separate groups of axes, e.g. AA = s[A].transform_layout(lambda i, j: [j//4, te.AXIS_SEPARATOR, i, j%4]), directly implies fusion of the i and j%4 axes, but exposes each transformed axis for use in the TE schedule. That is, AA.op.axis has three elements, whose iteration bounds are [A.shape[1]//4, A.shape[0], 4].

Then my thoughts on each option are below, with T2 as my preferred option.

  • T0: This is easy to write in TE, but any additional manipulation of the axes (e.g. changing the loop ordering independent of the data layout) would require splitting the returned axis to match the layout transformation. It would also mean that lambda *indices: indices isn't the identity function, because it would flatten the buffer as part of that transformation.
  • T1: As you mentioned, this would require explicitly writing down the transformation to fuse the axes, which would be error-prone. It would also make the default behavior, where all buffer axes are fused into a single index, more of a special case.
  • T2: Doesn't require the user to specify the transformation to fuse the axes, and allows the corresponding loops to be easily reorganized. The fusion in the default case, where no transform_layout is applied, then follows the same pattern, where a BufferTransform node is inserted to fuse all remaining axes into num_axis_separators+1 == 1 axes. (A sketch of this option in a TE schedule is below.)
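
A sketch of what option T2 could look like in a TE schedule, following the form this RFC discussion converges on (transform_layout returning the transformed axes; exact spellings of transform_layout and te.AXIS_SEPARATOR may differ in the final implementation):

    import tvm
    from tvm import te

    A = te.placeholder((16, 64), name="A")
    B = te.compute(A.shape, lambda i, j: A[i, j] + 1.0, name="B")
    s = te.create_schedule(B.op)

    # j//4 becomes the first physical axis; i and j%4, written without a
    # separator between them, are fused into the second physical axis.
    # The transformed axes (extents [64 // 4, 16, 4]) remain available for
    # further scheduling.
    jo, i, ji = s[B].transform_layout(
        lambda i, j: [j // 4, te.AXIS_SEPARATOR, i, j % 4]
    )
    s[B].reorder(jo, i, ji)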

From TIR's perspective, the same primitive can be introduced to TIR as well. It will be an eager transformation (without needing to create a BufferTransform node) that rewrites both the producer and consumer immediately upon calling it. This is required if we want to further schedule the transformed block. I can follow up with an implementation in TIR later.

When the transformation is performed, would that include fusing axes together? If it does, then the FlattenBuffer pass would need to distinguish between a 2-d buffer that should be flattened to a flat 1-d buffer, and a 2-d buffer that has already been flattened from some larger dimension. If it doesn't, then the FlattenBuffer pass would need to distinguish between an N-d buffer that should be flattened to a 1-d buffer, and an N-d buffer that should be flattened to a 2-d buffer. In either case, FlattenBuffer would need some additional metadata to know how the buffer should be handled.

I like the idea of having the transformation be done eagerly, to avoid needing to pass through additional information, but I'm not quite seeing how it could be done. (I had been trying an initial implementation that way on the TE side.)

@vinx13 (Member) commented Nov 3, 2021

Thanks for adding the discussion points.

I understand the difficulty of implementing it as an eager transform in TE, mainly because most other schedule primitives are not done eagerly as they are in TIR. So adding a rewrite pass for BufferTransform makes sense to me.

Should BufferTransform apply only to its body, or apply to the entire graph it is contained in?
Option 2 is preferred.

When adding BufferTransform as a statement, we will need to make sure the semantics are clear. In the provided example, adding BufferTransform immediately after Allocate has the clear meaning of Option 2, that it should be applied everywhere the buffer appears. What if BufferTransform appears in the middle, e.g. Allocate -> some buffer accesses -> BufferTransform -> some other buffer accesses? Since Option 2 suggests the transform is global, shall we consider making BufferTransform part of the function attributes?

When the transformation is performed, would that include fusing axes together?

My thoughts on TIR and FlattenBuffer are that buffer_transform will return unfused axes, and FlattenBuffer should decide how to fuse them according to buffer scope or other auxiliary information. This implies a separation between the buffer transform and the underlying physical layout requirement (e.g. 1-D by default in most cases). I think the idea here is close to T2. Using AXIS_SEPARATOR is also a potential way to specify the behavior of FlattenBuffer. On the other hand, we might also need to deliberate on whether the buffer transform and the flattening to the physical buffer should be coupled.

@Lunderberg (Contributor, Author)

Since Option 2 suggests the transform is global, shall we consider making BufferTransform part of the function attributes?

I had initially placed BufferTransform as a statement so that it could be possible to extend it to have a transformation defined by references to variables within the function body. This would allow other transforms that require rewriting memory layouts to add a BufferTransform defining the layout, relying on a later pass to implement the transformation itself. For example, tir.transform.InjectDoubleBuffer could be implemented by adding a buffer transform lambda *indices: [loop_iter%2, *indices], and the load/store rewrites of tir.transform.VectorizeLoop could be implemented by adding a buffer transform lambda *indices: [*indices, loop_iter].
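
To make the shape of those hypothetical transforms concrete (a sketch only; loop_iter stands in for the loop variable that the pass would capture, and none of this is an existing API):

    import tvm

    # Hypothetical: the loop variable that InjectDoubleBuffer / VectorizeLoop
    # would capture when constructing the layout transform.
    loop_iter = tvm.tir.Var("loop_iter", "int32")

    # Double-buffering: prepend a ping-pong axis selected by loop_iter % 2.
    double_buffer_transform = lambda *indices: [loop_iter % 2, *indices]

    # Vectorization: append the vectorized loop variable as the innermost axis.
    vectorize_transform = lambda *indices: [*indices, loop_iter]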

On further thought, this potential future extension would still be possible with the definition in the PrimFunc, so long as visitors that modify the loop_iter also modify the transformation, so I support having the transformations as a global definition, and will update the RFC.

Having a single global definition of the buffer transformations would also make the order clearer in cases of multiple transformations. A transformation may be more easily expressed as multiple sequential transformations, such as an initial transformation of a matrix into tiles for transfer to a GPU, followed by another transformation ordering the elements into groups of size warp_size. If these were defined in different portions of the TIR call graph, the order of the transformations would depend on the traversal order.

This implies a separation between the buffer transform and the underlying physical layout requirement (e.g. 1-D by default in most cases).

That would minimize the number of changes to the TIR, though it would increase the number of distinct changes that are made while lowering. I had been thinking of the flattening to the underlying physical layout requirement as a special case of a transformation, which happens to define a row-major traversal.

Good point, though, on how the current flattening depends on the parameters outside of the buffer itself. As such, the transformation representing the flattening couldn't be generated initially. The concepts could be combined at some point in the future, if the flattening is made to depend only on the buffer, but that should be a separate change.

Between those two points, I think the conclusion is that PrimFuncAttrs::attrs should include a map from buffers to a list of layout transformations, and a map from buffers to a list of axis separator locations. The layout transformations would be consumed by the pass that rearranges the layout, and the list of axis separator locations would be consumed by either StorageFlatten or FlattenBuffer.
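
A rough sketch of the structure being proposed (plain Python placeholders only; "buffer_axis_separators" is the attribute name used elsewhere in the RFC text, while "layout_transform_map" and the buffer key are illustrative names):

    proposed_primfunc_attrs = {
        # Consumed by the pass that rearranges the layout: an ordered list of
        # transformations per buffer, keyed by the buffer's data variable.
        "layout_transform_map": {
            "A.data": [lambda n, c, h, w: [n, c // 4, h, w, c % 4]],
        },
        # Consumed by StorageFlatten/FlattenBuffer: axis-separator positions,
        # so N separators produce an (N+1)-d flattened buffer.
        "buffer_axis_separators": {
            "A.data": [2],
        },
    }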

@Lunderberg (Contributor, Author)

Following a video chat discussion with @vinx13 , we touched on a number of points, summarized below. Also, we are adding @vinx13 as a co-author on this RFC.

  • Are there cases where the flattening in StorageFlatten/FlattenBuffer should be inferred from buffer properties, rather than explicitly specified by the user? For example, if a buffer has "texture" scope, then we know it must be flattened to a 2-d buffer. We concluded that this wouldn't be possible, because the number of resulting dimensions isn't sufficient to define the flattening being applied. For example, if a 4-d buffer is being flattened to 2-d for use in texture memory, the four initial axes [A, B, C, D] could be flattened to [A, fuse(B,C,D)], [fuse(A,B), fuse(C,D)], or [fuse(A,B,C), D], without any clear method that is better or worse.

  • How will buffer layout transformations be represented in TensorIR schedules? buffer_transform will be a primitive transformation in TensorIR, which is eagerly applied on the TensorIR computation.

    • In all cases, this would rewrite the buffer shape, and would rewrite loads/stores of that buffer.
    • If these loads/stores occur within a series of nested loops that cover all values of the buffer, and have no additional computation (e.g. cache read/write) in the body of these loops, then the loops will be rewritten to be along the transformed axes, and the remainder of the schedule can be written in terms of the transformed axes. Otherwise, rewriting the loops would not be well-defined, and will not be done.
    • The recommendation for use will be to apply the layout transformations prior to any other scheduling passes that could impact the loop structure, so that rewriting of the loops is possible.
  • Should buffer flattening be implemented as a special case of layout transformation? Buffer flattening should remain a separate concept from the layout transforms. Where all other layout transformations can be performed eagerly, and should be before other scheduling passes, buffer flattening must be performed after other scheduling passes. If it were done eagerly, other passes wouldn't have sufficient information about the structure of the buffer.

  • Is deprecating Store/Load acceptable, instead using BufferStore/BufferLoad throughout all lowering steps? Yes, as this gives a single uniform way to access buffers, regardless of the lowering step. The one concern is that we should port all existing functionality. For example, the vload/vstore methods in Buffer, which currently return Load/Store respectively, should not be removed, and instead should be updated to return flattened BufferLoad/BufferStore.

  • RampNode should be treated as a compiler internal, and shouldn't be easily constructible by users as indices into buffers. The preferred method to represent vectorized access is to have a buffer access within a vectorized loop, then allow tir.transform.VectorizeLoop to insert the RampNode. This matches previous behavior, where RampNode could occur in flattened Store/Load, while BufferLoad/BufferStore avoided RampNodes to maintain easy analysis of accessed locations.

  • Passes that change buffer dimensionality (e.g. InjectDoubleBuffer) should either be moved before StorageFlatten/FlattenBuffer, or should be rewritten to resize the buffer rather than changing the dimensionality. The former would require the pass to also update the axis separators to be used when flattening.

@areusch (Contributor) left a comment:

@Lunderberg did a pass through this, a couple comments

doesn't hold in all cases. For example, texture memory on a GPU
requires two indices to access.

In addition, computations that are semantically identical (e.g. 2-d
(Contributor):

sure, but this is kind of the core piece of TVM today. could you elaborate why this impacts this RFC in particular? (i suspect it has something to do with moving the logical shape information, but want to clarify)

@Lunderberg (Contributor, Author):

That part was intended to lead into the "including the option to present multi-dimensional indices to the low-level code generators" phrase a few paragraphs down. That the first two paragraphs in the Motivation section describe portions of the TVM codebase that could be improved, and the third indicates how these improvements would be made.

I'll add a clarifying sentence "These are currently handled on a case-by-case basis, such as using tvm::tir::builtin::texture2d_store in a CallNode." so that the first paragraph connects better with the third.

};
```

- After applying the transformations, the
(Contributor):

will it then be impossible to trace this backwards in a TIR program lowered through the compiler? how can we debug?

@Lunderberg (Contributor, Author):

Debugging would be done by tracing changes forward during the compilation process, such as with tvm.ir.instrument to view the before and after changes made from each lowering pass. I think this is the preferred direction for debugging, as it also handles existing cases, such as Prefetch nodes that are lowered to for-loops that implement the prefetching.

My concern with leaving the transformation attribute defined after usage is that the validity of an IRModule, or even what computation it will produce, then depends on where that IRModule occurs in the lowering process. Examples of each are below.

  • Ambiguously valid: Consider a transformation from NCHW to NCHWc layout, lambda n,c,h,w: [n, c//4, h, w, c%4]. This transformation is only valid to apply on a rank-4 tensor, and produces a rank-5 tensor. If the annotation is kept in the TIR graph following the transformation, then an invalid graph (incorrectly attempting to apply the transform on a rank-5 tensor) and a valid graph (already having applied the transform onto a rank-4 tensor, producing a rank-5) would have the same representation.

  • Ambiguous computation: Consider a transformation from NCHW to NHWC layout, lambda n,c,h,w: [n,h,w,c], and the following pseudo-code compute definition.

    buf.shape = [A,B,C,D]
    buf.layout_transforms = [lambda n,c,h,w: [n,h,w,c]]
    
    for a in range(A):
        for b in range(B):
            acc = 0.0
            for c in range(C):
                for d in range(D):
                    acc += buf[a,b,c,d]
                    
            for c in range(C):
                for d in range(D):
                    buf[a,b,c,d] = buf[a,b,c,d] / acc

    If the layout_transforms annotation is left in the TIR graph after being applied, then the above compute definition could be performing normalization along the height/width dimensions if the transformation hasn't yet been applied, or along the width/channel dimensions if the transformation has been applied.

the `"buffer_axis_separators"` attribute of a primfunc. All
buffers whose `BufferNode::data` is a key in this map should be
flattened to an output buffer of dimension
`separators[buf->data].size()+1`. All other buffers should be
(Contributor):

is the +1 related to the buffer_axis_separator? i'm not quite grasping the value of this map.

@Lunderberg (Contributor, Author):

This map is what determines the rank of buffers, and the number of indices, that are presented to the low-level code generator. Rather than always flattening all input buffer axes into a single output buffer axis, the axis separators define groups of input buffer axes, and each group is flattened into an output buffer axis.

The +1 is because having N dividers between items in a list produces N+1 groups of items.
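
A minimal pure-Python sketch of that grouping rule (the helper name is hypothetical; it only illustrates why N separators yield N+1 output axes):

    def group_axes(shape, separators):
        groups, start = [], 0
        for sep in list(separators) + [len(shape)]:
            groups.append(shape[start:sep])
            start = sep
        return groups

    # A 4-d buffer flattened to 2-d (e.g. for texture memory), separator after axis 2:
    assert group_axes([1, 16, 32, 4], [2]) == [[1, 16], [32, 4]]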

# After flattening to 1-d
x = Var(name="x")
with Allocate(x, shape=[3*2]):
val = BufferLoad(x, [15*2 + 10])
(Contributor):

one thing i've always wished was a bit easier to decipher was these address computations--not because you can't decipher them, but because you're almost always interested in something else, and having to push that thing on the "stack" while you decipher the address computation is distracting. shall we introduce a type to represent these which could output additional debug info in codegen?

# Drawbacks
[drawbacks]: #drawbacks

This change may make it more difficult to reason about the memory
(Contributor):

this is my main concern with this proposal. we should be careful that we make sure we keep the same level of clarity around which memory access pattern a schedule will trigger. while i think that efforts such as AutoScheduler will already make this more challenging than it is today (and therefore, this concern isn't confined to this proposal), we should make sure we evaluate this each time we complicate things.

@Lunderberg (Contributor, Author):

Agreed entirely. My goal has been to have the default behavior, both for TE and for TIR, remain the same. The differences in memory access patterns would only occur when explicitly opted into, with Schedule.transform_layout in TE, or with the annotations in TIR.

- If a series of nested loops contains a `cache_read` or
`cache_write` stage, can these be recognized and reordered?

- Option 3: Expose the transformed axes to be used as part of a
(Contributor):

does this mean you'd write the schedule in terms of the physical layout? if so then many of my previous concerns about debuggability go away.

@Lunderberg (Contributor, Author):

It does, yes. The transform_layout method will return an object representing the transformed tensor. Later calls, such as s.vectorize or s.reorder, will accept the axes of the transformed tensor.

@vinx13 (Member) commented Dec 7, 2021

The API side of transform_layout looks good; let's add additional examples of scheduling with the returned new axes and get this in.

Since we decided against RFC#0042, the reference is no longer needed
here.
Also, removed implementation details that didn't match the
implementation.  A new stage wasn't necessary, and the loop iterators
could instead be updated in the existing stage, similar to `fuse()`
and `split()`.
@Lunderberg (Contributor, Author)

Sounds good! I've updated with examples of scheduling with the returned new axes, which work in the implementation posted in PR#9727.

@wrongtest-intellif commented Dec 14, 2021

@Lunderberg Hi, I am very interested in transform_layout, but my team depends entirely on the TensorIR schedule rather than TE. Could you kindly provide more design points on the TensorIR side? It would be great if we could use this preview feature in TensorIR; it would be really useful for us.

We have implemented some TensorIR primitives that serve similar purposes, in the form below, which mutate the Buffer object's layout, strides, and dtypes.

s.replace_buffer_obj(block, write_buffer_idx, *a set of rewrite callbacks)

Since buffer accesses are generally multi-dimensional in the TensorIR scheduling phase, the implementation is a bit easier than in TE (just something like a pass that substitutes the buffer object, without requiring full-pipeline modifications), as long as no extra representative variables are introduced. Would transform_layout also be like the above?

s.transform_layout(block, buffer_idx, remap_func)

Another form we use is a dual of the loop transformation primitives, where representative variables for a buffer or a buffer axis are tracked in the TensorIR schedule state.

n, c, h, w = s.get_buffer_axes(block, buffer_idx)
c_outer, c_inner = s.split_for_buffer(c, factor=8)
s.reorder_for_buffer(n, c_outer, h, w, c_inner)

# or would `transform_layout` look like this?
buffer_rv = s.get_buffer(some identifier)
new_buffer_rv = s.transform_layout(buffer_rv, remap_func)

Is it possible to provide both an integrated transform_layout primitive and step-by-step primitives, for the user's convenience?

Very glad to know your opinions! :)

Initial example used a buffer shape of `[2,3]`, which was smaller than
the indices used in the example.
@vinx13 (Member) commented Dec 14, 2021

@wrongtest I'm working on the TensorIR side and have a draft version of transform_layout. The current implementation is
void TransformLayout(ScheduleState self, const StmtSRef& block_sref, int buffer_index, bool is_write_index, const IndexMap& index_map);
It applies the mapping function to all access indices of the target buffer, infers the new shape of the buffer, and substitutes it with the new buffer. I'll send a PR for that soon.

The `te.compute` function can be used to define an updated layout.
However, this introduces a new tensor that must be inlined to avoid
additional memory allocation, and cannot be used for input
parameters.
@masahi (Member) commented Dec 31, 2021


Reading this, I realized that your proposal might solve the issue of N-d input dense / batch_matmul support discussed in apache/tvm#8412 and https://discuss.tvm.apache.org/t/relay-nn-does-relay-nn-dense-supports-multi-dimensional-input/10343/7. The natural approach to enable N-d support is to wrap the existing 2-d compute / schedule definitions with reshape, and eliminate explicit reshape via compute_inline.

So we will have topi.reshape (N-d to 2-d) -> 2-d dense -> topi.reshape (2-d to N-d). While the first reshape can be trivially eliminated, the second one cannot (TensorIR has reverse_compute_inline, which may work for the second reshape). Currently, the topi.reshape (N-d to 2-d) -> 2-d dense -> topi.reshape (2-d to N-d) pipeline does not work because TE bound inference fails to infer a precise bound for the 2-d write buffer of the dense schedule. I don't fully understand the reason, but it seems the failure stems from the fact that the output buffer is N-d while the local write buffer is 2-d.

Since the underlying physical memory layout doesn't change before and after reshape (it just collapses multiple "batch" dimensions into one), libraries like cuDNN and MKL, which operate on pointers, support N-d dense out of the box. So introducing explicit reshapes and eliminating them after the fact doesn't feel right. I've been bothered by this problem for some time, but transform_layout sounds like a perfect solution. Do you think transform_layout will enable supporting N-d dense and batch_matmul using existing 2-d schedules?

Note that the exact same problem arises in im2col based convolution: We need to "reshape" 2-D GEMM output of shape (N * H * W, C) to NHWC layout. Since the underlying physical memory layout doesn't change, explicit reshape should never be necessary. Overcoming this problem is a prerequisite for implementing tensorcore-based conv2d kernels based on the implicit GEMM algorithm.

@Lunderberg (Contributor, Author):

I really like the application, and I think that it would be able to apply in this case with relatively small changes. I had mainly been considering cases where the computation being performed is semantically identical, but with a different underlying data layout. (e.g. conv2d_nhwc and conv2d_hwcn) This is the reverse case, where the computation being performed is semantically different (e.g. "matrix multiply using indices 3/4 of a 4-d tensor" as opposed to "matrix multiply using indices 2/3 of a 3-d tensor"), but has an identical underlying data layout, and should share the same schedules.

Implementing it would have two steps. The first would allow the topi.nn.batch_matmul to produce the different compute definitions for each tensor dimension, and the second would schedule any of those different compute definitions with the same shared function.

  • Modifying the compute definition to accept N-d tensors.

    • Extracting the tensor shapes as *other_dim, N, M = tensor.shape instead of batch, N, M = tensor.shape.
    • Defining the compute definition as lambda *indices: A[(*indices[:-2], indices[-2], k)] * B[(*indices[:-2], k, indices[-1])] instead of lambda batch, i, j: A[batch, i, k] * B[batch, k, j]. (I made an improved parameter inspection for .transform_layout that, if extracted out to be used by te.compute, could improve the readability to lambda *other_indices, i, j: A[(*other_indices, i, k)] * B[(*other_indices, k, j)].)
  • Modifying the schedule definition to accept N-d tensors.

    • Defining layout transformation function
      def flatten_all_but_last_two(*other_indices, i, j):
          batch = 0
          for dim, index in zip(C.shape, other_indices):
              batch = dim * batch + index
          return [batch, i, j]
    • Extracting the output tensor axes as b, y, x = s[C].transform_layout(flatten_all_but_last_two) instead of b, y, x = s[C].op.axis.

That said, as I was thinking through it, I don't think it strictly requires the transform_layout support in order to be implemented. Since there's no change to the underlying data layout, generalizing the schedule could be done by changing the loop iteration order, which can already be done with Stage.fuse.

  • Modifying the schedule definition to accept N-d tensors.
    • Extracting the output tensor axes as *other_axes, y, x = s[C].op.axis instead of b, y, x = s[C].op.axis.
    • Fusing the batch axes as b = s[C].fuse(*other_axes). (A runnable sketch of this approach is below.)
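
For concreteness, a runnable sketch of this fuse-based approach using only existing TE primitives (shapes and names are made up, and this only demonstrates the indexing and fusion pattern, not a full batch_matmul schedule):

    import tvm
    from tvm import te

    # Two leading "batch" dimensions followed by the (i, k) / (k, j) matmul dims.
    A = te.placeholder((2, 3, 16, 32), name="A")
    B = te.placeholder((2, 3, 32, 8), name="B")
    k = te.reduce_axis((0, 32), name="k")
    C = te.compute(
        (2, 3, 16, 8),
        lambda *idx: te.sum(
            A[(*idx[:-2], idx[-2], k)] * B[(*idx[:-2], k, idx[-1])], axis=k
        ),
        name="C",
    )

    s = te.create_schedule(C.op)
    *other_axes, y, x = s[C].op.axis
    b = s[C].fuse(*other_axes)  # single fused "batch" axis, as in a 2-d schedule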

@masahi (Member):

That said, as I was thinking through it, I don't think it strictly requires the transform_layout support in order to be implemented.

Thanks, I've been thinking exclusively about using the 2-d compute / schedules as they are with reshape + inline, so this simple solution didn't really occur to me. Partly because I was (1) lazy and (2) not sure if we can express "N-d indexing" with TE (like the A[(*indices[:-2], indices[-2], k)] you showed above). I did try the b = s[C].fuse(*other_axes) approach because that's required anyway for the reshape-based path as well, and that's where I hit the issue of imprecise bound inference (second paragraph in my post above).

I was making the wrong deduction that removing the need for explicit reshape + inline would let me work around the limitation of TE bound inference, and so transform_layout would "solve" this long-standing issue of mine. But I realized that these are distinct problems and that rewriting the compute and schedule for N-d wouldn't address the underlying problem. Thanks for helping me clear my thoughts!

all indices as named arguments, or an arbitrary number of indices.


- What convention should be used for buffer indexing?
(Member):

Thanks for the update, I agree with the discussion here. It would be great to also mention the actions needed for the existing passes / codegen, or the plan for future work.

@TejashShah commented Feb 16, 2022

cc @manupa-arm @Mousius

@junrushao (Member)

Thank you all for the discussion @Lunderberg @vinx13 @areusch!

@junrushao merged commit b675ef8 into apache:main on Feb 17, 2022.