[TIR][Schedule] DecomposePadding #12174
Merged
Hi there, this PR introduces a new TIR schedule primitive, `schd.decompose_padding(block, loop)`.

For padded `conv` or `pooling` ops, there is a typical padding pattern:

`Pad[_] = T.if_then_else(pad_predicate, X[_], pad_value)`

which can be decomposed into two parts:

- a `memset` routine
- a `memcpy` routine

The primitive's signature is similar to that of `decompose_reduction`: it takes a target block and a loop position at which to insert the newly created "init" block. It is helpful for infrastructures with high-performance memset/memcpy routines, and it relieves the main compute block of the complexity of handling padding conditions.

Example

`decompose_padding(block, i)`
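To illustrate what the decomposition means semantically, here is a minimal sketch in plain Python rather than TIR. It is not the output of the primitive. The function names `pad_fused` and `pad_decomposed` and the 1-D symmetric padding are assumptions made only for illustration.

```python
def pad_fused(x, pad, pad_value=0.0):
    """Padding as one predicated write per element, mirroring
    Pad[i] = if_then_else(pad <= i < pad + n, X[i - pad], pad_value)."""
    n = len(x)
    out = [None] * (n + 2 * pad)
    for i in range(n + 2 * pad):
        in_bounds = pad <= i < pad + n
        # The conditional expression only reads x when the predicate holds.
        out[i] = x[i - pad] if in_bounds else pad_value
    return out


def pad_decomposed(x, pad, pad_value=0.0):
    """The same result split into a fill step and a copy step."""
    n = len(x)
    # Part 1: memset-like fill of the whole padded buffer with pad_value.
    out = [pad_value] * (n + 2 * pad)
    # Part 2: memcpy-like copy of the in-bound region, with no predicate.
    out[pad:pad + n] = x
    return out


x = [1.0, 2.0, 3.0]
assert pad_fused(x, 2) == pad_decomposed(x, 2)
```

Both functions write the same buffer, but the second form contains no per-element condition, which is the shape a backend with fast fill and copy routines could map onto them.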
Alternatives and drawbacks
From the graph perspective, one may be able to fold away the block that pads the input buffer entirely. The primitive is more useful, however, when the padding is performed on buffers internal to a PrimFunc.
One could also `compute-inline` the block that performs the padding. This introduces conditions into the main computation block, which may or may not be optimized well, depending on the concrete target.

Currently there are scheduling limitations on the blocks created by the decomposition. They cannot subsequently be `compute-at`ed or `compute-inline`d, because multiple blocks writing to the same buffer break the stage pipeline property.
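To picture the compute-inline alternative mentioned above, the following plain-Python sketch (not TIR, and not generated by any schedule primitive) contrasts a 1-D convolution that reads from a separately padded buffer with one where the padding predicate is inlined into the reduction. The function names and the zero pad value are assumptions for illustration.

```python
def conv1d_with_pad_buffer(x, w, pad):
    """Pad into an intermediate buffer first, then run a condition-free loop."""
    n, k = len(x), len(w)
    padded = [0.0] * (n + 2 * pad)   # fill step
    padded[pad:pad + n] = x          # copy step
    return [
        sum(padded[i + j] * w[j] for j in range(k))
        for i in range(n + 2 * pad - k + 1)
    ]


def conv1d_inlined(x, w, pad):
    """Padding inlined: the bounds check sits inside the innermost loop."""
    n, k = len(x), len(w)
    out = []
    for i in range(n + 2 * pad - k + 1):
        acc = 0.0
        for j in range(k):
            src = i + j - pad
            # This predicate is evaluated once per multiply-accumulate.
            acc += (x[src] if 0 <= src < n else 0.0) * w[j]
        out.append(acc)
    return out


x, w = [1.0, 2.0, 3.0], [1.0, 1.0, 1.0]
assert conv1d_with_pad_buffer(x, w, 1) == conv1d_inlined(x, w, 1)
```

The two versions compute the same values. The difference is only where the padding condition lives, which is what determines how well a given target can optimize the main loop.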