[Runtime] Change default alignment to 64 bytes. #12564
Conversation
@tqchen it might be worth having you take a look to see whether there are any potential problems with this change.
python/tvm/tir/schedule/schedule.py
Outdated
-    A = T.match_buffer(a, (16, 16), align=128, offset_factor=1)
-    B = T.match_buffer(b, (16, 16), align=128, offset_factor=1)
-    C = T.match_buffer(c, (16, 16), align=128, offset_factor=1)
+    A = T.match_buffer(a, (16, 16), align=64, offset_factor=1)
+    B = T.match_buffer(b, (16, 16), align=64, offset_factor=1)
+    C = T.match_buffer(c, (16, 16), align=64, offset_factor=1)
mma does use align=128, so let's revert this change here :-)
Ah, makes sense, reverted. Is there anywhere else we'd want 128-byte alignment? I just did a global search and replace.
Not super familiar with the non-CUDA cases; probably @Hzfengsy @spectrometerHBH could help!
Not sure about other backends, but it's fine if we can pass the tests.
python/tvm/tir/tensor_intrin/cuda.py
Outdated
         offset_factor=16,
         scope="wmma.matrix_b",
     )
     C = T.match_buffer(
-        c, (m_dim, n_dim), out_dtype, align=128, offset_factor=16, scope="wmma.accumulator"
+        c, (m_dim, n_dim), out_dtype, align=64, offset_factor=16, scope="wmma.accumulator"
I think all CUDA code wants 128-bit alignment in practice. cc @vinx13
Yeah, I'll update this spot as well.
Nit: bits and bytes conflict between the commit message and the title. Is it 128-byte alignment or 128-bit alignment? 😄
Hexagon HVX vectors should be aligned at a 128-byte boundary; if not, we'd have to use unaligned loads/stores. I think we have some code that relies on the current alignment. This should really be a target option; maybe it's time to refactor this code...
@kparzysz-quic Fortunately, Hexagon does use its own target-specific alignment.
I guess it's ok then.
There are two parts of the perspective that come into the story:

P1 affects how the compiler generates code: e.g. if we assume that the data is aligned to 128 bytes, the compiler can generate aligned stores/loads to make efficient use of it. P0 affects the arrays we allocate: we always need to allocate memory whose alignment satisfies the P1 assumption made by the compiler. Right now the two are roughly tied together, with some exceptions.

The PyTorch issue was caused by P1: we are making an assumption of 128-byte alignment, while PyTorch arrays are commonly aligned to 64 bytes. Currently, P1 is supplied by the buffer construct, which does not (yet) depend on the target.

Agree with @kparzysz-quic that the minimum alignment requirement can be part of the target, and we need to find good ways to set it on the buffer. As a short-term workaround, we can detect the P1 settings (of buffers) in an early stage of TIR compilation and amend them for devices like HVX, or provide such assumptions in the HVX build. Note that our runtime already satisfies these assumptions (P0).
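The P0/P1 split above can be illustrated outside TVM with plain numpy: a P0-style over-aligned allocation guarantees that any P1-style compiler assumption up to that boundary holds. The helper names below are illustrative, not TVM API.

```python
import numpy as np

def is_aligned(arr: np.ndarray, alignment: int) -> bool:
    # P1-style check: does the data pointer sit on the assumed boundary?
    return arr.ctypes.data % alignment == 0

def aligned_empty(shape, dtype, alignment):
    # P0-style allocation: over-allocate by `alignment` bytes, then slice
    # to an aligned offset so the compiler's assumption is satisfied.
    itemsize = np.dtype(dtype).itemsize
    nbytes = int(np.prod(shape)) * itemsize
    buf = np.empty(nbytes + alignment, dtype=np.uint8)
    offset = (-buf.ctypes.data) % alignment
    return buf[offset:offset + nbytes].view(dtype).reshape(shape)

a = aligned_empty((16, 16), "float32", 64)
assert is_aligned(a, 64)
```

A kernel built under a 64-byte P1 assumption is then safe on any buffer produced this way, whereas an externally produced array offers no such guarantee.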
CC @ekalda, @lhutton1 - what's the impact of changing the alignment from 128 to 64 in the context of Ethos-U? Is this likely to cause a meaningful performance or correctness issue in this case? Ramana
Thanks for that context, @tqchen. Would you like to see any of your suggested improvements in this PR, or do you think this is an appropriate first step?
I don't think this will cause problems for Ethos-U, since its accesses are 16-byte aligned (but let's see if the tests pass). As to why this failure is not visible in CI, could it be related to this issue? #12529
The previous comment #3163 (comment) would make it work for temp data allocations (change src/tir/transforms/ir_utils.h, line 166 at 8146a9b).
For buffer declarations, the main issue is that Buffer creation is not target-dependent as of now; as a result, https://github.com/apache/tvm/blob/main/src/tir/ir/buffer.cc#L570 defaults to the global value. We can introduce a minimum_data_alignment attr on the target and modify buffer->data_alignment when the user runs things through build. Note that all of these only affect the default behavior; we can always explicitly declare the buffer with the correct alignment assumption for a target.
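The proposed lookup could behave like the sketch below. This is a hypothetical model of the idea, not actual TVM API: the attribute name, table, and target kinds are assumptions for illustration.

```python
# Hypothetical sketch (not TVM API): choose a default buffer alignment
# from the target, falling back to a global default when the target
# declares no minimum.
GLOBAL_DEFAULT_ALIGNMENT = 64  # bytes; the new default proposed in this PR

# Hypothetical per-target minimum-alignment table.
TARGET_MIN_ALIGNMENT = {
    "hexagon": 128,  # HVX vectors want 128-byte boundaries
    "cuda": 128,
}

def default_data_alignment(target_kind: str) -> int:
    # Use the target's declared minimum if present, else the global default.
    return TARGET_MIN_ALIGNMENT.get(target_kind, GLOBAL_DEFAULT_ALIGNMENT)

print(default_data_alignment("hexagon"), default_data_alignment("llvm"))
```

An explicit `align=` in the buffer declaration would still override this default, so only unannotated buffers are affected.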
One change made in #5252 (which added support for Hexagon to the runtime) was increasing the byte alignment from 64 to 128. This can cause problems when interacting with dlpack. For example, tests/python/contrib/test_dlpack.py has a high chance of failing when run locally due to torch returning tensors with 64- rather than 128-byte alignment. I'm not sure why it doesn't fail in CI; perhaps the consistency of the environment means it always returns an appropriately aligned tensor.

Changing the default alignment allows interoperability with both torch and newer versions of numpy that support dlpack. I've slightly modified the torch test to run multiple times to make sure its behavior is consistent.
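The intermittent nature of the failure can be probed without torch: measure what boundary an ordinary allocator actually hands back and compare it with the 128-byte assumption. The helper below is illustrative.

```python
import numpy as np

def alignment_of(arr: np.ndarray, max_check: int = 256) -> int:
    # Largest power-of-two byte boundary the data pointer falls on.
    ptr = arr.ctypes.data
    align = 1
    while align < max_check and ptr % (align * 2) == 0:
        align *= 2
    return align

# General-purpose allocators commonly return 16- or 64-byte aligned
# buffers; code compiled under a 128-byte assumption may therefore fail
# only sometimes, depending on what the allocator happened to return.
samples = [alignment_of(np.empty(1024, dtype="float32")) for _ in range(100)]
print(min(samples), max(samples))
```

This also explains why a repeated run of the torch test is a better check than a single run: one lucky allocation can mask the mismatch.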
@kparzysz-quic do you think there are any drawbacks to decreasing the alignment requirement?
cc @areusch