
[Runtime] Change default alignment to 64 bytes. #12564

Closed · wants to merge 7 commits

Conversation

jwfromm (Contributor) commented Aug 23, 2022

One change made in #5252 (which added support for Hexagon to the runtime) was increasing the byte alignment from 64 to 128. This can cause problems when interacting with dlpack. For example, tests/python/contrib/test_dlpack.py has a high chance of failing when run locally due to torch returning tensors with 64-byte rather than 128-byte alignment. I'm not sure why it doesn't fail in CI; perhaps the consistency of the CI environment means it always returns an appropriately aligned tensor.

Changing the default alignment allows interoperability with both torch and newer versions of numpy that support dlpack. I've slightly modified the torch test to run multiple times to make sure its behavior is consistent.
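
For reference, a minimal sketch of the interop path that trips on this assumption (the actual test is tests/python/contrib/test_dlpack.py; the snippet below only illustrates the zero-copy hand-off, not the exact test):

```python
import torch
import tvm

# torch allocations are commonly 64-byte aligned
x = torch.rand(56, 56)

# Zero-copy hand-off via DLPack: nothing is reallocated, so whatever
# alignment torch used is what TVM sees.
capsule = torch.utils.dlpack.to_dlpack(x)
y = tvm.nd.from_dlpack(capsule)

# If compiled TIR assumes 128-byte alignment for this buffer, a 64-byte
# aligned torch tensor can violate that assumption.
```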

@kparzysz-quic do you think there are any drawbacks to decreasing the bit alignment requirement?

cc @areusch

jwfromm requested a review from kparzysz-quic, August 23, 2022 21:01
jwfromm (Contributor Author) commented Aug 23, 2022

@tqchen might be worth having you take a look to see if there are any potential problems with this change.

github-actions bot requested a review from areusch, August 23, 2022 21:06
Comment on lines 2095 to 2097

```diff
-    A = T.match_buffer(a, (16, 16), align=128, offset_factor=1)
-    B = T.match_buffer(b, (16, 16), align=128, offset_factor=1)
-    C = T.match_buffer(c, (16, 16), align=128, offset_factor=1)
+    A = T.match_buffer(a, (16, 16), align=64, offset_factor=1)
+    B = T.match_buffer(b, (16, 16), align=64, offset_factor=1)
+    C = T.match_buffer(c, (16, 16), align=64, offset_factor=1)
```
Member

mma does use align=128, so let's revert this change here :-)

jwfromm (Contributor Author)

Ah, makes sense, reverted. Is there anywhere else we'd want 128-byte alignment? I just did a global search and replace.

Member

Not super familiar with non-CUDA cases. Probably @Hzfengsy or @spectrometerHBH could help!

Member

Not sure about other backends, but it's fine if we can pass the tests.

```diff
         offset_factor=16,
         scope="wmma.matrix_b",
     )
     C = T.match_buffer(
-        c, (m_dim, n_dim), out_dtype, align=128, offset_factor=16, scope="wmma.accumulator"
+        c, (m_dim, n_dim), out_dtype, align=64, offset_factor=16, scope="wmma.accumulator"
```
masahi (Member) commented Aug 24, 2022

I think all CUDA code wants 128-bit alignment in practice. cc @vinx13

jwfromm (Contributor Author)

Yeah I'll update this spot as well.

u99127 (Contributor) commented Aug 24, 2022

> One change made in #5252 (which added support for Hexagon to the runtime) was increasing the byte alignment from 64 to 128. This can cause problems when interacting with dlpack. For example, tests/python/contrib/test_dlpack.py has a high chance of failing when run locally due to torch returning tensors with 64-byte rather than 128-byte alignment. I'm not sure why it doesn't fail in CI; perhaps the consistency

Nit: bits and bytes conflict between the commit message and the title. Is it 128-byte alignment or 128-bit alignment 😄 ...

kparzysz-quic (Contributor)

Hexagon HVX vectors should be aligned at a 128-byte boundary. If not, we'd have to use unaligned loads/stores. I think we have some code that relies on the current alignment.

This should really be a target option, maybe it's time to refactor this code...

jwfromm changed the title from "[Runtime] Change default alignment to 64 bits." to "[Runtime] Change default alignment to 64 bytes." on Aug 24, 2022
jwfromm (Contributor Author) commented Aug 24, 2022

@kparzysz-quic Fortunately Hexagon does use its own target-specific alignment kHexagonAllocAlignment, so it would not be affected by this change. For other hardware targets like CUDA, we could do something similar where we define k{TARGET}AllocAlignment in src/runtime/{TARGET}/{TARGET}_common.h and use kAllocAlignment as a fallback default.
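
To illustrate that layout, here is a hypothetical sketch of the lookup, written in Python for clarity; in TVM these would be C++ constants (kHexagonAllocAlignment exists today, while the CUDA constant and the helper below are placeholder names):

```python
# Per-target allocation alignment, with the runtime default as fallback.
TARGET_ALLOC_ALIGNMENT = {
    "hexagon": 128,  # kHexagonAllocAlignment: HVX vectors want 128-byte boundaries
    "cuda": 128,     # hypothetical kCudaAllocAlignment
}
DEFAULT_ALLOC_ALIGNMENT = 64  # the new default kAllocAlignment from this PR


def alloc_alignment(target_kind: str) -> int:
    """Return the allocation alignment in bytes for a target kind."""
    return TARGET_ALLOC_ALIGNMENT.get(target_kind, DEFAULT_ALLOC_ALIGNMENT)
```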

kparzysz-quic (Contributor)

> Fortunately Hexagon does use its own target-specific alignment kHexagonAllocAlignment, so it would not be affected by this change.

I guess it's ok then.

tqchen (Member) commented Aug 24, 2022

There are two perspectives that come into the story:

  • P0: The alignment of the allocator
  • P1: The alignment assumption of the compiler

P1 affects how the compiler generates code; e.g., if we assume that a buffer is aligned to 128 bytes, then the compiler can generate aligned loads/stores to make efficient use of it. P0 affects the arrays we allocate: the memory we allocate always needs to satisfy (be at least as strict as) the P1 assumption made by the compiler.

Right now the two are roughly tied together, with the exception of kHexagonAllocAlignment, which enforces P0.

The PyTorch issue was caused by P1: we are making an assumption of 128-byte alignment while PyTorch arrays are commonly aligned to 64 bytes.

Currently, P1 is supplied by the buffer construct, which does not (yet) depend on the target. Agree with @kparzysz-quic that the minimum alignment requirement can be part of the target; we then need to find good ways to set it on the buffer.

As a short-term workaround, we can detect the P1 settings (of buffers) in an early stage of TIR compilation and amend them for devices like HVX, or provide such assumptions in the HVX build. Note that our runtime already satisfies these assumptions (P0) per kHexagonAllocAlignment; we just need to make sure the compiler can take advantage of them.
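
A rough sketch of what that early-stage amendment could look like as a TIR pass (no such pass exists in TVM as of this PR; the hard-coded 128 stands in for a value looked up from the target, and a complete implementation would also have to rewrite buffer references inside the function body):

```python
import tvm
from tvm import tir


def amend_buffer_alignment(min_align: int = 128):
    """Raise the declared data_alignment (P1) of a PrimFunc's I/O buffers."""

    @tvm.tir.transform.prim_func_pass(opt_level=0)
    def _pass(func, mod, ctx):
        new_buffer_map = {}
        for param, buf in func.buffer_map.items():
            if buf.data_alignment < min_align:
                # Re-declare the buffer with a stricter alignment assumption.
                buf = tir.decl_buffer(
                    buf.shape,
                    buf.dtype,
                    buf.name,
                    data=buf.data,
                    strides=buf.strides,
                    elem_offset=buf.elem_offset,
                    data_alignment=min_align,
                    offset_factor=buf.offset_factor,
                )
            new_buffer_map[param] = buf
        return tir.PrimFunc(func.params, func.body, func.ret_type, new_buffer_map, func.attrs)

    return _pass
```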

u99127 (Contributor) commented Aug 24, 2022

> One change made in #5252 (which added support for Hexagon to the runtime) was increasing the byte alignment from 64 to 128. This can cause problems when interacting with dlpack. For example, tests/python/contrib/test_dlpack.py has a high chance of failing when run locally due to torch returning tensors with 64-byte rather than 128-byte alignment. I'm not sure why it doesn't fail in CI; perhaps the consistency of the CI environment means it always returns an appropriately aligned tensor.
>
> Changing the default alignment allows interoperability with both torch and newer versions of numpy that support dlpack. I've slightly modified the torch test to run multiple times to make sure its behavior is consistent.
>
> @kparzysz-quic do you think there are any drawbacks to decreasing the bit alignment requirement?
>
> cc @areusch

CC @ekalda, @lhutton1 - What's the impact of changing the alignment from 128 to 64 in the context of Ethos-U? Is this likely to have a meaningful performance or correctness issue in this case?

Ramana

jwfromm (Contributor Author) commented Aug 24, 2022

Thanks for that context @tqchen. Would you like to see any of your suggested improvements in this PR or do you think this is an appropriate first step?

ekalda (Contributor) commented Aug 24, 2022

I don't think this will cause problems for Ethos-U, since its accesses are 16-byte aligned (but let's see if the tests pass). Regarding why this failure is not visible in CI, could it be related to this issue? #12529

tqchen (Member) commented Aug 24, 2022

The previous comment #3163 (comment) would make it work for temp data allocations (change `inline int GetTempAllocaAlignment(DataType type, int32_t const_size)` to depend on Target and return different values in different settings).

For buffer declarations, the main issue is that Buffer creation is not target-dependent as of now; as a result, https://github.com/apache/tvm/blob/main/src/tir/ir/buffer.cc#L570 defaults to the global value.

We can introduce a minimum_data_alignment attr to target, and modify buffer->data_alignment when the user runs things through build.

Note that all of these only affect the default behavior. We can always explicitly declare the buffer with the correct alignment assumption for a target.
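
For instance, the assumption can already be pinned explicitly in TVMScript, independent of the default (a minimal copy kernel, mirroring the T.match_buffer calls touched in this PR):

```python
from tvm.script import tir as T


@T.prim_func
def copy(a: T.handle, b: T.handle) -> None:
    # Explicit 128-byte alignment assumption on both buffers, regardless of
    # what the runtime default (kAllocAlignment) is.
    A = T.match_buffer(a, (16, 16), "float32", align=128, offset_factor=1)
    B = T.match_buffer(b, (16, 16), "float32", align=128, offset_factor=1)
    for i, j in T.grid(16, 16):
        with T.block("copy"):
            vi, vj = T.axis.remap("SS", [i, j])
            B[vi, vj] = A[vi, vj]
```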
