
[QST] Should this line be divided by Layout::kFactor? #502

Closed
peisun1115 opened this issue May 26, 2022 · 6 comments
Labels
question Question

Comments

peisun1115 (Contributor) commented May 26, 2022

Should this line be divided by Layout::kFactor? Or should it just use ref.stride(0) (the ref from the constructor)?

stride_ in this class is multiplied by kFactor in line 526, but add_tile_offset is supposed to add an offset at the threadblock-tile level.

Very likely I am wrong :) but could it be that this goes unnoticed because the code path is rarely exercised, since no caller sets coord.stride() > 0?

hwu36 (Collaborator) commented May 27, 2022

I actually think you are mostly correct, though I am not 100% sure without a unit test to confirm it. For the Crosswise layout, when we move along the K dimension to compute a GEMM, we move along the contiguous dimension, so coord.strided() is always 0.

Instead of dividing by kFactor, I think we need to divide by sections_. If you are interested, you can make the change and run all the unit tests first.

peisun1115 (Contributor, Author) commented

Thank you for your response! Why should it be sections_?

Suppose the tensor shape is pitch-linear (128, 64), i.e. contiguous c = 128, strided s = 64. The threadblock tile is (32, 32) and Element = half_t.

If I understand correctly:
ref.stride(0) = 128
Crosswise = 32
kFactor = 2
sections_ = 4
sections_per_stage_ = 1
Shape::kStrided = 32
kElementsPerAccess = 128 / 16 = 8
stride_ = 128 * 2 / 8 = 32

One strided-dim increment should add 128 * 32 elements?

But 1 * 32 * 32 * 8 = 128 * 32 * 2, which is kFactor = 2 times larger than that.

hwu36 (Collaborator) commented May 27, 2022

Thinking about it again, now I think you are correct. In units of elements:

coord.strided() * Crosswise * kFactor * Shape::kStrided / kFactor * sections_
= coord.strided() * Crosswise * Shape::kStrided * sections_
= coord.strided() * stride_ / kFactor * Shape::kStrided

peisun1115 (Contributor, Author) commented May 27, 2022

Thank you!

I created a pull request, but somehow I am not able to test it locally (it looks related to NVlabs/instant-ngp#119).

It failed to compile; maybe my local gcc version is not compatible. Maybe you can just fix it on your side. Thanks!

```
~/git/cutlass/build/examples/12_gemm_bias_relu$ make
Building CUDA object examples/12_gemm_bias_relu/CMakeFiles/12_gemm_bias_relu.dir/gemm_bias_relu.cu.o
/usr/include/c++/11/bits/std_function.h:435:145: error: parameter packs not expanded with ‘...’:
  435 |         function(_Functor&& __f)
      |                                 ^
/usr/include/c++/11/bits/std_function.h:435:145: note: ‘_ArgTypes’
/usr/include/c++/11/bits/std_function.h:530:146: error: parameter packs not expanded with ‘...’:
  530 |         operator=(_Functor&& __f)
      |                                  ^
/usr/include/c++/11/bits/std_function.h:530:146: note: ‘_ArgTypes’
make[2]: *** [examples/12_gemm_bias_relu/CMakeFiles/12_gemm_bias_relu.dir/build.make:76: examples/12_gemm_bias_relu/CMakeFiles/12_gemm_bias_relu.dir/gemm_bias_relu.cu.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:3593: examples/12_gemm_bias_relu/CMakeFiles/12_gemm_bias_relu.dir/all] Error 2
make: *** [Makefile:166: all] Error 2
```

mnicely (Collaborator) commented Jun 18, 2022

@peisun1115 did you resolve your issue?

peisun1115 (Contributor, Author) commented

Yes, it is resolved.

jgli pushed a commit to jgli/cutlass that referenced this issue Nov 14, 2024