
Fix how num_elements is computed in Dense node #512

Merged
merged 1 commit into taichi-dev:master on Feb 22, 2020

Conversation

k-ye
Member

@k-ye k-ye commented Feb 21, 2020

It seems to me that num_elements was miscomputed as element_size, so I'd like you to help verify whether this fix is correct. BTW, I guess the canonical way to write this would be

auto num_elements = meta->get_num_elements(meta, node);

But then it seems that StructMeta::get_num_elements just points to Dense_get_num_elements for dense SNodes?
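
For context, a minimal sketch of the distinction (illustrative only; the surrounding loop, the visit() call, and the exact helper signatures are my assumptions following the node_dense.h naming, not the actual runtime code):

    // Illustrative sketch, not the actual Taichi runtime.
    // The bug: the loop bound was the per-element byte size instead of the
    // number of elements stored in the dense SNode.
    int num_elements = Dense_get_num_elements(meta, node);
    for (int i = 0; i < num_elements; i++) {   // not: i < element_size
      visit(Dense_lookup_element(meta, node, i));  // visit() is hypothetical
    }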

(Sorry for the reduced work these days; I'm busy following the news on COVID-19 here in Tokyo 😂)

@k-ye k-ye requested a review from yuanming-hu February 21, 2020 14:30
@k-ye
Member Author

k-ye commented Feb 22, 2020

BTW, I'd like to support bitmasked SNodes in Metal and wonder if the following approach makes sense to you:

  1. I think we can follow the existing memory layout for dense + bitmasked, in the sense that the data comes first, followed by the bitmask metadata (I'll probably have to use 32 bits for each mask, but memory-wise it is the same)
  2. I don't really have access to those helpers defined in node_dense.h, so I will have to rewrite them in Metal. Fortunately, they are mostly trivial for dense SNodes.
    1. It seems like the only missing helper is SNodeLookupStmt::activate. The Metal backend already has some logic to deal with lookup_element.
    2. Same for SNodeOpStmt: at a minimum we need to support is_active and deactivate (a rough sketch of these bitmask helpers follows this list).
    3. I guess a really fancy way is to add another amalgamation step in the build process, so that some runtime helpers can be copied to that Metal helper file; then we won't have to duplicate code, sigh. But I'm not sure if amalgamation itself is easy to maintain... I thought you did something in the legacy Taichi?
  3. Inside codegen_llvm.h::create_offload_struct_for
    void create_offload_struct_for(OffloadedStmt *stmt, bool spmd = false) {

    I saw quite a chunk of code preparing the lower/upper bounds and loop indices, and refining physical coordinates. It also seems that each GPU thread handles not just one, but all elements whose indices are between lower and upper? I wonder if we can take the simpler approach here: each Metal thread just handles one single element, and its index is just thread_id + begin (i.e., identical to how range_for is handled). Will this cause a problem? On the other hand, I don't think this could handle the sparsity case in an efficient way for nested structures. For example: root.bitmask(ti.i, 16).bitmask(ti.j, 16). If index a in the first bitmask is not activated, the existing implementation can probably just skip the second bitmask as a whole for a?
  4. Offloaded clear_list and listgen tasks: I'll have to ignore these two. These two seem to manage the ListManager data structure, while ListManager itself is for managing memory? If we can't support dynamic memory in Metal (or OpenGL), can we simply ignore these two tasks, or all ListManager related ops?
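
For points 1 and 2, a rough sketch of what the 32-bit mask helpers could look like, written as host C++ for clarity (in Metal, the mask would live in device memory as atomic_uint and the calls would become atomic_fetch_or_explicit / atomic_fetch_and_explicit; all names here are illustrative, not existing Taichi code):

    #include <atomic>
    #include <cstdint>

    // One bit per element; the mask words live right after the dense data,
    // as described in point 1 above.
    bool is_active(const std::atomic<uint32_t> *mask, int i) {
      return (mask[i / 32].load(std::memory_order_relaxed) >> (i % 32)) & 1u;
    }

    void activate(std::atomic<uint32_t> *mask, int i) {
      // Atomic OR so that concurrent threads can set different bits safely.
      mask[i / 32].fetch_or(1u << (i % 32), std::memory_order_relaxed);
    }

    void deactivate(std::atomic<uint32_t> *mask, int i) {
      mask[i / 32].fetch_and(~(1u << (i % 32)), std::memory_order_relaxed);
    }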

@yuanming-hu
Member

It seems to me that num_elements was miscomputed as element_size.

Nice catch! I made a mistake there... Unfortunately, the tests are not strong enough to trigger this...

(Sorry for the reduced work these days; I'm busy following the news on COVID-19 here in Tokyo 😂)

Please do not feel pressured at all - I believe the first rule of open-source contribution is to have fun :-) Also please take care in Tokyo - there seems to be a risk that the coronavirus will break out... I'm traveling at the moment, so my response might be slightly delayed.

@k-ye
Member Author

k-ye commented Feb 22, 2020

Nice catch! I made a mistake there... Unfortunately, the tests are not strong enough to trigger this...

Yeah... is bitmasked enabled in LLVM Taichi?

Also please take care in Tokyo - there seems to be a risk that the coronavirus will break out...

Thanks for saying that...! I'm thinking of heading back to China, which IMO is a lot safer right now 😂

@yuanming-hu
Member

Yeah... is bitmasked enabled in LLVM Taichi?

It is, but rather untested...

BTW, I guess the canonical way to write this would be auto num_elements = meta->get_num_elements(meta, node); But then it seems that StructMeta::get_num_elements just points to Dense_get_num_elements for dense SNodes?

Not sure if I clearly get your point, but StructMeta and DenseMeta share the same bits.

I'm thinking about your bitmask in Metal plan now...

@k-ye
Member Author

k-ye commented Feb 22, 2020

Not sure if I clearly get your point, but StructMeta and DenseMeta share the same bits.

Hm, IMO the most standard way to write this is probably to call the getter of num_elements defined on StructMeta to get the number of elements. But in this case, that getter method is actually just Dense_get_num_elements. So instead of writing meta->get_num_elements(), I think we can explicitly write out Dense_get_num_elements()? (Or we don't even have to call anything at all, because according to the implementation, this number is always max_num_elements. But then we would couple this part with the implementation, so it's not that great...) I guess this is mostly a code-clarity question...

@yuanming-hu
Member

I think we can follow the existing memory layout for dense + bitmasked, in the sense that the data comes first, followed by the bitmask metadata (I'll probably have to use 32 bits for each mask, but memory-wise it is the same)

Yeah, 32-bit is reasonable. Actually, I'm not sure if the current 64-bit atomic_or implementation is more efficient than the 32-bit implementation. Maybe we should use 32-bit for CUDA as well.

I don't really have access to those helpers defined in node_dense.h, so I will have to rewrite them in Metal. Fortunately, they are mostly trivial for dense SNodes.

Yeah, rewriting them in Metal sounds easier to me. This is a bit unfortunate since there's some code duplication...

I guess a really fancy way is to add another amalgamation step in the build process, so that some runtime helpers can be copied to that Metal helper file; then we won't have to duplicate code, sigh. But I'm not sure if amalgamation itself is easy to maintain... I thought you did something in the legacy Taichi?

This is a fancy solution. I wouldn't worry too much about amalgamation (see https://github.com/taichi-dev/taichi/blob/legacy/misc/amalgamate.py). However, is the Metal language a superset of node_dense.h? We might end up defining some macros to make sure node_dense.h can be compiled by both C++ and Metal (and potentially OpenGL Compute Shaders/OpenCL, etc.). Not sure how tricky that is.
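
For illustration, the guards could look something like this (a sketch only; TI_DEVICE_SPACE is a made-up macro name, and the struct is reduced to the fields mentioned in this thread; __METAL_VERSION__ is predefined by the Metal compiler):

    // Sketch of a header meant to compile as both host C++ and Metal.
    #ifdef __METAL_VERSION__
    #define TI_DEVICE_SPACE device   // Metal needs an address-space qualifier
    #else
    #define TI_DEVICE_SPACE          // expands to nothing on the C++ side
    #endif

    struct DenseMeta {
      int element_size;
      int max_num_elements;
    };

    inline int Dense_get_num_elements(TI_DEVICE_SPACE DenseMeta *meta,
                                      TI_DEVICE_SPACE char * /*node*/) {
      return meta->max_num_elements;
    }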

@k-ye
Member Author

k-ye commented Feb 22, 2020

However, is the Metal language a superset of node_dense.h?

Metal is a subset of C++14, with some keyword extensions... So I guess for amalgamation, instead of having helpers.metal.h directly include node_dense.h, we might have to have a common helper header that can be included by both helpers.metal.h and node_dense.h. But I don't think there's much point in doing this if it cannot be shared more widely (e.g., it would be much more useful if this could also be used in the OpenGL compute shader backend).

@yuanming-hu
Member

(sorry about my slow reply)

Yeah, my feeling is that the complexity of the amalgamation approach will only pay off if the C-style data structure definitions can be used by more than one backend.

I can imagine that as we get more and more backends, maintaining the source code of these SNodes will become a disaster. Fortunately, most source-to-source backends use a C-like language, so reusing a single header file sounds like the only way to stay DRY.

I wonder if we can take the simpler approach here: each Metal thread just handles one single element, and its index is just thread_id + begin (i.e., identical to how range_for is handled). Will this cause a problem?

It is the grid-stride loop trick: https://devblogs.nvidia.com/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/ The goal is to avoid reading loop bounds back from GPU memory. If the lower/upper bounds are results of the previous kernel, you don't really want to do a device synchronization and read them back to CPU memory just to know the gridDim (which would depend on the previous kernel) of your current kernel.
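
Sketched as plain C++ (the per-element body and the parameter plumbing are hypothetical; on CUDA, thread_id and grid_size come from the built-in blockIdx/blockDim/threadIdx/gridDim values, and on Metal they would come from thread_position_in_grid / threads_per_grid):

    // Grid-stride loop: every thread strides over the whole [begin, end)
    // range, so the launch size does not have to depend on (end - begin),
    // and the bounds can stay in GPU memory.
    void offloaded_struct_for(int thread_id, int grid_size, int begin, int end) {
      for (int i = begin + thread_id; i < end; i += grid_size) {
        process_element(i);  // hypothetical per-element body
      }
    }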

These two seem to manage the ListManager data structure, while ListManager itself is for managing memory?

ListManager is a low-cost abstraction of a (very) long list. It can contain at most 2**31 elements, so you can consider it to be infinitely long.

If we can't support dynamic memory in Metal (or OpenGL), can we simply ignore these two tasks, or all ListManager related ops?

Actually, if we pre-allocate a huge chunk of memory for Metal and zero-fill it, then dynamic memory allocation is not impossible. The only downside is that we'll have non-adaptive memory consumption, and the user is in charge of computing an upper bound on memory usage. This is not too bad in many cases.
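
A minimal sketch of that idea, written as host C++ (on Metal the cursor would be a device atomic_uint and the buffer a pre-zeroed device allocation; this is an illustration of the scheme, not Taichi's actual allocator):

    #include <atomic>
    #include <cstddef>
    #include <cstdint>

    // All dynamic allocations come out of one big zero-filled buffer whose
    // capacity is the user-provided upper bound; nothing is ever freed here.
    struct PreallocatedPool {
      uint8_t *buffer = nullptr;       // pre-allocated, zero-filled memory
      size_t capacity = 0;             // upper bound chosen by the user
      std::atomic<size_t> cursor{0};

      // Returns nullptr once the user-provided bound is exhausted.
      uint8_t *allocate(size_t bytes) {
        size_t offset = cursor.fetch_add(bytes, std::memory_order_relaxed);
        return (offset + bytes <= capacity) ? buffer + offset : nullptr;
      }
    };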

@yuanming-hu
Member

Hm, IMO the most standard way to write this is probably to call the getter of num_elements defined on StructMeta to get the number of elements. But in this case, that getter method is actually just Dense_get_num_elements. So instead of writing meta->get_num_elements(), I think we can explicitly write out Dense_get_num_elements()? (Or we don't even have to call anything at all, because according to the implementation, this number is always max_num_elements. But then we would couple this part with the implementation, so it's not that great...) I guess this is mostly a code-clarity question...

Yeah you are right. I assume the LLVM inliner/optimizer will figure this out automatically.

@yuanming-hu
Member

Merging this in, but feel free to continue the discussion :-)

@yuanming-hu yuanming-hu merged commit 0cc4f3c into taichi-dev:master Feb 22, 2020
@k-ye
Member Author

k-ye commented Feb 22, 2020

It is the grid-stride loop trick: https://devblogs.nvidia.com/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/ The goal is to avoid reading loop bounds back from GPU memory.

Oh that's a nice thing to learn about! Does it mean that in the CUDA impl, each Taichi kernel will launch one CUDA kernel, where each CUDA thread could process both 1) multiple elements within a single offloaded task, and 2) multiple offloaded tasks?

Actually, if we pre-allocate a huge chunk of memory for Metal and zero-fill it, then dynamic memory allocation is not impossible.

Right, I remember we've touched on this idea before. If we need to eliminate the dependency on unified memory completely, then this is the only way to support dynamic memory on all backends?

@k-ye k-ye deleted the bit branch February 22, 2020 05:23
@yuanming-hu
Member

  1. multiple elements within a single offloaded task

Yes

  2. multiple offloaded tasks?

No

Right, I remember we've touched on this idea before. If we need to eliminate the dependency on unified memory completely, then this is the only way to support dynamic memory on all backends?

Exactly!! And I'm planning to get rid of unified memory: #412 (comment)

... On the other hand, I don't think this could handle the sparsity case in an efficient way for nested structures ...

You are right. That's why we need a listgen for bitmasks, to get a dense list of active blocks...
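
Roughly, a listgen pass over a bitmask just turns the scattered active bits into a dense index list that later struct-fors iterate over. A toy sketch in host C++ (the names and the std::vector stand-in for ListManager are illustrative only):

    #include <cstdint>
    #include <vector>

    // Scan a 32-bit bitmask and append the indices of active elements to a
    // dense list; later passes then only visit the active elements.
    void listgen_bitmasked(const uint32_t *mask, int num_elements,
                           std::vector<int> &active_list) {
      for (int i = 0; i < num_elements; i++) {
        if ((mask[i / 32] >> (i % 32)) & 1u) {
          active_list.push_back(i);  // on GPU: an atomic append into ListManager
        }
      }
    }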

@k-ye
Member Author

k-ye commented Feb 22, 2020

Hmm.. OK, maybe I should consider supporting listgen then. But then that will definitely cause much more code to be duplicated, because we need to replicate the entire Runtime logic and data structures into Metal X(.

The other idea I had was a heterogeneous backend that mixes the CPU with Metal (or OpenGL): #396 (comment). We use the GPU whenever we can, and fall back to the CPU if necessary...

@yuanming-hu
Member

Hmm.. OK, maybe I should consider supporting listgen then. But then that will definitely cause much more code to be duplicated, because we need to replicate the entire Runtime logic and data structures into Metal X(.

Yeah we should definitely consider the code duplication issue seriously...

The other idea I had was a heterogeneous backend that mixes the CPU with Metal (or OpenGL): #396 (comment). We use the GPU whenever we can, and fall back to the CPU if necessary...

To be honest, I don't think that's a great idea... Using the CPU for listgen will cause a lot of unified memory page faults (across CPU/GPU memory), which will be extremely slow... :-(

@k-ye
Member Author

k-ye commented Feb 22, 2020

I see :( Let me worry about the GUI slowness first then XD

@k-ye k-ye mentioned this pull request Feb 24, 2020