Fix how num_elements is computed in Dense node #512
Conversation
BTW, I'd like to support
Nice catch! I made a mistake there... Unfortunately, the tests are not strong enough to trigger this...
Please do not feel pressured at all - I believe the first rule of open source contribution is to have fun :-) Also please take care in Tokyo - there seems to be a risk that the coronavirus will break out... I'm traveling at the moment so my response might be slightly delayed.
Yeah... is
Thanks for saying that...! I'm thinking of heading back to China, which IMO is a lot safer right now 😂
It is, but rather untested...
Not sure if I clearly get your point, but I'm thinking about your
Hm, IMO the most standard way to write this is probably to call the getter of
Yeah, 32-bit is reasonable. Actually, I'm not sure if the current 64-bit
Yeah, rewriting them in Metal sounds easier to me. This is a bit unfortunate since there's some code duplication...
This is a fancy solution. I wouldn't worry too much about amalgamation (see https://github.com/taichi-dev/taichi/blob/legacy/misc/amalgamate.py). However, is the Metal language a superset of
Metal is a subset of C++14, with some keyword extensions... So I guess for amalgamation, instead of having
(Sorry about my slow reply) Yeah, my feeling is that the complexity of the amalgamation approach will only pay off if the C-style data structure definitions can be used for more than one backend. I can imagine that as we get more and more backends, maintaining the source code of these SNodes would be a disaster. Fortunately, most source-to-source backends use a C-like language, so reusing a single header file sounds like the only way to keep things DRY.
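As a rough illustration of that single-header idea (the struct layout and function body below are placeholders, not the actual Taichi runtime definitions), the point is that plain C-style declarations can be consumed both by the LLVM runtime via `#include` and by a C-like source-to-source backend by embedding the same file as a string during codegen:

```cpp
// Hypothetical shared header: plain C-style SNode metadata that multiple
// backends could reuse, so the layout is defined exactly once.
struct DenseMeta {
  int element_size;   // bytes per child element
  int num_elements;   // number of child slots in this dense node
};

static inline int Dense_get_num_elements(DenseMeta *meta) {
  return meta->num_elements;
}
```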
It is the grid-stride loop trick: https://devblogs.nvidia.com/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/ The goal is to avoid reading loop bounds from GPU memory. If you have lower/upper bounds that are results of the previous kernel, you don't really want to do a device synchronization to read them back to CPU memory just so that you know the gridDim (which depends on your previous kernel) of your current kernel.
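For reference, a minimal sketch of the grid-stride pattern from that post (illustrative only, not Taichi's actual generated code): the grid size is fixed at launch time, and each thread strides over the whole range, so the kernel works for any `n` without reading loop bounds back to the host.

```cuda
// Canonical grid-stride loop: one fixed-size grid covers an arbitrary range.
__global__ void saxpy(int n, float a, const float *x, float *y) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x;
       i < n;
       i += blockDim.x * gridDim.x) {
    y[i] = a * x[i] + y[i];
  }
}

// Launch with a grid size chosen independently of n, e.g.:
//   saxpy<<<128, 256>>>(n, 2.0f, x, y);
```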
ListManager is a low-cost abstraction of a (very) long list. It can contain at most
Actually, if we pre-allocate a huge chunk of memory for Metal and zero-fill it, then dynamic memory allocation is not impossible. The only downside is that we'll have a non-adaptive memory consumption and the user is in charge of computing a memory usage upper bound. This is not too bad in many cases.
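A minimal sketch of what such a pre-allocated pool could look like (the names and the atomic bump-allocation scheme are assumptions, not the actual Metal runtime):

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical bump allocator over one pre-allocated, zero-filled buffer.
// Allocation never grows the buffer: once the user-chosen upper bound is
// exhausted, allocate() fails, which is the trade-off described above.
struct PreallocatedPool {
  char *base = nullptr;                // start of the big pre-allocated chunk
  std::size_t capacity = 0;            // fixed upper bound chosen by the user
  std::atomic<std::size_t> cursor{0};  // bump pointer

  char *allocate(std::size_t bytes) {
    std::size_t offset = cursor.fetch_add(bytes);
    return (offset + bytes <= capacity) ? base + offset : nullptr;
  }
};
```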
Yeah you are right. I assume the LLVM inliner/optimizer will figure this out automatically.
Merging this in but feel free to continue the discussions :-)
Oh that's a nice thing to learn about! Does it mean that in the CUDA impl, each Taichi kernel will launch one CUDA kernel, where each CUDA thread could process both 1) multiple elements within a single offloaded task, and 2) multiple offloaded tasks?
Right, I remember we've touched on this idea before. If we need to eliminate the dependency on unified memory completely, then this is the only way to support dynamic memory on all backends?
Yes
No
Exactly!! And I'm planning to get rid of unified memory: #412 (comment)
You are right. That's why we need a listgen for bitmasks, to get a dense list of active blocks...
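A rough illustration of what such list generation does (the function name and the CPU-style loop are assumptions for clarity, not the actual runtime implementation): scan the bitmask once and append the indices of active blocks to a dense list, so later kernels only loop over active blocks instead of the whole domain.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical "listgen" over a bitmask of blocks.
void listgen_bitmasked(const std::uint32_t *bitmask, int num_blocks,
                       std::vector<int> &active_blocks) {
  for (int b = 0; b < num_blocks; b++) {
    if (bitmask[b / 32] & (1u << (b % 32))) {
      active_blocks.push_back(b);  // block b is active; record its index
    }
  }
}
```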
Hmm.. OK, maybe I should consider supporting listgen then. But then that will definitely cause much more code duplication, because we need to replicate the entire Runtime logic and data structures into Metal X(. The other idea I had was a heterogeneous backend that mixes CPU with Metal (or OpenGL) #396 (comment). We use the GPU whenever we can, and fall back to CPU if necessary...
Yeah we should definitely consider the code duplication issue seriously...
To be honest I don't think that's a great idea... Using CPU for listgen will cause a lot of unified memory page faults (across CPU/GPU memory), which will be extremely slow... :-(
I see :( Let me worry about the GUI slowness first then XD
It seems like `num_elements` was miscomputed as `element_size` to me, so I'd like you to help verify if this fix is correct. BTW, I guess the canonical way to write this should be ... But then it seems that `StructMeta::get_num_elements` just points to `Dense_get_num_elements` for `dense` SNodes?

(Sorry for the reduced work these days, I'm busy following the news on COVID-19 here in Tokyo 😂)