[FEA] Apply `sliced_child` when calling to `slice` #9219
Comments
This would break strings columns. And maybe lists? So this would only apply to structs, which would be pretty inconsistent behavior.
Another related-similar issue is not just I'm not sure if this is also the case for strings column.
Yes, it is. Otherwise it would require actually modifying device memory to update all of the values of the offset column to be relative to the offset of the parent. This is obviously expensive and not something we want to do except in cases where it is explicitly required.
I see. It seems that deep slicing is necessary to avoid bugs, but it is expensive, so we try to avoid it. I have an idea: apply the lazy evaluation concept: we do deep slicing but do not initialize the
Of course, the actual implementation is more complex than this. But the idea above can:
Note that this is similar to caching the scalar value that we mentioned before.
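One possible reading of the lazy-evaluation idea above is sketched below. It is not libcudf code: `Range`, `LazySlicedView`, and `slices_computed` are hypothetical names, and the sketch assumes a struct-like column whose children cover the same rows as the parent. Each sliced child is only computed the first time it is requested and is then served from a cache.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical row range of a column view: first row and row count.
struct Range {
  std::size_t offset;
  std::size_t size;
};

// Hypothetical lazily sliced view (struct-like: children share the parent's rows).
class LazySlicedView {
 public:
  LazySlicedView(Range self, std::vector<Range> unsliced_children)
      : self_(self),
        unsliced_(std::move(unsliced_children)),
        ready_(unsliced_.size(), false),
        cache_(unsliced_.size(), Range{0, 0}) {}

  // Slice child `i` on first access, then reuse the cached result.
  const Range& child(std::size_t i) const {
    assert(i < unsliced_.size());
    if (!ready_[i]) {
      cache_[i] = Range{unsliced_[i].offset + self_.offset, self_.size};
      ready_[i] = true;
      ++slices_computed_;
    }
    return cache_[i];
  }

  // Number of child slices actually computed so far.
  std::size_t slices_computed() const { return slices_computed_; }

 private:
  Range self_;
  std::vector<Range> unsliced_;
  mutable std::vector<bool> ready_;
  mutable std::vector<Range> cache_;
  mutable std::size_t slices_computed_ = 0;
};
```

Children that are never accessed cost nothing beyond the bookkeeping. The cache in this sketch is not thread-safe, which is one of the complications a real implementation would have to handle.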
Deep slicing is not always necessary. We could add a specific statement to the developer guide that child columns are not sliced. This kind of coding error would ideally be caught by an appropriate gtest or during a PR review.
Unfortunately, it's more complicated than that. When a strings/list column is sliced, the values of the offsets column remain unchanged. That is, the values of the offsets column are still relative to the unsliced version of the parent. For example, a strings column:
If you were to slice off the last two elements of this strings column you'd have:
The values of the offsets column are still relative to the original unsliced column. So it's not just a matter of changing the singular offset of the
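The offsets behavior described above can be illustrated with a toy model. The sketch below is not libcudf code: `ToyStringsView`, `shallow_slice`, `get`, and `get_ignoring_parent_offset` are hypothetical names chosen for illustration. It shows that a shallow slice leaves the offsets child untouched, so element access has to add the parent's offset when indexing into it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical toy stand-in for a strings column view (not the libcudf type).
struct ToyStringsView {
  const std::vector<int>* offsets;  // child: one entry per unsliced row, plus one
  const std::string* chars;         // child: concatenated character data
  std::size_t offset;               // first row of this view within the parent
  std::size_t size;                 // number of rows in this view
};

// Shallow slice: only the parent's offset and size change; children are shared.
ToyStringsView shallow_slice(const ToyStringsView& v, std::size_t begin, std::size_t end) {
  assert(begin <= end && end <= v.size);
  return ToyStringsView{v.offsets, v.chars, v.offset + begin, end - begin};
}

// Correct access: index the offsets child relative to the parent's offset.
std::string get(const ToyStringsView& v, std::size_t i) {
  assert(i < v.size);
  auto first = static_cast<std::size_t>((*v.offsets)[v.offset + i]);
  auto last = static_cast<std::size_t>((*v.offsets)[v.offset + i + 1]);
  return v.chars->substr(first, last - first);
}

// Buggy access: treats the offsets child as if it had been sliced too.
std::string get_ignoring_parent_offset(const ToyStringsView& v, std::size_t i) {
  assert(i < v.size);
  auto first = static_cast<std::size_t>((*v.offsets)[i]);
  auto last = static_cast<std::size_t>((*v.offsets)[i + 1]);
  return v.chars->substr(first, last - first);
}
```

With made-up data, offsets `{0, 1, 3, 6, 10}` over the characters `"abbcccdddd"`, slicing rows `[2, 4)` keeps both children as they were. `get` on row 0 of the slice reads offsets 3 and 6, while the buggy accessor reads offsets 0 and 1 and returns a row that is not part of the slice at all.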
I see. This is really context-dependent 😞
One idea would be to add a I don't think that would solve many of the problems you highlighted above, but still seems like a useful thing that would alleviate similar kinds of problems.
I observe that there are a lot of bugs related to situations in which an API directly accesses the child columns of a sliced column instead of calling `get_sliced_child`. As such, the `slice` API is a kind of shallow slice, not a deep slice. Although a shallow slice may be more efficient, since it avoids unnecessary slicing of the child columns when we don't care about them, it has caused a lot of (potential) bugs that cost a lot of developer time.

An instance of such bugs is here: #9218. In the past, I have also dealt with many similar situations, but I could catch them immediately through unit tests. If a developer forgets to write unit tests for sliced input, the bug may go unnoticed.

I would like to rewrite `slice` into deep slicing, i.e., recursively calling `slice` on all child columns of the column being sliced. This way, when we access its child columns through the APIs `child_begin()`, `child_end()`, or `child(idx)`, we will get the expected results all the time. Although we have talked about this before and didn't do anything because deep slicing is expensive, I decided to raise the issue again as it still causes bugs.

An alternative solution is to rename the existing `slice` API to `shallow_slice`, then add another `slice` version that recursively calls `shallow_slice` on the child columns. A developer would then only call `shallow_slice` if they know exactly that just the shallow version is needed in the context. Otherwise, the more expensive `slice` version will produce correct results in most situations.
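The difference between the two proposed APIs can be sketched on a toy model. This is not libcudf code: `ToyView`, `shallow_slice`, and `deep_slice` are hypothetical, and the sketch assumes a struct-like column whose children cover exactly the same rows as the parent. Columns with an offsets child (strings, lists) would need different handling.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical toy view over a struct-like column (not the libcudf column_view).
struct ToyView {
  std::size_t offset;             // first row of this view
  std::size_t size;               // number of rows in this view
  std::vector<ToyView> children;  // assumed to cover the same rows as the parent
};

// Shallow slice: adjust the parent only; children are copied unchanged.
ToyView shallow_slice(const ToyView& v, std::size_t begin, std::size_t end) {
  assert(begin <= end && end <= v.size);
  ToyView out = v;
  out.offset = v.offset + begin;
  out.size = end - begin;
  return out;
}

// Deep slice: shallow-slice the parent, then recurse into every child.
ToyView deep_slice(const ToyView& v, std::size_t begin, std::size_t end) {
  ToyView out = shallow_slice(v, begin, end);
  for (ToyView& child : out.children) {
    child = deep_slice(child, begin, end);
  }
  return out;
}
```

After `deep_slice`, reading a child directly already reflects the sliced rows, at the cost of visiting every descendant. After `shallow_slice`, the children still describe the unsliced rows, which is the situation that leads to the bugs described above.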