-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-16757: [C++] Remove "scalar" output modality for ScalarKernel implementations, remove ValueDescr class #13521
Conversation
for (const auto& value : this->values) { | ||
result.emplace_back(value.descr()); | ||
result.emplace_back(value.type()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are several places where we can probably drop to construct from all const DataType*
which is less expensive, this is one of them (I need to check that it doesn't cause tests to fail)
ArraySpan array; | ||
const Scalar* scalar; | ||
const Scalar* scalar = NULLPTR; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I applied Antoine's suggested optimization
I was able to fix the R unit tests (it was just error messages that don't have shapes in them anymore), so I'll work on getting the other CI jobs to fail and then I think the glib stuff is that only thing that will need fixing |
@wesm Sure! I'll do it. |
74d08eb
to
908d88b
Compare
const std::shared_ptr<Scalar>& value) { | ||
return value->type; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is rather complicated but it saves us a good amount of boilerplate to implement serialization/ToString/equality for all the options classes. Not sure if there's a better way (more codegen?)
Re: SparseUnionScalar — unfortunately we can't use a static buffer of zeros because child array types can have arbitrarily large values (e.g. fixed_size_binary can be arbitrarily large, or fixed_size_list, etc.) |
This is an odd failure in the MinGW 32/64 C++ builds, anyone know what that's about? |
@amol- you mentioned you noticed this earlier - did you figure out what happened there? |
I've pushed changes for GLib.
I didn't notice it but it's not related to this change. I think that it's caused by version change of NumPy installed by MSYS2 package manager and ccache. It seems that we can clear GitHub Actions cache manually by GitHub Actions Cache API: https://docs.github.com/en/rest/actions/cache |
I couldn't do this... $ curl --verbose -X DELETE -H "Authorization: token ..." -H "Accept: application/vnd.github.v3+json" https://api.github.com/repos/apache/arrow/actions/caches/cpp-ccache-mingw64-cfa1684ec4da6404fc0c738c9563a75ca1148a7ca2e6fdfc6995e2833bbd0979
...
{
"message": "Not Found",
"documentation_url": "https://docs.github.com/rest/actions/cache#delete-a-github-actions-cache-for-a-repository-using-a-cache-id"
} I think that we need to add NumPy version to cache key. |
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I really like this and the scalar struct changes are acceptable IMHO.
Here are a bunch of assorted comments.
class UnionScalarPrinter(ScalarPrinter): | ||
class SparseUnionScalarPrinter(ScalarPrinter): | ||
""" | ||
Pretty-printer for arrow::UnionScalar and subclasses. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty-printer for arrow::UnionScalar and subclasses. | |
Pretty-printer for arrow::SparseUnionScalar. |
return (f"{self._format_type()} of type {self.type}, " | ||
f"type code {type_code}, null value") | ||
eval_values = StdVector(self.val['value']) | ||
child_id = self.val['child_id'].cast(gdb.lookup_type('int')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think the cast is needed as it's already a C int
?
(note: calls to gdb evaluation of C stuff can be costly, so this is not just pedantic)
|
||
|
||
|
||
class DenseUnionScalarPrinter(ScalarPrinter): | ||
""" | ||
Pretty-printer for arrow::UnionScalar and subclasses. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty-printer for arrow::UnionScalar and subclasses. | |
Pretty-printer for arrow::DenseUnionScalar and subclasses. |
More refactoring More refactoring More refactoring Checkpoint, still some ValueDescr to remove Clean up and delete some more scalar output code More refactoring More refactoring More refactoring, code cleaning more cleaning More cleaning checkpoint Get everything compiling again More refactoring exec.cc refactoring, get compiling again Handle scalar -> array span promotions Work to make all scalars more well formed Make union scalars more 'well formed' All things compiling again Fix some more stuff Fix some more tedious errors Fix some more things Fixed MapBuilder::AppendArraySlice bug checkpoint Fix more bugs C++ tests passing again, restore cumulative_sum scalar tests
1509ef8
to
89ab44e
Compare
Thanks everyone. I'll merge this once the CI passes so I can get busy with the next items |
It seems that my change was lost by force-push. I'll add it again. |
Oh sorry yes that’s my mistake |
This was a pretty painful refactoring project, but all for the better long term.
I will do my best to improve the PR summary to explain changes as I go through the PR myself, but the basic summary is that argument shape is no longer part of kernel signatures (or expressions or anything else) -- the "all scalar" evaluation case is handled by promoting scalars to ArraySpan of length 1. This enabled the deletion of many "duplicate" kernel implementations and uncovered some bugs in some kernel implementations as a result of passing many sliced length-1 inputs up-promoted from scalars.
This had painful knock-on effects because some scalar values -- sparse unions and some other non-primitive types -- were not "well-formed" enough to be able to be converted to ArraySpan without needing to allocate memory. As a result, I had to change the requirements (and in the case of sparse unions, the representation to correspond to a "single row view" of a SparseUnionArray -- i.e. the "value" for a sparse union is now a vector of scalars rather than a single scalar, even though only one of the values in the vector is the "active" scalar) of some scalar values to not allow null values. We should / could consider implementing scalars altogether internally as arrays of length 1 but we can discuss that as a follow up.
Some other interesting stuff in here is the new experimental
arrow::TypeHolder
struct which allows us to pass around eitherconst DataType*
orstd::shared_ptr<DataType>
in contexts where type ownership isn't strictly necessary. So now we can deal only with raw pointers when doing type signature checking / kernel dispatching, for example. We should look at refactoring other code where we deal withshared_ptr<DataType>
(but don't necessarily need to preserve ownership) to use TypeHolder.There's some unfinished business here: