Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-16757: [C++] Remove "scalar" output modality for ScalarKernel implementations, remove ValueDescr class #13521

Merged
merged 9 commits into from
Jul 8, 2022

Conversation

wesm
Copy link
Member

@wesm wesm commented Jul 5, 2022

This was a pretty painful refactoring project, but all for the better long term.

I will do my best to improve the PR summary to explain changes as I go through the PR myself, but the basic summary is that argument shape is no longer part of kernel signatures (or expressions or anything else) -- the "all scalar" evaluation case is handled by promoting scalars to ArraySpan of length 1. This enabled the deletion of many "duplicate" kernel implementations and uncovered some bugs in some kernel implementations as a result of passing many sliced length-1 inputs up-promoted from scalars.

This had painful knock-on effects because some scalar values -- sparse unions and some other non-primitive types -- were not "well-formed" enough to be able to be converted to ArraySpan without needing to allocate memory. As a result, I had to change the requirements (and in the case of sparse unions, the representation to correspond to a "single row view" of a SparseUnionArray -- i.e. the "value" for a sparse union is now a vector of scalars rather than a single scalar, even though only one of the values in the vector is the "active" scalar) of some scalar values to not allow null values. We should / could consider implementing scalars altogether internally as arrays of length 1 but we can discuss that as a follow up.

Some other interesting stuff in here is the new experimental arrow::TypeHolder struct which allows us to pass around either const DataType* or std::shared_ptr<DataType> in contexts where type ownership isn't strictly necessary. So now we can deal only with raw pointers when doing type signature checking / kernel dispatching, for example. We should look at refactoring other code where we deal with shared_ptr<DataType> (but don't necessarily need to preserve ownership) to use TypeHolder.

There's some unfinished business here:

  • The GLib bindings must be refactored to follow this work

@github-actions
Copy link

github-actions bot commented Jul 5, 2022

for (const auto& value : this->values) {
result.emplace_back(value.descr());
result.emplace_back(value.type());
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are several places where we can probably drop to construct from all const DataType* which is less expensive, this is one of them (I need to check that it doesn't cause tests to fail)

ArraySpan array;
const Scalar* scalar;
const Scalar* scalar = NULLPTR;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I applied Antoine's suggested optimization

@wesm
Copy link
Member Author

wesm commented Jul 5, 2022

@kou if my proposed solution to the SparseUnionScalar issue does not cause too much controversy (maybe wait a day or two for @pitrou or @lidavidm to look), I would appreciate your assistance refactoring the glib bindings to follow

@wesm
Copy link
Member Author

wesm commented Jul 6, 2022

I was able to fix the R unit tests (it was just error messages that don't have shapes in them anymore), so I'll work on getting the other CI jobs to fail and then I think the glib stuff is that only thing that will need fixing

@kou
Copy link
Member

kou commented Jul 6, 2022

@wesm Sure! I'll do it.

@wesm wesm force-pushed the ARROW-16757-scalar-output-modality branch from 74d08eb to 908d88b Compare July 6, 2022 11:51
const std::shared_ptr<Scalar>& value) {
return value->type;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is rather complicated but it saves us a good amount of boilerplate to implement serialization/ToString/equality for all the options classes. Not sure if there's a better way (more codegen?)

@wesm
Copy link
Member Author

wesm commented Jul 6, 2022

This works. I suppose if we really wanted, we could keep a static buffer of zeroes and construct the span from slices of that based on type but that'd be quite a bit of work for one particular case.

Re: SparseUnionScalar — unfortunately we can't use a static buffer of zeros because child array types can have arbitrarily large values (e.g. fixed_size_binary can be arbitrarily large, or fixed_size_list, etc.)

@wesm
Copy link
Member Author

wesm commented Jul 6, 2022

58/70 Test #59: arrow-python-test .........................***Failed 0.22 sec
RuntimeError: module compiled against API version 0x10 but this version of numpy is 0xf

This is an odd failure in the MinGW 32/64 C++ builds, anyone know what that's about?

@lidavidm
Copy link
Member

lidavidm commented Jul 6, 2022

58/70 Test #59: arrow-python-test .........................***Failed 0.22 sec
RuntimeError: module compiled against API version 0x10 but this version of numpy is 0xf

This is an odd failure in the MinGW 32/64 C++ builds, anyone know what that's about?

@amol- you mentioned you noticed this earlier - did you figure out what happened there?

@kou
Copy link
Member

kou commented Jul 7, 2022

I've pushed changes for GLib.

58/70 Test #59: arrow-python-test .........................***Failed 0.22 sec
RuntimeError: module compiled against API version 0x10 but this version of numpy is 0xf

This is an odd failure in the MinGW 32/64 C++ builds, anyone know what that's about?

I didn't notice it but it's not related to this change.

I think that it's caused by version change of NumPy installed by MSYS2 package manager and ccache.
MSYS2's NumPy package was updated to 1.23.0-1 from 1.22.4-3 recently. And this was caused from the update. (I think that https://github.com/apache/arrow/runs/7214056952?check_suite_focus=true is the first fail.)
It seems that ccache still uses build results with NumPy 1.22.4-3.

It seems that we can clear GitHub Actions cache manually by GitHub Actions Cache API: https://docs.github.com/en/rest/actions/cache
I'll try clearing GitHub Actions cache for MinGW 32/64 C++ builds.

@kou
Copy link
Member

kou commented Jul 7, 2022

I'll try clearing GitHub Actions cache for MinGW 32/64 C++ builds.

I couldn't do this...

$ curl --verbose -X DELETE -H "Authorization: token ..." -H "Accept: application/vnd.github.v3+json" https://api.github.com/repos/apache/arrow/actions/caches/cpp-ccache-mingw64-cfa1684ec4da6404fc0c738c9563a75ca1148a7ca2e6fdfc6995e2833bbd0979
...
{
  "message": "Not Found",
  "documentation_url": "https://docs.github.com/rest/actions/cache#delete-a-github-actions-cache-for-a-repository-using-a-cache-id"
}

I think that we need to add NumPy version to cache key.

@kou
Copy link
Member

kou commented Jul 7, 2022

I think that we need to add NumPy version to cache key.

#13534

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I really like this and the scalar struct changes are acceptable IMHO.
Here are a bunch of assorted comments.

class UnionScalarPrinter(ScalarPrinter):
class SparseUnionScalarPrinter(ScalarPrinter):
"""
Pretty-printer for arrow::UnionScalar and subclasses.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Pretty-printer for arrow::UnionScalar and subclasses.
Pretty-printer for arrow::SparseUnionScalar.

return (f"{self._format_type()} of type {self.type}, "
f"type code {type_code}, null value")
eval_values = StdVector(self.val['value'])
child_id = self.val['child_id'].cast(gdb.lookup_type('int'))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the cast is needed as it's already a C int?

(note: calls to gdb evaluation of C stuff can be costly, so this is not just pedantic)




class DenseUnionScalarPrinter(ScalarPrinter):
"""
Pretty-printer for arrow::UnionScalar and subclasses.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Pretty-printer for arrow::UnionScalar and subclasses.
Pretty-printer for arrow::DenseUnionScalar and subclasses.

wesm added 8 commits July 7, 2022 14:03
More refactoring

More refactoring

More refactoring

Checkpoint, still some ValueDescr to remove

Clean up and delete some more scalar output code

More refactoring

More refactoring

More refactoring, code cleaning

more cleaning

More cleaning

checkpoint

Get everything compiling again

More refactoring

exec.cc refactoring, get compiling again

Handle scalar -> array span promotions

Work to make all scalars more well formed

Make union scalars more 'well formed'

All things compiling again

Fix some more stuff

Fix some more tedious errors

Fix some more things

Fixed MapBuilder::AppendArraySlice bug

checkpoint

Fix more bugs

C++ tests passing again, restore cumulative_sum scalar tests
@wesm wesm force-pushed the ARROW-16757-scalar-output-modality branch from 1509ef8 to 89ab44e Compare July 7, 2022 20:19
@wesm
Copy link
Member Author

wesm commented Jul 7, 2022

Thanks everyone. I'll merge this once the CI passes so I can get busy with the next items

@kou
Copy link
Member

kou commented Jul 8, 2022

It seems that my change was lost by force-push. I'll add it again.

@wesm
Copy link
Member Author

wesm commented Jul 8, 2022

Oh sorry yes that’s my mistake

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants