-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-9390: [C++][Doc] Review compute function names #7755
Conversation
Also cc @xhochy @nealrichardson |
Valgrind works fine on this PR. |
Modified function names: * minmax -> min_max * binary_isascii -> string_isascii (only works on string types) * ascii_length -> binary_length (also make it work on binary types) * binary_contains_exact -> match_substring (other possibility: has_substring ?) * match -> index_in * isin -> is_in * list_value_lengths -> list_value_length * partition_indices -> partition_nth_indices (other kinds of partitioning would be possible, e.g. using a predicate) Document string predicate functions (ARROW-9444). Also fix the allocation of IsValid output buffer in certain cases.
0f1f905
to
66c0824
Compare
Rebased and fixed CI (hopefully). |
On the PR (#7593 (comment)), @xhochy mentioned that the "exact" part was important to him |
Well, "substring" clearly implies "exact". As for case-sensitivity, I hope that can be a flag in the options, because I don't think we want a combinatorial explosion of string matching functions. |
+1, that's a much clearer name IMO (although the R people might disagree ;-)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for doing this. Some minor comments which can be addressed as follow up also
auto func = std::make_shared<ScalarFunction>("binary_contains_exact", Arity::Unary()); | ||
auto exec_32 = BinaryContainsExact<StringType>::Exec; | ||
auto exec_64 = BinaryContainsExact<LargeStringType>::Exec; | ||
void AddMatchSubstring(FunctionRegistry* registry) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is fine, we can use match_substring_case_insensitive
for the case insensitive version
cc @xhochy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is fine with me.
Does this conflict with 1d7d919 which I just merged? |
No, it's rebased already. |
AppVeyor build: https://ci.appveyor.com/project/pitrou/arrow/builds/34089592 |
I think this can be merged. Any concerns? |
Nope. +1 here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I didn't get to this before this merged, but a couple of thoughts.
@@ -944,7 +944,7 @@ void RegisterScalarStringAscii(FunctionRegistry* registry) { | |||
MakeUnaryStringBatchKernel<AsciiUpper>("ascii_upper", registry); | |||
MakeUnaryStringBatchKernel<AsciiLower>("ascii_lower", registry); | |||
|
|||
AddUnaryStringPredicate<IsAscii>("binary_isascii", registry); | |||
AddUnaryStringPredicate<IsAscii>("string_isascii", registry); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why isn't this string_is_ascii
? It seems we prefer is_x
to isx
(cf. isin
-> is_in
in this PR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. Should we make a followup PR?
(if it's me, it's only tomorrow)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can make this change today unless someone else wants to
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also ascii_is_alnum
, etc. ;-)
UnaryStringBenchmark(state, "binary_contains_exact", &options); | ||
static void MatchSubstring(benchmark::State& state) { | ||
MatchSubstringOptions options("abac"); | ||
UnaryStringBenchmark(state, "match_substring", &options); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there clear criteria for when the function gets a binary_
or string_
prefix and when it doesn't?
Related, if we haven't documented the naming conventions (haven't seen it yet but I'll keep reading), we should so that future developers/selves know what to call new functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. Personally, I find prefixes a bit pointless or even distracting.
(or, here, plain misleading, since "binary_contains_exact" didn't work on binary arrays, only string arrays...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps someone else wants to think about a convention?
Followup from PR apache#7755: a new doc file wasn't checked in, and Sphinx only emits a warning.
Followup from PR #7755: a new doc file wasn't checked in, and Sphinx only emits a warning. Closes #7760 from pitrou/ARROW-9390-missing-file Authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
Modified function names:
(other possibility: has_substring ?)
(other kinds of partitioning would be possible, e.g. using a predicate)
Document string predicate functions (ARROW-9444).
Also fix the allocation of IsValid output buffer in certain cases.