Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-9390: [C++][Doc] Review compute function names #7755

Closed
wants to merge 3 commits into from

Conversation

pitrou
Copy link
Member

@pitrou pitrou commented Jul 14, 2020

Modified function names:

  • minmax -> min_max
  • binary_isascii -> string_isascii (only works on string types)
  • ascii_length -> binary_length (also make it work on binary types)
  • binary_contains_exact -> match_substring
    (other possibility: has_substring ?)
  • match -> index_in
  • isin -> is_in
  • list_value_lengths -> list_value_length
  • partition_indices -> partition_nth_indices
    (other kinds of partitioning would be possible, e.g. using a predicate)

Document string predicate functions (ARROW-9444).

Also fix the allocation of IsValid output buffer in certain cases.

@pitrou pitrou requested a review from wesm July 14, 2020 13:54
@pitrou
Copy link
Member Author

pitrou commented Jul 14, 2020

Also cc @xhochy @nealrichardson

@pitrou
Copy link
Member Author

pitrou commented Jul 14, 2020

Valgrind works fine on this PR.

Modified function names:
* minmax -> min_max
* binary_isascii -> string_isascii (only works on string types)
* ascii_length -> binary_length (also make it work on binary types)
* binary_contains_exact -> match_substring
  (other possibility: has_substring ?)
* match -> index_in
* isin -> is_in
* list_value_lengths -> list_value_length
* partition_indices -> partition_nth_indices
  (other kinds of partitioning would be possible, e.g. using a predicate)

Document string predicate functions (ARROW-9444).

Also fix the allocation of IsValid output buffer in certain cases.
@pitrou pitrou force-pushed the ARROW-9390-compute-func-names branch from 0f1f905 to 66c0824 Compare July 14, 2020 14:18
@pitrou
Copy link
Member Author

pitrou commented Jul 14, 2020

Rebased and fixed CI (hopefully).

@jorisvandenbossche
Copy link
Member

binary_contains_exact -> match_substring (other possibility: has_substring ?)

On the PR (#7593 (comment)), @xhochy mentioned that the "exact" part was important to him

@pitrou
Copy link
Member Author

pitrou commented Jul 14, 2020

Well, "substring" clearly implies "exact". As for case-sensitivity, I hope that can be a flag in the options, because I don't think we want a combinatorial explosion of string matching functions.

@jorisvandenbossche
Copy link
Member

match -> index_in

+1, that's a much clearer name IMO (although the R people might disagree ;-))

@github-actions
Copy link

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this. Some minor comments which can be addressed as follow up also

auto func = std::make_shared<ScalarFunction>("binary_contains_exact", Arity::Unary());
auto exec_32 = BinaryContainsExact<StringType>::Exec;
auto exec_64 = BinaryContainsExact<LargeStringType>::Exec;
void AddMatchSubstring(FunctionRegistry* registry) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is fine, we can use match_substring_case_insensitive for the case insensitive version

cc @xhochy

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is fine with me.

@wesm
Copy link
Member

wesm commented Jul 14, 2020

Does this conflict with 1d7d919 which I just merged?

@pitrou
Copy link
Member Author

pitrou commented Jul 14, 2020

No, it's rebased already.

@pitrou
Copy link
Member Author

pitrou commented Jul 14, 2020

@pitrou
Copy link
Member Author

pitrou commented Jul 14, 2020

I think this can be merged. Any concerns?

@wesm
Copy link
Member

wesm commented Jul 14, 2020

Nope. +1 here

@pitrou pitrou closed this in 10289a0 Jul 14, 2020
@pitrou pitrou deleted the ARROW-9390-compute-func-names branch July 14, 2020 16:05
Copy link
Member

@nealrichardson nealrichardson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I didn't get to this before this merged, but a couple of thoughts.

@@ -944,7 +944,7 @@ void RegisterScalarStringAscii(FunctionRegistry* registry) {
MakeUnaryStringBatchKernel<AsciiUpper>("ascii_upper", registry);
MakeUnaryStringBatchKernel<AsciiLower>("ascii_lower", registry);

AddUnaryStringPredicate<IsAscii>("binary_isascii", registry);
AddUnaryStringPredicate<IsAscii>("string_isascii", registry);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why isn't this string_is_ascii? It seems we prefer is_x to isx (cf. isin -> is_in in this PR).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. Should we make a followup PR?
(if it's me, it's only tomorrow)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can make this change today unless someone else wants to

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also ascii_is_alnum, etc. ;-)

UnaryStringBenchmark(state, "binary_contains_exact", &options);
static void MatchSubstring(benchmark::State& state) {
MatchSubstringOptions options("abac");
UnaryStringBenchmark(state, "match_substring", &options);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there clear criteria for when the function gets a binary_ or string_ prefix and when it doesn't?

Related, if we haven't documented the naming conventions (haven't seen it yet but I'll keep reading), we should so that future developers/selves know what to call new functions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. Personally, I find prefixes a bit pointless or even distracting.
(or, here, plain misleading, since "binary_contains_exact" didn't work on binary arrays, only string arrays...)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps someone else wants to think about a convention?

pitrou added a commit to pitrou/arrow that referenced this pull request Jul 14, 2020
Followup from PR apache#7755: a new doc file wasn't checked in, and Sphinx only emits a warning.
pitrou added a commit that referenced this pull request Jul 14, 2020
Followup from PR #7755: a new doc file wasn't checked in, and Sphinx only emits a warning.

Closes #7760 from pitrou/ARROW-9390-missing-file

Authored-by: Antoine Pitrou <[email protected]>
Signed-off-by: Antoine Pitrou <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants