Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DISCUSS] organization of functions #9100

Closed
alamb opened this issue Feb 1, 2024 · 11 comments
Closed

[DISCUSS] organization of functions #9100

alamb opened this issue Feb 1, 2024 · 11 comments
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Feb 1, 2024

Is your feature request related to a problem or challenge?

As we break the function library out of the DataFusion core as part of #8045, we need to organize it somehow

As pointed out by @viirya on #8705 (comment) there are many potential ways to organize the data fusion function library. Thus I would like to get some consensus on how we want the organization to look before creating tickets and starting to crank it out

Describe the solution you'd like

Here is a proposal for how the functions are organized.

math functions

  • feature_flag (new): math_expressions
  • code location: datafusion/functions/src/math
  • Abs, Acos, Asin, Atan, Atan2, Acosh, Asinh, Atanh, Cbrt, Ceil, Cos, Cosh, Degrees, Exp, Factorial, Floor, Gcd, Lcm, Ln, Log, Log10, Log2, Pi, Power, Radians, Signum, Sin, Sinh, Sqrt, Tan, Tanh, Trunc, Cot, Round, iszero
  • Isnan: is the value NaN
  • Nanvl: return the first non-NaN value

array functions

Given the size and specialization of these functions I propose putting them into their own subcrate

  • feature_flag (new): array_expressions
  • code location: datafusion/functions-array/src/math
  • ArrayAppend, ArraySort, ArrayConcat, ArrayHas, ArrayHasAll, ArrayHasAny, ArrayPopFront, ArrayPopBack, ArrayDims, ArrayDistinct, ArrayElement, ArrayEmpty, ArrayLength, ArrayNdims, ArrayPosition, ArrayPositions, ArrayPrepend, ArrayRemove, ArrayRemoveN, ArrayRemoveAll, ArrayRepeat, ArrayReplace, ArrayReplaceN, ArrayReplaceAll, ArraySlice, ArrayToString, ArrayIntersect, ArrayUnion, ArrayExcept, Cardinality, ArrayResize, Flatten, Range, StringToArray,
  • MakeArray: construct an array from columns (union/except depends on this)

Core functions

These functions are always available as they are used for internal purposes (like implementing [1,2,3] syntax in SQL or so commonly used that it is not worth having a feature flag)

  • feature_flag: NONE
  • code location: datafusion/functions/src/core or similar
  • Coalesce: return the first non-null value
  • Struct: Create a struct
  • NullIf: return null if the two values are equal
  • Random: return a random number
  • ArrowTypeOf: return the arrow type of a value

Crypto functions

  • feature_flag (existing): crypto_expressions
  • code location: datafusion/functions/src/crypto
  • Digest, MD5, SHA224, SHA256, SHA384, SHA512

String functions

  • feature_flag (new): string_expressions
  • code location: datafusion/functions/src/string
  • ascii, bit_length, btrim, chr, concat, concat_ws, ends_with, initcap, instr, lower, ltrim, octet_length, repeat, replace, rtrim, split_part, starts_with, to_hex, trim, upper, levenshtein, uuid, overlay

Unicode string functions

These expressions need an additioanl dependency, which is why they have a different flag)

  • feature_flag (existing): unicode_expressions
  • code location: datafusion/functions/src/string/unicode
  • CharLength, Left, Lpad, Reverse, Right, Rpad, Strpos, Substr, Translate, SubstrIndex, FindInSet

regex functions

  • feature_flag (existing): regex_expressions
  • code location: datafusion/functions/src/regexp
  • RegexpMatch, RegexpReplace

date time function

  • feature_flag (new): datetime_expressions
  • code location: datafusion/functions/src/datetime
  • date_part, date_trunc, date_bin, to_timestamp, to_timestamp_millis, to_timestamp_micros, to_timestamp_nanos, to_timestamp_seconds, from_unixtime, now, current_date, current_time

Describe alternatives you've considered

We can have more fine grained crates, or different organizations, etc

For example, perhaps we should pull the string functions into datafusion/functions-string crate

Additional context

No response

@alamb
Copy link
Contributor Author

alamb commented Feb 1, 2024

FYI @Weijun-H and @jayzhan211 -- do you have any thoughts about moving the array functions into their own crate?

@jayzhan211
Copy link
Contributor

jayzhan211 commented Feb 1, 2024

I think it is a good idea to move array functions into their crate given the size of it. It will become more complex since there are still tons of array functions not yet implemented. For other functions like string, if they are large enough, we should move them into subcrate as well. I think we should keep the core as simple as possible.

Core functions crate like the standard library in rust, others go to sub-crate like third-party crates

@Weijun-H
Copy link
Member

Weijun-H commented Feb 1, 2024

It is timely to address the issues with array_experssion.rs, as @jayzhan211 stated that a lot of work still needs to be done on the array. I strongly support this ticket.

@alamb
Copy link
Contributor Author

alamb commented Feb 2, 2024

I whipped up an example of how an datafusion-functions-array crate might look: #9113

@Omega359
Copy link
Contributor

Omega359 commented Feb 2, 2024

Minor: can we match the feature flag prefix with the source code location? Just seems cleaner to me.

string_expressions
code location: datafusion/functions/src/string

unicode_expressions
code location: datafusion/functions/src/unicode

regexp_expressions
code location: datafusion/functions/src/regexp

.. etc

@alamb
Copy link
Contributor Author

alamb commented Feb 8, 2024

@Omega359 asked a good question on #8705 (comment) that I wanted to not get lost in comments on a merged pull request

  • Where should public rustdocs for the function go?

I think they should go in the datafusion-functions crate which is then both re-exported by the main datafusion crate:
Screenshot 2024-02-08 at 6 05 00 AM

Following the model of the existing expr_fns the encode and decode functions were re-exported in datafusion::prelude as well:

Screenshot 2024-02-08 at 6 06 33 AM
  • Benchmarks? Tests && test data? Is that all going to live at functions/benches and functions/tests ?

I think we should do this if/when possible -- benchmarks that depend on SessionContext for example probably need to stay in datafusion/core but otherwise we should move them to datafusion-functions

@alamb
Copy link
Contributor Author

alamb commented Feb 12, 2024

For anyone following along, we just merged #9113 to break out the array functions crate

What I plan to do this week is to make a few other PRs moving some functions so we have some more examples, and then file a bunch of "help wanted / good first issue tickets" to port the remaining functions per the list on this ticket

@alamb
Copy link
Contributor Author

alamb commented Feb 13, 2024

Note for myself: I tried to pull MakeArray into datafusion-functions core, but I found several other array functions (like union) rely on it, so I think it needs to go into datafusion-functions-array

@alamb
Copy link
Contributor Author

alamb commented Feb 20, 2024

I filed #9285 to track the actual work, let's continue the discussion there

@alamb alamb closed this as completed Feb 20, 2024
@l1t1
Copy link

l1t1 commented Feb 22, 2024

please add table functions, eg. generate_series

@alamb
Copy link
Contributor Author

alamb commented Feb 26, 2024

please add table functions, eg. generate_series

Seems like this request was made in #9323 ✅ -- thank you

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants