ARROW-12290: [Rust][DataFusion] Add input_file_name function #9944
Conversation
I think this PR implements the intention very well (and, as always, is wonderfully tested @seddonm1 💯 🏅).
I would like to see if we can figure out how to avoid having to pass an (almost always) unused `Schema` parameter to every function, but if it is unavoidable then I can also live with this implementation.
Thanks again @seddonm1
```diff
@@ -144,7 +144,7 @@ fn md5_array<T: StringOffsetSizeTrait>(
 }

 /// crypto function that accepts Utf8 or LargeUtf8 and returns a [`ColumnarValue`]
-pub fn md5(args: &[ColumnarValue]) -> Result<ColumnarValue> {
+pub fn md5(args: &[ColumnarValue], _: &Schema) -> Result<ColumnarValue> {
```
Yeah, it is unfortunate that we had to plumb the `schema` argument all the way through.
Another pattern I have seen (though it has its own downsides) is to use some sort of `thread_local` storage to pass stuff like this (from the `RecordBatch`'s schema into the expression evaluation code).
So, before evaluating a physical_expr, we would set some sort of `thread_local` that had a pointer back to the `Schema` that `input_file_name` could consult. This would avoid plumbing a mostly unused argument all over the place.
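For illustration, a minimal sketch of that thread-local pattern (the names `CURRENT_SCHEMA`, `set_current_schema`, and `current_schema` are hypothetical, not part of DataFusion):

```rust
use std::cell::RefCell;
use std::sync::Arc;

use arrow::datatypes::Schema;

thread_local! {
    // Schema of the RecordBatch currently being evaluated, if any.
    static CURRENT_SCHEMA: RefCell<Option<Arc<Schema>>> = RefCell::new(None);
}

/// Called by the physical-expr evaluator just before evaluating a batch.
fn set_current_schema(schema: Arc<Schema>) {
    CURRENT_SCHEMA.with(|s| *s.borrow_mut() = Some(schema));
}

/// Functions like `input_file_name` would consult this instead of
/// taking an extra `&Schema` argument in their signature.
fn current_schema() -> Option<Arc<Schema>> {
    CURRENT_SCHEMA.with(|s| s.borrow().clone())
}
```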
What do you think of this approach @jorgecarleitao and @Dandandan?
(BTW there is a small lint error on the PR)
Hi @seddonm1, thank you for this PR.
I think that this is a very important use-case :)
However, I do not think that this is the right design; we should think about another way of handling this.
My concerns:

- We are using metadata to store this information, which means that we are reserving the key `filename` on the metadata for all dependencies of Arrow. This will cause us to incorrectly report lineage if someone writes `filename` to the metadata with something else (e.g. outside the context of the Arrow or DataFusion crates). This will be difficult to identify and will likely cause frustrations.
- We are initializing Arrow's CSV and JSON readers with a parameter that IMO should not be defined in the context of Rust: a `Read` object does not need to be a file, and it is IMO confusing why we are passing both a `filename` and something that implements `Read` (are we reading from the `Read`, or creating a new `File` from that filename?).
- We are modifying the signature of all functions for the purposes of solving a relatively small use-case.

IMO that parameter should live entirely in DataFusion, as part of the `TableProvider` API, which is where we expose input-specific information about the table.
Can't we compute the lineage by traversing the DAG up to the leaves and introspecting the `TableProvider`? E.g. use the visitor pattern to reach the leaves of the DAG (the `TableProvider`s) and call `TableProvider::file_name()` or something, along the same lines as what @alamb wrote to show the queries' logical and physical plans, but fetching the filename instead of the whole query.
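A rough sketch of that visitor idea, using stand-in types so it is self-contained (`TableProvider::file_name` does not exist today; it is the method being proposed above):

```rust
use std::sync::Arc;

// Stand-ins for the real DataFusion types, for illustration only.
trait TableProvider {
    /// Proposed addition: the source file backing this table, if any.
    fn file_name(&self) -> Option<String>;
}

enum LogicalPlan {
    TableScan { provider: Arc<dyn TableProvider> },
    // All other plan variants reduced to "a node with children" here.
    Node { inputs: Vec<LogicalPlan> },
}

/// Walk the plan down to its leaf TableScan nodes, collecting file names.
fn collect_input_files(plan: &LogicalPlan, out: &mut Vec<String>) {
    match plan {
        LogicalPlan::TableScan { provider } => {
            if let Some(name) = provider.file_name() {
                out.push(name);
            }
        }
        LogicalPlan::Node { inputs } => {
            for child in inputs {
                collect_input_files(child, out);
            }
        }
    }
}
```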
Thanks both of you. I will have a look at the visitor pattern that @jorgecarleitao suggested as I agree this is quite dirty. Let me see what is possible and have another go.
For lineage and diffing purposes (used by protocols like DeltaLake) it can be useful to know the source of input data for a DataFrame. This adds the `input_file_name` function which, like Spark's, returns the name of the file being read, or NULL if not available.
Unfortunately the Arrow RecordBatch does not have the ability to serialise this information correctly, so this is runtime only. See: https://lists.apache.org/thread.html/rd1ab179db7e899635351df7d5de2286915cc439fd1f48e0057a373db%40%3Cdev.arrow.apache.org%3E