
ARROW-12290: [Rust][DataFusion] Add input_file_name function #9944

Closed

Conversation

@seddonm1 (Contributor) commented Apr 8, 2021

For lineage and diffing purposes (used by protocols like DeltaLake) it can be useful to know the source of the input data for a DataFrame. This adds the input_file_name function which, like Spark's function of the same name, returns the name of the file being read, or NULL if not available.

Unfortunately the Arrow RecordBatch does not have the ability to serialise this information correctly, so this is runtime-only. See: https://lists.apache.org/thread.html/rd1ab179db7e899635351df7d5de2286915cc439fd1f48e0057a373db%40%3Cdev.arrow.apache.org%3E
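
To make the intended semantics concrete, here is a minimal, self-contained sketch of the behaviour described above. It is not the PR's actual implementation (which carries the information in the Schema metadata); `BatchContext` and its field are hypothetical stand-ins:

```rust
/// Hypothetical stand-in for per-batch context; the PR itself carries this
/// information in the RecordBatch's Schema metadata instead.
struct BatchContext {
    /// Populated by file-based readers (CSV, JSON, Parquet); None otherwise.
    input_file_name: Option<String>,
}

/// Mirrors Spark's `input_file_name()`: the path of the file backing the
/// current batch, or None (NULL in SQL terms) if the source is not a file.
fn input_file_name(ctx: &BatchContext) -> Option<&str> {
    ctx.input_file_name.as_deref()
}

fn main() {
    let from_file = BatchContext { input_file_name: Some("data/part-0001.csv".into()) };
    let in_memory = BatchContext { input_file_name: None };
    assert_eq!(input_file_name(&from_file), Some("data/part-0001.csv"));
    assert_eq!(input_file_name(&in_memory), None); // NULL in SQL terms
}
```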


@alamb (Contributor) left a comment

I think this PR implements the intention very well (and, as always, is wonderfully tested, @seddonm1 💯 🏅).

I would like to see if we can figure out how to avoid passing an (almost always) unused Schema parameter to every function, but if it is unavoidable then I can also live with this implementation.

Thanks again @seddonm1

```diff
@@ -144,7 +144,7 @@ fn md5_array<T: StringOffsetSizeTrait>(
 }

 /// crypto function that accepts Utf8 or LargeUtf8 and returns a [`ColumnarValue`]
-pub fn md5(args: &[ColumnarValue]) -> Result<ColumnarValue> {
+pub fn md5(args: &[ColumnarValue], _: &Schema) -> Result<ColumnarValue> {
```

Yeah, it is unfortunate that we had to plumb the schema argument all the way through.

Another pattern I have seen (though it has its own downsides) is to use some sort of thread_local storage to pass stuff like this (from the RecordBatch's schema into the expression evaluation code).

So, before evaluating a physical_expr, we would set some sort of thread_local holding a pointer back to the Schema that input_file_name could consult. This would avoid plumbing a mostly unused argument all over the place.
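
A minimal sketch of what that thread_local pattern could look like, assuming the executor sets the slot before evaluating each batch; all names here are hypothetical, not DataFusion API:

```rust
use std::cell::RefCell;

thread_local! {
    // Hypothetical slot holding the name of the file behind the batch
    // currently being evaluated; set by the execution loop, read by functions.
    static CURRENT_INPUT_FILE: RefCell<Option<String>> = RefCell::new(None);
}

/// Called by the executor before evaluating expressions against a batch.
fn set_current_input_file(name: Option<String>) {
    CURRENT_INPUT_FILE.with(|f| *f.borrow_mut() = name);
}

/// What an `input_file_name()` implementation could consult, with no
/// extra Schema parameter threaded through every function signature.
fn current_input_file() -> Option<String> {
    CURRENT_INPUT_FILE.with(|f| f.borrow().clone())
}

fn main() {
    set_current_input_file(Some("part-0001.parquet".into()));
    assert_eq!(current_input_file().as_deref(), Some("part-0001.parquet"));
    set_current_input_file(None);
    assert_eq!(current_input_file(), None);
}
```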

What do you think of this approach, @jorgecarleitao and @Dandandan?

@alamb (Contributor) commented Apr 8, 2021

(BTW there is a small lint error on the PR)

@jorgecarleitao (Member) left a comment

Hi @seddonm1, thank you for this PR.

I think that this is a very important use-case :)

I do not think that this is the right design, though, and I think we should consider another way of handling it.

My concerns:

  1. We are using metadata to store this information, which means we are reserving the key `filename` in the metadata for all dependencies of Arrow. This will cause us to incorrectly report lineage if someone writes `filename` to the metadata for some other purpose (e.g. outside the context of the Arrow or DataFusion crates). This will be difficult to identify, and will likely cause frustration.

  2. We are initializing Arrow's CSV and JSON readers with a parameter that IMO should not be defined in the context of Rust: a `Read` object does not need to be a file, and it is confusing why we are passing both a filename and something that implements `Read` (are we reading from the `Read`, or creating a new `File` from that filename?).

  3. We are modifying the signature of all functions for the purposes of solving a relatively small use-case.

IMO that parameter should live entirely in DataFusion, as part of the TableProvider API, which is where we expose input-specific information about the table.

Can't we compute the lineage by traversing the DAG up to the leaves and introspecting the TableProvider? E.g. use the visitor pattern to reach the leaves of the DAG (the TableProviders) and call TableProvider::file_name() or something. Something along the same lines as what @alamb wrote to show a query's logical and physical plans, but fetching the filename instead of the whole query.
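
A toy sketch of that visitor idea, using a stand-in plan enum rather than DataFusion's actual LogicalPlan; `TableProvider::file_name()` is hypothetical in the comment above, so the per-leaf `file_name` field here is too:

```rust
// A toy plan tree standing in for DataFusion's LogicalPlan; the visitor
// walks down to the leaves and collects whatever file names the
// (hypothetical) TableProvider::file_name() would report.
enum Plan {
    Projection(Box<Plan>),
    Filter(Box<Plan>),
    /// Leaf scan; `file_name` is None for non-file sources (in-memory tables).
    TableScan { file_name: Option<String> },
}

/// Recursively visit the DAG and gather source files from its leaves.
fn collect_input_files(plan: &Plan, out: &mut Vec<String>) {
    match plan {
        Plan::Projection(child) | Plan::Filter(child) => collect_input_files(child, out),
        Plan::TableScan { file_name } => {
            if let Some(name) = file_name {
                out.push(name.clone());
            }
        }
    }
}

fn main() {
    let plan = Plan::Projection(Box::new(Plan::Filter(Box::new(
        Plan::TableScan { file_name: Some("lineage/input.csv".into()) },
    ))));
    let mut files = Vec::new();
    collect_input_files(&plan, &mut files);
    assert_eq!(files, vec!["lineage/input.csv".to_string()]);
}
```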

@seddonm1 (Contributor, Author) commented Apr 9, 2021

Thanks both of you. I will have a look at the visitor pattern that @jorgecarleitao suggested, as I agree this is quite dirty. Let me see what is possible and have another go.
