
ARROW-12290: [Rust][DataFusion] Add input_file_name function #9944

Closed

Conversation

@seddonm1 (Contributor) commented Apr 8, 2021

For lineage and diffing purposes (used by protocols like DeltaLake) it can be useful to know the source of the input data for a DataFrame. This adds the input_file_name function which, like Spark's function of the same name, returns the name of the file being read, or NULL if not available.

Unfortunately the Arrow RecordBatch does not have the ability to serialise this information correctly, so this is runtime-only. See: https://lists.apache.org/thread.html/rd1ab179db7e899635351df7d5de2286915cc439fd1f48e0057a373db%40%3Cdev.arrow.apache.org%3E
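
To make the intended semantics concrete, here is a minimal, self-contained sketch of the behaviour described above. It is not the PR's actual implementation (which carries the information in the Schema metadata); `BatchContext` and its field are hypothetical stand-ins:

```rust
/// Hypothetical stand-in for per-batch context; the PR itself carries this
/// information in the RecordBatch's Schema metadata instead.
struct BatchContext {
    /// Populated by file-based readers (CSV, JSON, Parquet); None otherwise.
    input_file_name: Option<String>,
}

/// Mirrors Spark's `input_file_name()`: the path of the file backing the
/// current batch, or None (NULL in SQL terms) if the source is not a file.
fn input_file_name(ctx: &BatchContext) -> Option<&str> {
    ctx.input_file_name.as_deref()
}

fn main() {
    let from_file = BatchContext { input_file_name: Some("data/part-0001.csv".into()) };
    let in_memory = BatchContext { input_file_name: None };
    assert_eq!(input_file_name(&from_file), Some("data/part-0001.csv"));
    assert_eq!(input_file_name(&in_memory), None); // NULL in SQL terms
}
```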


@alamb (Contributor) left a comment

I think this PR implements the intention very well (and, as always, is wonderfully tested, @seddonm1 💯 🏅).

I would like to see if we can figure out how to avoid passing an (almost always) unused Schema parameter to every function, but if it is unavoidable then I can also live with this implementation.

Thanks again @seddonm1

```diff
@@ -144,7 +144,7 @@ fn md5_array<T: StringOffsetSizeTrait>(
 }

 /// crypto function that accepts Utf8 or LargeUtf8 and returns a [`ColumnarValue`]
-pub fn md5(args: &[ColumnarValue]) -> Result<ColumnarValue> {
+pub fn md5(args: &[ColumnarValue], _: &Schema) -> Result<ColumnarValue> {
```

Yeah, it is unfortunate that we had to plumb the schema argument all the way through.

Another pattern I have seen (though it has its own downsides) is to use some sort of thread_local storage to pass stuff like this (from the RecordBatch's schema into the expression evaluation code).

So, before evaluating a physical_expr, we would set some sort of thread_local holding a pointer back to the Schema that input_file_name could consult. This would avoid plumbing a mostly unused argument all over the place.
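
A minimal sketch of what that thread_local pattern could look like, assuming the executor sets the slot before evaluating each batch; all names here are hypothetical, not DataFusion API:

```rust
use std::cell::RefCell;

thread_local! {
    // Hypothetical slot holding the name of the file behind the batch
    // currently being evaluated; set by the execution loop, read by functions.
    static CURRENT_INPUT_FILE: RefCell<Option<String>> = RefCell::new(None);
}

/// Called by the executor before evaluating expressions against a batch.
fn set_current_input_file(name: Option<String>) {
    CURRENT_INPUT_FILE.with(|f| *f.borrow_mut() = name);
}

/// What an `input_file_name()` implementation could consult, with no
/// extra Schema parameter threaded through every function signature.
fn current_input_file() -> Option<String> {
    CURRENT_INPUT_FILE.with(|f| f.borrow().clone())
}

fn main() {
    set_current_input_file(Some("part-0001.parquet".into()));
    assert_eq!(current_input_file().as_deref(), Some("part-0001.parquet"));
    set_current_input_file(None);
    assert_eq!(current_input_file(), None);
}
```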

What do you think of this approach, @jorgecarleitao and @Dandandan?

@alamb (Contributor) commented Apr 8, 2021

(BTW there is a small lint error on the PR)

@jorgecarleitao (Member) left a comment

Hi @seddonm1, thank you for this PR.

I think that this is a very important use-case :)

I do not think that this is the right design, though, and I think we should consider another way of handling it.

My concerns:

  1. We are using metadata to store this information, which means we are reserving the key `filename` in the metadata for all dependencies of Arrow. This will cause us to incorrectly report lineage if someone writes `filename` to the metadata for some other purpose (e.g. outside the context of the Arrow or DataFusion crates). This will be difficult to identify, and will likely cause frustration.

  2. We are initializing Arrow's CSV and JSON readers with a parameter that IMO should not be defined in the context of Rust: a `Read` object does not need to be a file, and it is confusing why we are passing both a filename and something that implements `Read` (are we reading from the `Read`, or creating a new `File` from that filename?).

  3. We are modifying the signature of all functions for the purposes of solving a relatively small use-case.

IMO that parameter should live entirely in DataFusion, as part of the TableProvider API, which is where we expose input-specific information about the table.

Can't we compute the lineage by traversing the DAG up to the leaves and introspecting the TableProvider? E.g. use the visitor pattern to reach the leaves of the DAG (the TableProviders) and call TableProvider::file_name() or something. Something along the same lines as what @alamb wrote to show a query's logical and physical plans, but fetching the filename instead of the whole query.
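
A toy sketch of that visitor idea, using a stand-in plan enum rather than DataFusion's actual LogicalPlan; `TableProvider::file_name()` is hypothetical in the comment above, so the per-leaf `file_name` field here is too:

```rust
// A toy plan tree standing in for DataFusion's LogicalPlan; the visitor
// walks down to the leaves and collects whatever file names the
// (hypothetical) TableProvider::file_name() would report.
enum Plan {
    Projection(Box<Plan>),
    Filter(Box<Plan>),
    /// Leaf scan; `file_name` is None for non-file sources (in-memory tables).
    TableScan { file_name: Option<String> },
}

/// Recursively visit the DAG and gather source files from its leaves.
fn collect_input_files(plan: &Plan, out: &mut Vec<String>) {
    match plan {
        Plan::Projection(child) | Plan::Filter(child) => collect_input_files(child, out),
        Plan::TableScan { file_name } => {
            if let Some(name) = file_name {
                out.push(name.clone());
            }
        }
    }
}

fn main() {
    let plan = Plan::Projection(Box::new(Plan::Filter(Box::new(
        Plan::TableScan { file_name: Some("lineage/input.csv".into()) },
    ))));
    let mut files = Vec::new();
    collect_input_files(&plan, &mut files);
    assert_eq!(files, vec!["lineage/input.csv".to_string()]);
}
```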

@seddonm1 (Contributor, Author) commented Apr 9, 2021

Thanks both of you. I will have a look at the visitor pattern that @jorgecarleitao suggested, as I agree this is quite dirty. Let me see what is possible and have another go.
