Add option to include source filename and filepath in dataframe #10481
Comments
Hey - I think #5117 (comment) would allow you to do this
Hi Marco, thanks for sharing that enhancement. I think that looks useful, but after reading the linked examples I see two key differences from this suggestion:
I usually add some variation of these three lines every time I read a file into a dataframe (sketched below), and I think I'd still need to do some version of that if #5117 were implemented.
Maybe a good compromise would be to capture any parameter that's used to read a file as metadata by default?
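For context, a minimal sketch of that per-file pattern, assuming Polars and a hypothetical local CSV path (the column names mirror the snippet quoted in the next comment):

```python
import os
import polars as pl

source_path = "data/2023/january.csv"  # hypothetical path, for illustration only
source_file = os.path.basename(source_path)

# Read one file, then tag every row with the file it came from.
df = pl.read_csv(source_path).with_columns(
    pl.lit(source_path).alias("SOURCE_FILEPATH"),
    pl.lit(source_file).alias("SOURCE_FILENAME"),
)
```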
Thanks for your suggestion! Writing `pl.lit(source_path).alias('SOURCE_FILEPATH'), pl.lit(source_file).alias('SOURCE_FILENAME')` looks fine to me, and I don't really see the advantage of a dedicated option compared with that.
Closing then
I'm sorry, but how does the above actually address the concerns? If you are doing it file by file then it is fine, but bulk read commands do not expose that option.
Thanks for the ping - reopening for now, will take another look in the week
Just to clarify and add some context: I wanted to use the bulk CSV read on a glob path, but instead I had to traverse the file tree, use the standard library's globbing to find matches, and read file by file. Not a massive workaround, but it does hinder the ability to use built-in methods, as well as making it a bit slower.
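A rough sketch of that workaround, assuming local CSV files under a hypothetical `data/` directory:

```python
import glob
import polars as pl

# Workaround: glob manually and read file by file, instead of handing the
# glob pattern straight to the bulk reader, so each frame can be tagged
# with its source path before everything is concatenated.
frames = [
    pl.read_csv(path).with_columns(pl.lit(path).alias("SOURCE_FILEPATH"))
    for path in glob.glob("data/**/*.csv", recursive=True)
]
df = pl.concat(frames)
```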
I think another valid use case is a remote glob, in which case the workaround approach is not applicable. For reference, DuckDB has a comparable option.
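DuckDB's CSV reader can emit the source file alongside the data via its `filename` option; a small sketch using the Python client (the glob is a placeholder, and remote globs additionally need the httpfs extension):

```python
import duckdb

# DuckDB appends a "filename" column holding each row's source file when
# filename = true is passed to read_csv.
result = duckdb.sql("SELECT * FROM read_csv('data/*.csv', filename = true)")
print(result.df())  # materialize as a pandas DataFrame
```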
Per discussion: accepted (pending some discussion on the dtype of the new column).
I wonder if this would intrinsically fix #14936 or if that issue would still persist.
It would persist, @deanm0000. This only appends metadata to files, so I would guess it's a different area involved.
I have the same problem as described by @MSKDom, i.e. having to loop over files to import instead of bulk loading, because I need some filename info inside my created DataFrame.
+1. Having functionality similar to DuckDB's would be very welcome.
Problem description
When reading data from a large number of files, it can be helpful to keep track of the source file for a few reasons:
You can always capture the file name/path in a variable and add that to the df after a file is loaded, but this creates extra steps and doesn't seem to work well with globs.
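To make the glob limitation concrete, a minimal sketch (the pattern is a placeholder): a single bulk call merges rows from every matching file, so there is no single path left to put into `pl.lit`.

```python
import polars as pl

# One call reads every matching file, but the combined frame carries no
# record of which row came from which file.
df = pl.read_csv("data/*.csv")
```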
Adding options to include the name/path of source files when a file is read would be a nice quality-of-life feature.
I propose adding two new parameters to each file input function: