Add option to include source filename and filepath in dataframe #10481

Closed
Tracked by #15441
D1xieFlatline opened this issue Aug 14, 2023 · 12 comments · Fixed by #17563
Labels: A-io (Area: reading and writing data), accepted (Ready for implementation), enhancement (New feature or an improvement of an existing feature), P-medium (Priority: medium)

@D1xieFlatline

Problem description

When reading data from a large number of files, it can be helpful to keep track of the source file for a few reasons:

  • Identifying the source of data issues
  • Ability to reload a specific file rather than refreshing the entire data set
  • Giving users visibility into where data came from

You can always capture the file name/path in a variable and add that to the df after a file is loaded, but this creates extra steps and doesn't seem to work well with globs.

Adding options to include the name/path of source files when a file is read would be a nice quality of life feature.

I propose adding two new parameters to each file input function:

  • include_source_path: If True, includes the file path as an additional column. Default: False
  • include_source_name: If True, includes the file name as an additional column. Default: False
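
A minimal sketch of the semantics proposed above, to make the two flags concrete. This is not Polars code: `read_csv_glob` is a hypothetical helper built on the standard library's `csv` and `glob` modules as a dependency-free stand-in for a real reader, and the added column names `source_path` and `source_name` are assumptions chosen for illustration.

```python
import csv
import glob
import os
import tempfile


def read_csv_glob(pattern, include_source_path=False, include_source_name=False):
    """Read every CSV matching `pattern` into a list of row dicts.

    Hypothetical helper: the two flags mirror the parameters proposed above,
    and the added column names are assumptions made for this sketch.
    """
    rows = []
    # Sort the matches so the output order is deterministic.
    for path in sorted(glob.glob(pattern, recursive=True)):
        with open(path, newline="") as handle:
            for row in csv.DictReader(handle):
                if include_source_path:
                    row["source_path"] = path
                if include_source_name:
                    row["source_name"] = os.path.basename(path)
                rows.append(row)
    return rows


# Usage: write two small files to a temporary directory and read them back.
with tempfile.TemporaryDirectory() as tmp:
    for name, value in [("a.csv", "1"), ("b.csv", "2")]:
        with open(os.path.join(tmp, name), "w", newline="") as handle:
            handle.write("x\n" + value + "\n")

    rows = read_csv_glob(os.path.join(tmp, "*.csv"), include_source_name=True)

print(rows)
```

With both flags left at their default of False, the helper returns the rows unchanged, which matches the opt-in behaviour described in the proposal.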
D1xieFlatline added the enhancement label Aug 14, 2023
@MarcoGorelli
Collaborator

Hey - I think #5117 (comment) would allow you to do this

@D1xieFlatline
Author

D1xieFlatline commented Aug 14, 2023

Hi Marco, thanks for sharing that enhancement. I think that looks useful, but after reading the linked examples I see two key differences from this suggestion:

  • Users need to tag metadata rather than providing an option to capture it automatically when a dataframe is created
  • It doesn't appear that metadata is automatically passed on when a dataframe is written to a file

I usually add some variation of these three lines every time I read a file into a dataframe, and I think I'd still need to do some version of that if #5117 was implemented.

pl.lit(source_path).alias('SOURCE_FILEPATH'),
pl.lit(source_file).alias('SOURCE_FILENAME'),
pl.lit(source_sheet).alias('SOURCE_SHEETNAME'),

Maybe a good compromise would be to capture any parameter that's used to read a file as metadata by default?

@MarcoGorelli
Collaborator

Thanks for your suggestion!

Writing

pl.lit(source_path).alias('SOURCE_FILEPATH'),
pl.lit(source_file).alias('SOURCE_FILENAME'),

looks fine, I don't really see the advantage compared with

include_source_path=True,
include_source_name=True,

Closing then

@MSKDom

MSKDom commented Mar 3, 2024

I'm sorry, but how does the above actually address the concerns? If you are reading file by file then it is fine, but the bulk read commands do not expose that option.

@MarcoGorelli
Collaborator

Thanks for the ping - reopening for now, will take another look in the week

MarcoGorelli reopened this Mar 3, 2024
MarcoGorelli added the A-io label Mar 3, 2024
@MSKDom

MSKDom commented Mar 4, 2024

Just to clarify and add some context.

I've used a bulk CSV read on a path such as foo/*/*.csv. Part of each filename is a timestamp of when the file was processed/uploaded, so something like this would solve the issue, as I could simply apply a transform on that column to extract it.

Instead I had to traverse the file tree, use the standard library's globbing to find matches, and read file by file. Not a massive workaround, but it does hinder the ability to use built-in methods and makes things a bit slower.
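
The transform mentioned in this comment could look roughly like the sketch below, applied to whatever source-path values a reader would expose. The comment does not give the actual naming scheme, so the pattern `<label>_<YYYYMMDD>T<HHMMSS>.csv`, the example paths, and the helper name `upload_time` are all assumptions made for illustration.

```python
import re
from datetime import datetime
from pathlib import PurePosixPath

# Assumed naming scheme for this sketch: <label>_<YYYYMMDD>T<HHMMSS>.csv
TIMESTAMP_RE = re.compile(r"_(\d{8}T\d{6})$")


def upload_time(path):
    """Return the timestamp embedded in a file name, or None if absent."""
    match = TIMESTAMP_RE.search(PurePosixPath(path).stem)
    if match is None:
        return None
    return datetime.strptime(match.group(1), "%Y%m%dT%H%M%S")


# Hypothetical paths, as a glob such as foo/*/*.csv might return them.
paths = [
    "foo/eu/sales_20240301T093000.csv",
    "foo/us/sales_20240302T181500.csv",
    "foo/us/notes.csv",
]
times = [upload_time(p) for p in paths]
```

Files whose names carry no timestamp yield None instead of raising, so a single unexpected file does not abort a bulk load.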

@cmdlineluser
Contributor

I think another valid use case is when using a remote glob? e.g. scan_csv("http://.../foo/*.csv")

In which case the workaround approach is not applicable.

For reference, DuckDB has filename=true for its readers

MarcoGorelli added the accepted label Mar 8, 2024
github-project-automation bot moved this to Ready in Backlog Mar 8, 2024
@MarcoGorelli
Collaborator

per discussion: accepted (pending some discussion on the dtype of the filename column)

@deanm0000
Collaborator

I wonder if this would intrinsically fix #14936 or if that issue would still persist.

@MSKDom

MSKDom commented Apr 3, 2024

It would persist, @deanm0000. This only appends metadata to files, so I would guess a different area is involved.

@klwlevy

klwlevy commented Apr 15, 2024

I have the same problem as described by @MSKDom, i.e. having to loop over files to import them instead of bulk loading, because I need some filename info inside the DataFrame I create.
A simple but somewhat inelegant workaround is to use DuckDB instead of Polars for the bulk loading of flat files.

@pietrolesci

+1. Having a functionality similar to DuckDB's filename flag would be great!


8 participants