Add option to include source filename and filepath in dataframe #10481
Comments
Hey - I think #5117 (comment) would allow you to do this
Hi Marco, thanks for sharing that enhancement. I think that looks useful, but after reading the linked examples I see two key differences from this suggestion:
I usually add some variation of these three lines every time I read a file into a dataframe (sketched below), and I think I'd still need to do some version of that if #5117 were implemented.
Maybe a good compromise would be to capture any parameter that's used to read a file as metadata by default?
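For context, a minimal sketch of that per-file pattern, assuming Polars and a hypothetical local CSV path (the column names mirror the snippet quoted in the next comment):

```python
import os
import polars as pl

source_path = "data/2023/january.csv"  # hypothetical path, for illustration only
source_file = os.path.basename(source_path)

# Read one file, then tag every row with the file it came from.
df = pl.read_csv(source_path).with_columns(
    pl.lit(source_path).alias("SOURCE_FILEPATH"),
    pl.lit(source_file).alias("SOURCE_FILENAME"),
)
```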
Thanks for your suggestion! Writing `pl.lit(source_path).alias('SOURCE_FILEPATH'), pl.lit(source_file).alias('SOURCE_FILENAME')` looks fine to me, and I don't really see the advantage of a dedicated option compared with that.
Closing then
I'm sorry, but how does the above actually address the concerns? If you are doing it file by file then it is fine, but bulk read commands do not expose that option.
Thanks for the ping - reopening for now, will take another look in the week
Just to clarify and add some context: I wanted to use the bulk CSV read on a glob path, but instead I had to traverse the file tree, use the standard library's globbing to find matches, and read file by file. Not a massive workaround, but it does hinder the ability to use built-in methods, as well as making it a bit slower.
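A rough sketch of that workaround, assuming local CSV files under a hypothetical `data/` directory:

```python
import glob
import polars as pl

# Workaround: glob manually and read file by file, instead of handing the
# glob pattern straight to the bulk reader, so each frame can be tagged
# with its source path before everything is concatenated.
frames = [
    pl.read_csv(path).with_columns(pl.lit(path).alias("SOURCE_FILEPATH"))
    for path in glob.glob("data/**/*.csv", recursive=True)
]
df = pl.concat(frames)
```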
I think another valid use case is a remote glob, in which case the workaround approach is not applicable. For reference, DuckDB has a comparable option.
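DuckDB's CSV reader can emit the source file alongside the data via its `filename` option; a small sketch using the Python client (the glob is a placeholder, and remote globs additionally need the httpfs extension):

```python
import duckdb

# DuckDB appends a "filename" column holding each row's source file when
# filename = true is passed to read_csv.
result = duckdb.sql("SELECT * FROM read_csv('data/*.csv', filename = true)")
print(result.df())  # materialize as a pandas DataFrame
```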
Per discussion: accepted (pending some discussion on the dtype of the new column).
I wonder if this would intrinsically fix #14936 or if that issue would still persist.
It would persist, @deanm0000. This only appends metadata to files, so I would guess it's a different area involved.
I have the same problem as described by @MSKDom, i.e. having to loop over files to import instead of bulk loading, because I need some filename info inside my created DataFrame.
+1. Having functionality similar to DuckDB's would be very welcome.
Problem description
When reading data from a large number of files, it can be helpful to keep track of the source file for a few reasons:
You can always capture the file name/path in a variable and add that to the df after a file is loaded, but this creates extra steps and doesn't seem to work well with globs.
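To make the glob limitation concrete, a minimal sketch (the pattern is a placeholder): a single bulk call merges rows from every matching file, so there is no single path left to put into `pl.lit`.

```python
import polars as pl

# One call reads every matching file, but the combined frame carries no
# record of which row came from which file.
df = pl.read_csv("data/*.csv")
```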
Adding options to include the name/path of source files when a file is read would be a nice quality-of-life feature.
I propose adding two new parameters to each file input function: