[WIP] Using types and not strings to represent Paths #26

Closed
wants to merge 4 commits into from

Conversation

oxinabox

@oxinabox oxinabox commented Feb 5, 2017

I started writing this over 4 months ago; but life got in the way.
It's now in a state to take feedback from others.
It could really do with it, I'm sure.

The current way paths are handled as strings hasn't changed much since it was written,
mostly in JuliaLang/julia@6f9fb22
in January 2013.

Proposal Abstract:

Add an AbstractPath type and deprecate open(::AbstractString) in favour of open(::AbstractPath).
AbstractPaths allow code to be written without caring where or how the data is stored.
Using types for paths allows us to enforce some validity and consistency rules.
This also allows multiple dispatch to differentiate between a path to a file and that file's contents as a string.

TODO

Running to-do list of things to add or change in the Julep before it can stop being WIP:

  • Section discussing how this relates to FileIO.jl
  • Determine whether or not this should apply to include.

@timholy
Member

timholy commented Feb 5, 2017

It has a slightly different emphasis, but I'm surprised not to see a reference to FileIO.jl here. I imagine your AbsolutePath and RelativePath could be added there. I do think that FileIO is the de facto standard, so I'm not yet convinced that this has to be done in Base. Of course, adoption rates may vary in different niches within the package ecosystem.

@oxinabox
Author

oxinabox commented Feb 5, 2017

I was actually thinking of FileIO.jl when I wrote this; I haven't yet written anything about it into the Julep.
My thoughts are that this Julep and FileIO.jl are highly complementary.

To get information out of a file (or file-like object), you need two things:

  • To know where it is (this Julep)
  • and to know how to interpret it (FileIO.jl)

I can see how this could fit into FileIO.jl, but it would be a widening of scope from FileIO.jl's current mission:
"FileIO aims to provide a common framework for detecting file formats and dispatching to appropriate readers/writers"

I see these as two separate but supporting tasks.
I am glad you brought this up; I'll make a note to improve the Julep with an extended version of this comment.

@tkelman

tkelman commented Feb 5, 2017

@mbauman
Member

mbauman commented Feb 6, 2017

I definitely see the case for adding path-like objects, but I'm not as certain about deprecating open(::AbstractString). It sounds like this is just a case where package authors need to widen their signatures. Would it solve your use-case if we generally advocated for methods to be written like f(::IO, ...) and f(x, ...) = open(io->f(io, ...), x)? If not, maybe you could add a bit more motivation for that change.
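
A minimal sketch of the pattern suggested above, assuming a hypothetical function `count_lines` (not part of Base or any package). The real work is defined on `IO`, and an untyped fallback hands anything else to `open`:

```julia
# Hypothetical example function, used only to illustrate the pattern.
# Core method: works on any stream that is already open.
count_lines(io::IO) = count(_ -> true, eachline(io))

# Fallback: treat any other argument as something `open` understands
# (today a filename string; a path object too, if `open` had a method for it).
count_lines(x) = open(io -> count_lines(io), x)

# The same function now works on an in-memory buffer and on a file.
n_buffer = count_lines(IOBuffer("a\nb\nc\n"))

tmp = tempname()
write(tmp, "one\ntwo\n")
n_file = count_lines(tmp)
rm(tmp)
```

Note that with this pattern a plain string is always read as a location, so the string-as-content case still has to go through `IOBuffer`.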

@oxinabox
Author

oxinabox commented Feb 7, 2017

@mbauman maybe I am wrong about the need to deprecate open(::AbstractString).

Maybe my instincts are wrong because I am not used to having multiple dispatch.
And perhaps because of my dislike for TIMTOWTDI.
You are right that package authors are used to PRs that ask them to widen their type signatures.

WRT encouraging the use of IO to match filenames:
It is a partial fix (and something we should be doing anyway) but issues remain:

  • It doesn't give validity or consistency in paths. It doesn't let us do things like correctly join p"C:/Windows/" * p"/system32" into p"C:/Windows/system32".
  • It doesn't let you have the three-way dispatch: content String, filepath, IO. As a package user you can get it by doing IOBuffer(content). (It is a bit messy, I would argue marginally worse than having to say p"/home/oxinabox/file.txt", but it is alright.) But as a package author you can't have your IO method dispatch to your string method, so you end up creating a second function called foo_ etc., which is just a bit kludgy.
  • Several packages load their data from a folder structure. E.g. you give a filepath to a folder "C:/data", and the package will load "C:/data/a.csv", "C:/data/b.csv", "C:/data/other/c.jld", "C:/data/other/d.rdata". By abstracting paths (and thus the join operation) we can (with a little hard thought) handle that case correctly whether the root directory is on a local hard drive or in a hierarchical database (or in any other weirder storage mechanism).
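
To make the last point concrete, here is a rough sketch of loader code written once against an abstract location type. Every name in it (`SketchLocation`, `LocalLoc`, `DictLoc`, `child`, `readtext`, `load_dataset`) is hypothetical and not taken from the Julep; the `Dict` stands in for a hierarchical database.

```julia
# All names below are hypothetical, for illustration only.
abstract type SketchLocation end

# A location on the local filesystem.
struct LocalLoc <: SketchLocation
    path::String
end

# A location inside an in-memory store whose keys are "/"-separated names.
struct DictLoc <: SketchLocation
    store::Dict{String,String}
    key::String
end

# The join operation is what each backend has to provide.
child(p::LocalLoc, name::AbstractString) = LocalLoc(joinpath(p.path, name))
child(p::DictLoc, name::AbstractString) =
    DictLoc(p.store, isempty(p.key) ? String(name) : p.key * "/" * name)

readtext(p::LocalLoc) = read(p.path, String)
readtext(p::DictLoc) = p.store[p.key]

# "Package" code: written once, with no knowledge of where the data lives.
function load_dataset(root::SketchLocation)
    a = readtext(child(root, "a.csv"))
    c = readtext(child(child(root, "other"), "c.txt"))
    return (a = a, c = c)
end

# The same loader on a real directory...
dir = mktempdir()
mkdir(joinpath(dir, "other"))
write(joinpath(dir, "a.csv"), "x,y\n1,2\n")
write(joinpath(dir, "other", "c.txt"), "hello")
from_disk = load_dataset(LocalLoc(dir))

# ...and on the in-memory store.
store = Dict("a.csv" => "x,y\n1,2\n", "other/c.txt" => "hello")
from_dict = load_dataset(DictLoc(store, ""))
```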

@tkelman

tkelman commented Feb 7, 2017

The place where this would be a most valuable addition at this time is working towards a "virtual filesystem" abstraction which could then be used for code loading on remote nodes.

@StefanKarpinski
Member

Apparently Racket does this and people have told me that it's a big win. Maybe check out what they do?

@c42f
Member

c42f commented May 15, 2017

I'm very positive about this julep.

@mbauman - can you describe in more detail the practical reasons for keeping open(::AbstractString)? To me, the reasoning given in this julep is compelling and deprecating this function seems worthwhile.

As a point of comparison, I've always found the roughly equivalent thing in C++ (boost::filesystem::path) to be irritating, intrusive and more effort than it's worth. I think this may be because:

  1. There's a huge weight of older code which assumes paths are strings, so you're constantly having to convert them. (Addressed here with the deprecation.)
  2. Writing path literals is a pain. (Addressed here with string macros.)
  3. The added functionality is not huge. (Addressed here with the improvements to multiple dispatch.)

To expand on point 1, deprecating open() for strings is also important so that the entire julia ecosystem can get on with using AbstractPath earlier rather than later. If not, I think we might end up with some packages preferring strings (if nothing else due to historical inertia from other languages), others preferring AbstractPath, and still others allowing both, by widening their type signatures to Any in places.

On the subject of literals, a thorny question: Should @p_str return a platform native file path, or should it somehow be a platform-independent syntax? On the one hand, it seems important for REPL usage that platform native paths are supported. On the other hand, for relative paths in packages, platform independence is fairly important. In both cases, path literals may be desired.

@MikeInnes
Member

I think Rust does this as well. They have something like a .to_path() trait on strings so you can use the basic apis as normal, so if we do a similar thing it doesn't have to be breaking. String macros can make this nice too, e.g. open(path"C:\foo").

@mbauman
Member

mbauman commented May 15, 2017

@mbauman - can you describe in more detail the practical reasons for keeping open(::AbstractString)?

It's a large deprecation, open("path/to/file") is the obvious thing and shared across many languages, and having open("…") mean something different would be surprising and subtle. From what I can see in their docs, both Racket and Rust allow either paths or strings in their open equivalents.

That's not to say deprecating open(::AbstractString) definitely shouldn't happen, but it needs to be well-motivated to overcome those trade-offs.

@c42f
Member

c42f commented May 16, 2017

Yes, I'm not so convinced about having open() with a string as the contents, just due to historical inertia from many languages. But regardless of that, the first step of making it work only with paths is appealing to me.

@oxinabox
Author

Yes, I'm not so convinced about having open() with a string as the contents...

I see I was unclear here.
I do not suggest that open(str::AbstractString) become what is currently open(IOBuffer(str)).
I suggest that user functions like in

function foo_process(content::AbstractString)
    # ...
end

# can coexist with:
foo_process(io::IO) = foo_process(read(io, String))  # originally readall(io), which later Julia versions removed
foo_process(filename::AbstractPath) = open(foo_process, filename)  # using open(::Function, ::AbstractPath)

But to accomplish this, open(::AbstractString) is to be deprecated in favor of open(::AbstractPath),
just so that package maintainers get that kick to tell them to update all their foo_process(filename::AbstractString) type functions that take a filepath to take an AbstractPath. (Of course they could be contrary, and add the path conversion inside their code for foo_process(filename::AbstractString).)

I suggest that open(::AbstractString) be deprecated without anything taking its place.
Maybe even left deprecated forever, with a link to the documentation page on how Julia differs from other languages.

I feel like adding the deprecation to open would be a faster way to get changes made.
The plan still works if open doesn't get deprecated.
Particularly if functions like joinpath are deprecated for join(::Path...),
and other such.
Which is useful in and of itself for handling system differences.

@c42f
Member

c42f commented May 16, 2017

Sounds perfectly sensible.

What do you think about the return type of @path_str? Perhaps we could designate the posix syntax as the standard for relative paths, and have path"foo/bar" return a generic RelativePath type, which would then be usable as a relative path literal across different operating systems? Joining a PosixPath with a RelativePath would obviously then result in a PosixPath, and likewise for windows paths.
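
A rough sketch of that idea, under the assumption that relative literals are always written in POSIX syntax. The type and function names (`DemoPath`, `RelativeSketch`, `PosixSketch`, `WindowsSketch`, `relative_literal`, `pathjoin`, `render`) are invented for illustration and are not the Julep's API:

```julia
# Hypothetical types, for illustration only.
abstract type DemoPath end

struct RelativeSketch <: DemoPath
    parts::Vector{String}
end

struct PosixSketch <: DemoPath
    parts::Vector{String}
end

struct WindowsSketch <: DemoPath
    drive::String
    parts::Vector{String}
end

# Relative literals use "/" on every platform (an assumption of this sketch).
relative_literal(s::AbstractString) =
    RelativeSketch(String.(split(s, '/'; keepempty = false)))

# Joining with a relative path keeps the type of the left-hand side.
pathjoin(a::RelativeSketch, b::RelativeSketch) = RelativeSketch(vcat(a.parts, b.parts))
pathjoin(a::PosixSketch, b::RelativeSketch) = PosixSketch(vcat(a.parts, b.parts))
pathjoin(a::WindowsSketch, b::RelativeSketch) = WindowsSketch(a.drive, vcat(a.parts, b.parts))

# Only the absolute types know how to print themselves natively.
render(p::PosixSketch) = "/" * join(p.parts, '/')
render(p::WindowsSketch) = p.drive * "\\" * join(p.parts, '\\')

rel = relative_literal("foo/bar")
posix = pathjoin(PosixSketch(["home", "user"]), rel)
win = pathjoin(WindowsSketch("C:", ["Users", "user"]), rel)
```

There is deliberately no method for joining two absolute paths, so that mistake would surface as a `MethodError` rather than as a silently wrong path.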

What about absolute paths, and how does this play nicely across systems where users will want to use the native path format in the REPL? Some systems adopt a platform-independent standard for writing path literals (eg, cmake, I think). The only libraries I recall parse strings as native paths, which is operating system dependent.

Side note - this stuff should work really nicely as a hint for tab completion.

@StefanKarpinski
Member

I think we need to identify what the concrete advantages of path types are and then determine what we need to do to get those advantages. I'm not convinced that disallowing strings as path arguments is necessary to get the advantages. But then again, we're still a bit hazy on what the advantages are precisely, so clarifying that needs to be the next step.

@simonbyrne
Contributor

It would be useful to link to what other languages with similar ideas have done (and, ideally, the reasons why they made those choices). For example, the rust RFC is here and points out many interesting issues (e.g. the possibility of unpaired UTF-16 surrogates in Windows paths).

@rofinn

rofinn commented Nov 7, 2017

I've been using FilePaths.jl for a while now and here are some notes from my experience.

Advantages:

  1. Being able to dispatch on a path type (vs string) is really nice.
  2. Adding an extra character (e.g., p"~/.julia/v0.6/FilePaths/") isn't a big issue and often helps with code readability if I'm just seeing p"FilePaths".

Disadvantages:

The main issue I ran into was that interop with other packages can be a bit annoying. I was always writing String(path) so I opted to subtype AbstractString for practical reasons, but this can often introduce method ambiguities when porting existing code. If a path type was provided in base and more widely used I could see going back to not subtyping AbstractString, but for now this seems like the best middle ground.

Overall, I think having a minimal file system path type hierarchy in base with appropriate string conversions would be a good step forward.

@simonbyrne
Contributor

I'm broadly in favour, but would prefer a non-single letter macro (perhaps path""?)

@rofinn

rofinn commented Nov 7, 2017

Hmmm, I was mostly wanting to mimic r"^[a-z]*$"/Regex("^[a-z]*$") in base and if the macro name is too long it kind of defeats the point from my perspective (e.g., Path("FilePaths") is only two more characters than path"FilePaths").

@StefanKarpinski
Member

So far the only advantage cited here is that "being able to dispatch on a path type is really nice". The fact that p"/path/to/file" is only a character longer than "/path/to/file" is not actually an advantage – it's an absence of much disadvantage. I have a gut feeling that there might be real advantages here, but if there are, they're not being conveyed very effectively.

@rofinn

rofinn commented Nov 8, 2017

FWIW, my view was just that we have DateTime instead of Int, Regex instead of String and IPv4 instead of Int because they're distinct concepts that have specific rules associated with them that do not apply to the more general C-ish representations (e.g., match(::Regex, ...) makes more sense than match(::String)). Similarly, basename and parent don't really make sense for strings, but they do for filesystem paths. Looks like this is pretty much the same argument proposed for pathlib being included in the python stdlib. NOTE: I could also see an argument for having a URI type in base for similar reasons.

I can't think of any "real" advantages apart from having a type that is distinct from a general string for representing filesystem paths feels a bit more ergonomic and has helped me avoid a few bugs... but that also summarizes why I'm not writing my code in C :)

@c42f
Member

c42f commented Nov 8, 2017

Yes, it's capturing the semantic that paths are a "different kind of thing" which makes this interesting. Being able to use dispatch effectively is the most obvious sign that this might be worthwhile. Here's some minor advantages related to literals:

  • We have cross platform relative path literals. For comparison I know that cmake does this by adopting the posix standard as the canonical separator when writing path literals.
  • Path literals allow for smarter tab completion (only do path completion for p"...")

But these advantages are a bit of a sideshow, I think.

Perhaps some concrete use cases might be helpful. Here's a contribution from me (apologies that it's not fully concrete, it reflects work I've done, but more in C++ than julia).

Say I want to write code which passes around either S3 URLs or file paths pointing to some point cloud data. I don't want to open the resources right away, so I need to pass around something which is an address for the data, which is a perfect use for AbstractPath. Eventually I want to pass my AbstractPath to a hypothetical ThirdPartyPackage.jl which calls LasIO.jl which opens the stream using open() and reads some point cloud data in LAS format. To get this to work, all packages need to agree that AbstractPath is the right thing to use, or fall back to Any for the resource names.

@simonbyrne
Contributor

Other arguments:

  1. It distinguishes between cases where files and strings are both valid arguments. One case I came across recently was in SHA.jl, where sha1(::String) hashes the data in the string, but to hash the contents of a file you have to do SHA.sha1(open("filename")): this is different from similar functions such as Base.read.
    Similarly, we could use include(::String) instead of include_string.

  2. We can leverage dispatch for different types of paths, e.g.

include(path"filename.jl")
include(URI("www.example.com/run.jl"))
include(GitPath(repo, commit, pathinrepo))

could be defined, and then have recursive include calls work by overloading joinpath.
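
The first argument can be sketched with the SHA standard library. `FileRef` and `hexdigest` are hypothetical names standing in for a path type and a user-level function; only `sha1` and `bytes2hex` are real library calls:

```julia
using SHA  # standard library

# Hypothetical stand-in for a path type.
struct FileRef
    name::String
end

# A string is data; an IO is a stream of data; a FileRef is a location.
hexdigest(data::AbstractString) = bytes2hex(sha1(data))
hexdigest(io::IO) = bytes2hex(sha1(io))
hexdigest(f::FileRef) = open(hexdigest, f.name)

tmp = tempname()
write(tmp, "abc")

d_string = hexdigest("abc")         # hashes the three characters
d_file = hexdigest(FileRef(tmp))    # hashes the file's contents
d_name = hexdigest(tmp)             # hashes the *name* of the file, not its contents
rm(tmp)
```

With a distinct type for locations, the last call is unambiguous: it can only mean "hash this string".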

@StefanKarpinski
Member

The distinction between String as data and String as location seems significant. The irregularity between path"filename.jl" and URI("www.example.com/run.jl") doesn't seem great. I could see url"www.example.com/run.jl" or go the other way and use functions for all of the above.

Other advantages that I might hope for with path types:

  • Writing p"foo/bar/baz.jl" would be normalized according to the system path separator so that it becomes foo/bar/baz.jl on UNIX systems and foo\bar\baz.jl on Windows. This lets you write path code the way you want to and get native paths consistently. The conversion step from string to path object would also be a natural point to normalize the name by changing / to \ on Windows and remove double // etc.

  • Treat paths as some kind of collection of path components, i.e. p"foo/bar/baz.jl"[end] gives you p"baz.jl" as just the filename part of a path or p"foo/bar/baz.jl"[1:end-1] to get the dirname part.

  • Have some convenient path-join syntax, e.g. p"foo" ++ p"bar" ++ p"baz.jl" which would do the equivalent of joinpath("foo", "bar", "baz.jl"). Using ++ here would dovetail well with treating path objects as indexable collections of path components.

  • Use different types for absolute and relative paths. E.g. p"/foo/bar" would be of type AbsolutePath while p"bar/baz" would be of type RelativePath. This seems like it could introduce some runtime type instability, but I actually think that a lot of the time this is fairly predictable. This would allow different behaviors for absolute and relative paths more elegantly.
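
A small sketch of the normalization and "collection of components" ideas from this list. `PartsPath` and `native` are hypothetical names. The sketch assumes both `/` and `\` are separators on input (which would be wrong for POSIX filenames that contain a backslash), and it tracks absoluteness with a flag rather than with the separate types suggested in the last bullet:

```julia
# Hypothetical type, for illustration only.
struct PartsPath
    absolute::Bool
    parts::Vector{String}
end

# Parsing is the natural point to normalize: unify separators, drop doubled ones.
function PartsPath(s::AbstractString)
    unified = replace(s, '\\' => '/')
    return PartsPath(startswith(unified, "/"),
                     String.(split(unified, '/'; keepempty = false)))
end

Base.:(==)(a::PartsPath, b::PartsPath) = a.absolute == b.absolute && a.parts == b.parts

# Index a path as a collection of components.
Base.firstindex(p::PartsPath) = 1
Base.lastindex(p::PartsPath) = length(p.parts)
Base.getindex(p::PartsPath, i::Integer) = PartsPath(false, [p.parts[i]])
Base.getindex(p::PartsPath, r::AbstractUnitRange) =
    PartsPath(p.absolute && first(r) == 1, p.parts[r])

# Print with a chosen separator; defaults to the platform's own.
native(p::PartsPath, sep = Base.Filesystem.path_separator) =
    (p.absolute ? sep : "") * join(p.parts, sep)

p = PartsPath("foo//bar\\baz.jl")
filename = p[end]
directory = p[1:end-1]
```

Here `native(p, "/")` gives `foo/bar/baz.jl` and `native(p, "\\")` gives the backslash-separated form, regardless of how the literal was written.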

@mauro3

mauro3 commented Nov 24, 2017

Could this make ~/somefile and somedir/*.jpg work as expected?

@vtjnash
Member

vtjnash commented Nov 25, 2017

I don't think that's necessarily connected to using types. You can do that already:

using Glob
readdir(glob"somedir/*.jpg", expanduser("~") #= aka homedir() =#)

@c42f
Member

c42f commented Nov 25, 2017

Good point. Though manually having to type expanduser is not quite the slick experience you might hope for if you're used to path expansion in the shell.

[edit: TBH I've tried using the literal "~/blah" out of habit before, been unsurprised that it doesn't work, and looked no further. Path literals potentially give us the opportunity to make this "just work" for users.]

@twolodzko

Check the issue linked above. It makes a similar proposal.

Adding to what was said, beyond path literals, I propose having a / method for joining paths, so that they feel almost like system paths and are instantly readable: p"/home/username" / var_directory / "file.txt", as Python's pathlib does.

Additionally, pathlib has some extra functionality, like iterating over a path's parents, iterating over the files within a path, wildcard paths etc., worth considering.

@rofinn

rofinn commented Nov 13, 2020

FilePathsBase.jl already provides that functionality, though I don’t think the optional division operator overloading would make it into base.

rofinn/FilePathsBase.jl#53

@twolodzko

@StefanKarpinski mentioned ++ as path join syntax, / is more consistent with how we write paths, so seems to be more self-explanatory for the user.

@rofinn

rofinn commented Nov 13, 2020

I agree, which is why I used it in FilePathsBase. I think the issue is just that we don't want to have an operator like / mean two completely different things. Also, / is a pretty unix centric choice :)

Discussion about different operators here rofinn/FilePathsBase.jl#2

@vtjnash
Member

vtjnash commented Nov 13, 2020

It's just logical: / is the file path divider

@oxinabox
Author

oxinabox commented Nov 13, 2020

What we are actually doing when we write A/B is forming the quotient set of all paths with parent A, declaring equivalence according to whether they further have the parent B or not,
then we enter the element of the quotient set which was for the ones that do have parent B and consider them as if we had not applied the equivalence.

tl;dr; / is just a set quotient operation on the set of all filepath parents.

@twolodzko

twolodzko commented Nov 13, 2020

@rofinn don't agree about /.

First, yes, it is a Unix convention, but people nowadays more often than not use Unix-like systems on their personal computers (macOS; even under Windows you can use a Bash shell, or even Ubuntu as a "software") or when working remotely (computational server, cloud computing, Docker etc.). URLs also use this convention, so everyone seems to be familiar with it.

Second, currently * is used for concatenating strings. Honestly, I found * in Julia to be a strange choice, why would / be more "meaning two different things" than *? Using * for paths would be confusing (for me), since with paths we don't just concatenate them, but use joinpath that normalizes them. ++ proposed by @StefanKarpinski is used in Haskell for concatenating strings, so for combining string-like path objects it can be considered as confusing as well. Also, it's two characters in place where we could use one, and for strings most of the operators are not overloaded yet.

So / is simple and intuitive for paths. People coming from Python would find it as an almost instant replacement for pathlib functionality. Other users should find it similar to the system path, or URL separators.

@rofinn

rofinn commented Nov 13, 2020

Again, that's largely why I opted to use / and keep it as an option. It just isn't available by default and I don't think it belongs in a base/stdlib implementation. I think it at least requires a using FilePathsBase: / from the end user to make it explicit what the syntax does.

@StefanKarpinski
Member

Being fastidious about not punning on operators is a pretty core Julian principle. It's fine if people do it in their own code, but mixing up "divide" and "concatenate this path with this other path" in one generic function is not really cool.

@tpapp

tpapp commented Nov 14, 2020

Second, currently * is used for concatenating strings. Honestly, I found * in Julia to be a strange choice

Some people do, but the choice is made now (there is even a FAQ about it), so simply using it for paths would be somewhat consistent, as pretty much all of the arguments apply in a similar way to strings.

@rofinn

rofinn commented Nov 14, 2020

The problem with reusing * is that then we can't use it for normal string concatenation:

p"foo" / "bar" / "baz" * ".txt" == p"foo/bar/baz.txt"

@StefanKarpinski
Member

StefanKarpinski commented Nov 14, 2020

Another possible approach is to allow interpolation into path strings with / in the string meaning path separator. I.e. p"/home/$user/$dir/$name.txt" would mean joinpath(Base.Filesystem.path_separator, "home", user, dir, "$name.txt"). An interesting question in that case would be what should happen if dir is an absolute path? Note that joinpath operation would discard the /home/$user part in that case. Of course, I don't think this problem is specific to the approach of writing p"/home/$user/$dir/$name.txt": the same question exists if you write p"/home" / user / dir / "$name.txt", in which case it also seems more surprising if dir being absolute caused you to get a path that didn't start with /home/$user than if you used joinpath.
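
The absolute-component behaviour described here can be seen with today's string-based `joinpath`. The variable values are made up for the example, and the results noted in the comments are for POSIX systems (Windows uses a different separator and different rules for what counts as absolute):

```julia
user, dir, name = "alice", "projects", "notes"

# The expected case: every component is relative.
expected = joinpath("/home", user, dir, "$name.txt")
# on POSIX: "/home/alice/projects/notes.txt"

# The surprising case: `dir` turns out to be absolute,
# so everything before it is discarded.
dir_abs = "/etc"
surprising = joinpath("/home", user, dir_abs, "$name.txt")
# on POSIX: "/etc/notes.txt"
```

A path literal with interpolation would have to pick one of three behaviours for the second case: discard like `joinpath`, throw an error, or treat the interpolated value as relative.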

@oxinabox
Author

That seems interesting and I would need to think about it more.

But one key lack is that it can't be passed as an input to a higher order function and it can't be broadcast.

I often broadcast joinpath.
(Though probably less now that readdir has an option to give a full path)
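
For reference, a short sketch of the two usages mentioned here, run against a throwaway temporary directory. The `join` keyword of `readdir` requires Julia 1.4 or later:

```julia
# Set up a scratch directory with two files.
dir = mktempdir()
foreach(f -> touch(joinpath(dir, f)), ["a.csv", "b.csv"])

# Broadcasting joinpath: `dir` is treated as a scalar, the names as a vector.
via_broadcast = joinpath.(dir, readdir(dir))

# The readdir option that makes the broadcast unnecessary here.
via_keyword = readdir(dir; join = true)

# joinpath as an argument to a higher-order function, which a literal cannot be.
via_map = map(name -> joinpath(dir, name), readdir(dir))
```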

@StefanKarpinski
Member

I often broadcast joinpath.

😬

@tpapp

tpapp commented Nov 15, 2020

The problem with reusing * is that then we can't use it for normal string concatenation:

Not if you disallow * on mixtures of paths and strings and require that operands are made into paths instead, or choose semantics so that path * string is concatenated (without path separators) while path * path is joined with path separators. E.g. the latter would be

p"foo" * p"bar" * p"baz" * ".txt" == p"foo/bar/baz.txt"

Personally, I am OK with joinpath.
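
A minimal sketch of the second option (path * path joins, path * string concatenates), using a hypothetical wrapper type `StarPath` written with an explicit constructor instead of a `p"..."` literal:

```julia
# Hypothetical type, for illustration only.
struct StarPath
    s::String
end

Base.:(==)(a::StarPath, b::StarPath) = a.s == b.s

# path * path: join with the platform's path separator.
Base.:*(a::StarPath, b::StarPath) = StarPath(joinpath(a.s, b.s))

# path * string: plain concatenation, no separator added.
Base.:*(a::StarPath, b::AbstractString) = StarPath(a.s * b)

result = StarPath("foo") * StarPath("bar") * StarPath("baz") * ".txt"
target = StarPath(joinpath("foo", "bar", "baz.txt"))
```

The cost of this design is that the meaning of `*` depends on the type of the right-hand operand, which a reader cannot see when it is a variable.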

@StefanKarpinski
Member

I think the most reasonable path (😬) forward would be:

  • Treat path objects as strings for the purposes of * so that code that works on paths as strings can be used as-is on any mix of paths and strings;
  • Keep using joinpath as the programmatic operator for path joins and make it work on any combination of paths and strings;
  • Support interpolation in path literals so that you can write things like p"/path/$var/$name.txt" and get a correct platform-specific path while using / as the path separator syntactically.

Path literals can also have features like making it easier to write Windows paths when you have to.

@oxinabox
Author

Path literals can also have features like making it easier to write Windows paths when you have to

Hasn't Windows accepted / or \ since like Windows XP or something?

@StefanKarpinski
Member

I thought there were situations where you need to use \ such as when specifying a drive? If not, then we can just require /.

@ararslan
Member

UNC drives on Windows require \. There may be other cases as well, though / works most of the time.

@StefanKarpinski
Member

The path literal approach is very flexible: if a path literal starts with a valid UNC drive sequence, then it can allow single backslashes in the rest. Another reason we may want to allow p"C:\path\to\blah" syntax is that it matches what gets printed in a lot of places, including ones that we don't control. Another thing to consider is that it may be fine to not have any escape syntax in path strings: putting sequences that require escapes in paths is very rare and if you really want to do it, you can always interpolate a string.

@c42f
Member

c42f commented Nov 20, 2020

I've been working with abstracting data location recently (see DataSets.jl) and I've noticed anew that there's a really big difference in the genericity of relative vs absolute path types.

  • Relative paths are really simple and general: they're the keys() of any hierarchically-indexed directed-graph-like data structure. Usually we'd consider string path components, though in general you could have other things. (Eg paths into json-like data would contain integers for array indices.) A concrete type and string macro constructor for relative paths would be useful for many things not related to the filesystem. Things like the path components of URLs, etc.

  • Absolute paths address a resource which might or might not exist. They're formed of a path root and a relative path.

  • Path roots are where the complexity and system-dependence lies. Multi-root filesystems with drive letters, UNC paths, URL scheme and host info, etc etc. It's even useful to have roots which are Julia datastructures without any obvious string representation. The system dependent parsing rules for filesystem roots are a real problem when it comes to absolute path literals in portable code.

However I'd observe that portable code likely gets the path root from somewhere programmatically and rarely needs absolute paths. From this point of view, a relative path literal would be fine, especially if it could incorporate a few things like tilde expansion.

Alas, doing away with absolute path literals is not going to satisfy anyone who wants to write a quick script unless we've got a compelling replacement. For system dependent stuff, perhaps we could have winpath"C:\foo\bar" and posixpath"/foo/bar" etc.

What can you do with an abstract absolute path?

In generic code which takes AbstractAbsolutePath, you can

  • joinpath() with it and a relative path.
  • Use it for dispatch to distinguish from String and already opened resources like IO.

But other than that, I don't think it's clear what you can do! There's some other contenders for generic verbs but they have their problems

  • There's the family of functions isfile(), isdir(), stat() etc; but these are rather specific to the filesystem. Do they make sense for absolute path-like objects such as URLs?
  • There's the function open(), but this doesn't even apply to normal directory paths in a meaningful way. For directories, the closest equivalent is readdir().

If you think about open(path) long enough, you'll realize another problem: there's more than one way to reflect an abstract resource into the program as a Julia type. Even for normal file paths, you have the options in open() vs mmap(). In DataSets.jl, I'm experimenting with open(T, path) (where path::DataSet) when trying to attack this problem.
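
To illustrate the open(T, path) idea, here is a sketch in which the caller names the Julia type the resource should be reflected as. `ResourceSketch` is a hypothetical type; this is not the DataSets.jl API, only a guess at the general shape:

```julia
# Hypothetical resource type wrapping a local filename.
struct ResourceSketch
    filename::String
end

# One resource, several ways to bring it into the program.
Base.open(::Type{String}, r::ResourceSketch) = read(r.filename, String)
Base.open(::Type{Vector{UInt8}}, r::ResourceSketch) = read(r.filename)
Base.open(::Type{IO}, r::ResourceSketch) = open(r.filename)

tmp = tempname()
write(tmp, "hi")
res = ResourceSketch(tmp)

as_text = open(String, res)
as_bytes = open(Vector{UInt8}, res)

io = open(IO, res)          # the caller owns the stream and must close it
first_byte = read(io, UInt8)
close(io)
rm(tmp)
```

A memory-mapped variant would be another method on the same function, which is the point: the choice of representation moves from the function name into the first argument.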

@vtjnash
Member

vtjnash commented Feb 16, 2022

I think the suggested path forward here is https://github.com/rofinn/FilePaths.jl. We aren't going to make breaking changes to the file system functions in base to stop using strings, and I think it does make the most sense for the primary API to be strings, but with the option for the user to layer a more advanced type on top (particularly for more complex cases such as non-local resources).

But this could be a discourse post or discussion here, if we want to continue with the julep proposal written here.
