Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding sort=true keyword to readdir() #24675

Closed
wants to merge 8 commits into from
Closed

Adding sort=true keyword to readdir() #24675

wants to merge 8 commits into from

Conversation

carstenbauer
Copy link
Member

See #24626.

Copy link
Member

@StefanKarpinski StefanKarpinski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With a test to verify that entries are sorted, and an entry in NEWS.md this should be good to go. Great PR!

base/file.jl Outdated Show resolved Hide resolved
@kshyatt kshyatt added io Involving the I/O subsystem: libuv, read, write, etc. needs news A NEWS entry is required for this change needs tests Unit tests are required for this change labels Nov 21, 2017
test/file.jl Outdated Show resolved Hide resolved
base/file.jl Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
NEWS.md Outdated Show resolved Hide resolved
@StefanKarpinski
Copy link
Member

How much CI breakage are we expecting these days? These all seem unrelated but that's not as much green as I'd like to see before merging. In particular, no Windows tests passing is unfortunate since the new tests would have passed previously on other OSes.

@Sacha0
Copy link
Member

Sacha0 commented Nov 22, 2017

The Travis i686 failure is another instance of the presently common libgit2 failure I mentioned last night on slack/infrastructure. Other instances:

https://travis-ci.org/JuliaLang/julia/jobs/305354871
https://travis-ci.org/JuliaLang/julia/jobs/305373961
https://travis-ci.org/JuliaLang/julia/jobs/305354871
https://travis-ci.org/JuliaLang/julia/jobs/305535401

@omus mentioned that #24338 might help sort that failure out? Best!

@tkelman
Copy link
Contributor

tkelman commented Nov 22, 2017

What kind of code are you (or @bjarthur) writing that's sensitive to the order here? A case sensitive sorting usually doesn't make sense on Windows.

@StefanKarpinski
Copy link
Member

A case sensitive sorting usually doesn't make sense on Windows.

Why not? The point of this change is so that the default behavior is more platform independent, not to match any system-specific behavior. If you really want the native sort order, you can request it by doing readdir(sort=false) which clearly warns the user that they may not get names in order (even if they happen to on their current system).

@omus
Copy link
Member

omus commented Nov 22, 2017

The Travis i686 failure is another instance of the presently common libgit2 failure I mentioned last night on slack/infrastructure

I'm looking into it currently.

@tkelman
Copy link
Contributor

tkelman commented Nov 22, 2017

File system behavior isn't platform independent, and it doesn't seem worth making the default slower to pretend it is here. When working with the file system, seems more direct to return what the system call gives you, and add a call to sort! in user code where this actually makes a difference. Isn't composability and doing the fast thing by default the general julian API style?

@rfourquet
Copy link
Member

Isn't composability and doing the fast thing by default the general julian API style?

Cf. #939 (comment)

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Nov 22, 2017

and it doesn't seem worth making the default slower to pretend it is here

How many file names do you expect to be sorting? The cost of sorting an array of short strings is going to be dwarfed by the cost of the readdir operation.

Isn't composability and doing the fast thing by default the general julian API style?

Doing the correct thing by default where it is reasonably fast is the Julian style; if that's not possible, then do the fast thing by default and make the correct thing possible, if slower.

@tkelman
Copy link
Contributor

tkelman commented Nov 22, 2017

Sorting in a platform-inappropriate way is arguably less correct behavior for a list of file names than respecting the platform's iteration order by default.

@stevengj
Copy link
Member

stevengj commented Nov 22, 2017

How many file names do you expect to be sorting? The cost of sorting an array of short strings is going to be dwarfed by the cost of the readdir operation.

I did a couple of quick benchmarks, and it seemed to be a 5% slowdown; surprisingly, I got about the same overhead for both a small directory (40 entries) or a large directory (1000 entries)… not sure why. Maybe someone else should try it too?

I don't have strong feelings about this, but defaulting to the same documented order on all platforms seems a bit friendlier, and makes it easier to reproduce results across platforms/filesystems (since most programmers are probably unaware that the sort order varies).

@carstenbauer
Copy link
Member Author

Maybe someone else should try it too?

Running a quick benchmark on my machine, I find 9.8% slowdown for 50 entries and 11.1% for 1000 entries.

@StefanKarpinski
Copy link
Member

Sorting in a platform-inappropriate way is arguably less correct behavior for a list of file names than respecting the platform's iteration order by default.

Correct by what criterion?

5-10% slowdown for consistent cross-platform behavior seems quite acceptable with an easy opt-out if someone really cares about the slowdown.

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Nov 23, 2017

@omus, reasons for that opinion? The pro case is that having consistent cross-platform behavior is desirable. The counterarguments seems to be:

  1. Performance cost (5-10%),
  2. Applications may want the native sort order – it's unclear (to me) why though.

What's clear is that applications "get used to" whatever sort order they get on the platforms they run on. Many applications are not run on multiple platforms, so they can get away with making platform-specific assumptions if we allow them to. There doesn't seem to be any clear case why native order is better (aside from the performance cost of sorting). If we leave native order as the default, then we cannot change this in the future without annoying people and breaking code that hasn't been tested across platforms. If we sort by default, then that's one more thing that works the same everywhere by default when people write programs in Julia. I can't see how that's not a win.

@omus
Copy link
Member

omus commented Nov 23, 2017

From my experience when using readdir I don't rely on any particular ordering and instead treat the result as an iterable set. The only case where ordering should matter is when writing test cases. During testing reproduciblity is more important than performance for me so I don't mind at all incurring the small performance penalty.

With that said I would be fine with this being merged. To me though this is mostly a documentation issue.

@rfourquet
Copy link
Member

I'm mildly in favor of defaulting to the native order (i.e.sort=false), because I would tend to expect readdir to be as thin a wrapper of a system call as possible, although I don't have a case where this would matter. But it's not clear for me why a sorted order is relevant here, except e.g. when displaying to the user, in which case I find it better style to explicitly call sort! (or sort=true) to make the intent clear.

Also, is it likely that returning an iterator instead of a Vector be considered in the future? in which case, sorting would not be an option anymore. Alternatively, a new function could be introduced, like Python's os.scandir (cf. https://www.python.org/dev/peps/pep-0471/ and this issue which requests that os.listdir return an iterator).

@stevengj
Copy link
Member

stevengj commented Nov 23, 2017

There are lots of cases in which the ordering matters for more than appearance. Imagine a case where you do:

for testfile in readdir("tests")
    runtest(testfile)
end

and it works fine on the developers machine, but on the user's machine it breaks because the order changes. You could say that the developer should not have depended on the order (intentionally or not), but cross-platform changes like this might be annoying to debug.

It's hard for me to imagine a case in which a 10% performance penalty would matter for readdir. In any performance-sensitive setting, surely you'd be doing something with the filenames that would swamp the 10%.

@StefanKarpinski
Copy link
Member

Possible applications where order matters:

  • testing
  • hashing the contents of a file tree
  • writing a file browser widget
  • anything user-facing

The real issue here as @stevengj mentions is that we do get guaranteed sorted order on some systems but users can silently get a different and possibly broken experience. The fact that it would only be broken on Windows makes this situation worse since it's often the case that all of a library's developers work on UNIX systems but many users are on Windows.

@rfourquet: if we're going to default to sort=false then we might as well just have people do sort!(readdir()) since it's shorter anyway. I do understand that since readdir() is the name of a libc function this would be expected to be a thin wrapper – bit it already isn't since it's a call to libuv, which already provides a level of platform independence. I would be fine with renaming this to listdir to avoid that expectation.

@tkelman
Copy link
Contributor

tkelman commented Nov 23, 2017

For a browser or anything user facing, the platform order is much more natural. The function doesn't return things in a completely random order on Windows, it returns things in an order that's consistent with how nearly all other tools on the platform display files - sorted but case insensitive. It would be somewhat unnatural and surprising on Windows to group file names that start with an upper case letter first.

@rfourquet
Copy link
Member

rfourquet commented Nov 23, 2017

if we're going to default to sort=false then we might as well just have people do sort!(readdir()) since it's shorter anyway

True, but functions like walkdir would still need to get this sort keyword.

I would be fine with renaming this to listdir to avoid that expectation.

I think I would indeed prefer the name listdir if it's sorted by default.

hashing the contents of a file tree

I myself wrote such a script (with caching) to compare efficiently tree duplicates! And I can't imagine forgetting to sort the lists explicitly; but I'm using unix, so admitedly I would not expect to have a reliable order.

@StefanKarpinski
Copy link
Member

but I'm using unix, so admitedly I would not expect to have a reliable order.

But this is precisely the problem: on UNIX you do get a reliable order. So it's easy to not realize that you may not get the same order elsewhere.

For a browser or anything user facing, the platform order is much more natural.

That's debatable. If someone really wants native ordering, they can just pass sort=false. A big part of the point of the runtime in a language is to abstract over the system details and present a consistent platform on which to run programs. In Julia code, there is a canonical ordering for strings – sorted lexicographically by code point (or more generally for invalid UTF-8, sorted by memcmp).

@rfourquet
Copy link
Member

But this is precisely the problem: on UNIX you do get a reliable order

Ah OK... but not reproducible, in that if a directory A is copied into B, then the list of files obtained in A won't match that of B, so for getting reliable "hash signature" of a directory tree (so that A and B have the same signature), I need to sort the entries.

@tkelman
Copy link
Contributor

tkelman commented Nov 23, 2017

If this were a serious problem that really warranted platform abstraction, then other languages would be sorting rather than following the platform conventions. This isn't just any arbitrary list of strings that get returned, it's a list of file names reflecting file system contents. You're not going to get case collisions on Windows where the Julia sorting conventions for strings are necessary or appropriate - the appropriate sorting function for a list of file names on Windows is one that ignores case.

That's debatable

Does anyone else on this thread actually use windows? What's debatable about following platform conventions? UI toolkits that follow platform look and feel have proven far more popular and useful than ones that attempt to abstract everything into their own opinion about the single correct way to do things regardless of where they're run.

@StefanKarpinski
Copy link
Member

There’s nothing preventing UI toolkits from following the native sorting. You’ve homed in on the one use case that can be argued either way here. That doesn’t address testing issues, hashing, reproducibility, etc.

The argument that other languages don’t abstract this platform behavior so we shouldn’t is pretty thin. Open source scripting languages like Python have traditionally not supported Windows well at all and made decisions like this long before it was a priority.

@tkelman
Copy link
Contributor

tkelman commented Nov 24, 2017

Other languages have been around a very long time, and have had ample opportunity to make a change corresponding to this if they had ever deemed it worth doing. Python 2 to Python 3 made far more consequential changes than this one, but this issue didn't end up included as something they wanted to change during that, or any previous, transitions.

This isn't a matter of "supporting windows better" - it would be overriding the behavior that's most appropriate and conventional for that platform. I don't think that's a desirable default.

@carstenbauer
Copy link
Member Author

I guess I would argue that most users might not really care or properly know about platform specific order. On windows, the only case I can think of where the average user might really notice windows order is in the file explorer. But, I don't think that most users would really think of this as "windows order" but more (weird) "windows explorer order". In this sense, people won't expect windows order in Julia. They expect the output to be ordered or not - because the concept of windows and linux order probably does not exists in their heads.

However, this PR wasn't really my "idea". I just wanted to contribute and looked for "good first issue" tags and created 2 PRs.

BTW: @tkelman I do use windows.

@tkelman
Copy link
Contributor

tkelman commented Dec 4, 2017

It's not a weird ordering specific to explorer, it's a case insensitive order for a case insensitive file system, in every program and essentially every programming language that does this operation of listing files.

@PallHaraldsson
Copy link
Contributor

PallHaraldsson commented Dec 4, 2017

The PR seems good to me (with sort by default), but it could be skipped on platforms that actually do give you sorted order (does that for sure then happen for all filesystems?).

@vtjnash
Copy link
Member

vtjnash commented Dec 4, 2017

One thing I would like to clarify is that the operating system doesn't have an order. Talking about "windows order" and "linux order" aren't logical, since they don't have a canonical ordering. Even "Window's explorer order" is ambiguous, since I don't know if you're talking about pre or post Windows XP or what your group policy setting is (nor are the rules well documented: specifically, they say that it may change). You could talk about the filesystem listing order, but in many cases, that's effectively the same as unordered.

@stevengj
Copy link
Member

stevengj commented Dec 4, 2017

Doing let x = readdir(d); issorted(x) ? x : sort!(x); end instead of sort!(readdir(d)) speeds things up on my machine (MacOS) where the directory listing is already sorted — 3% slowdown instead of 7% slowdown vs. readdir for a directory with 1000 entries.

Wouldn't hurt to add this optimization, at least on non-Windows systems where issorted(x) will commonly return true.

@StefanKarpinski
Copy link
Member

Bump. I still think this is a good idea and the overhead is negligible and you can opt out of it. Yes, the user can explicitly sort the output of readdir but (a) you have to know that you need to do that in order to get consistent results across platforms, and (b) that doesn't help with walkdir.

@StefanKarpinski StefanKarpinski added the triage This should be discussed on a triage call label Mar 23, 2018
@JeffBezanson
Copy link
Member

It's still not clear to me why readdir needs to give names in the same order on all platforms, when for example Dict iteration order isn't specified even on the same platform at different times. In other words, it's fine to have constructs with this kind of ordering behavior.

There also isn't a canonical "correct" sort order. We would be inserting our own opinion about how to sort strings, instead of just doing what other tools on the same platform do.

@StefanKarpinski
Copy link
Member

Well, I want Dicts to preserve order too 😁

@JeffBezanson
Copy link
Member

Currently, we don't make any guarantees about what the order is. Adding such a guarantee is non-breaking, and therefore we can put this aside for now.

@JeffBezanson JeffBezanson removed the triage This should be discussed on a triage call label Mar 29, 2018
@carstenbauer
Copy link
Member Author

Given the discussion above, it's probably best to just add the sort keyword and keep false as the default, right? I just want to get some feedback before I finish this up.

@stevengj
Copy link
Member

stevengj commented Jun 25, 2019

@crstnbr, if sort=false is the default, there's no point — as @StefanKarpinski commented above, you might as well type sort!(readdir(...)) in that case.

@carstenbauer
Copy link
Member Author

Of course, 🤦. Maybe this should be closed then.

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Oct 12, 2019

Reopen? https://mobile.twitter.com/bmarwell/status/1181821143835201536. Preventing even one scientific reproducibility failure like this one seems like a worthwhile reason to sort by default. Compared to the cost of reading the contents of a directory, the cost of sorting the names is never going to matter. The scenario where someone needs the names in native order seems pretty contrived to me and like something it's fine to have people opt into.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
io Involving the I/O subsystem: libuv, read, write, etc. needs news A NEWS entry is required for this change needs tests Unit tests are required for this change
Projects
None yet
Development

Successfully merging this pull request may close these issues.