-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding sort=true keyword to readdir() #24675
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With a test to verify that entries are sorted, and an entry in NEWS.md
this should be good to go. Great PR!
How much CI breakage are we expecting these days? These all seem unrelated but that's not as much green as I'd like to see before merging. In particular, no Windows tests passing is unfortunate since the new tests would have passed previously on other OSes. |
The Travis i686 failure is another instance of the presently common libgit2 failure I mentioned last night on slack/infrastructure. Other instances: https://travis-ci.org/JuliaLang/julia/jobs/305354871 @omus mentioned that #24338 might help sort that failure out? Best! |
What kind of code are you (or @bjarthur) writing that's sensitive to the order here? A case sensitive sorting usually doesn't make sense on Windows. |
Why not? The point of this change is so that the default behavior is more platform independent, not to match any system-specific behavior. If you really want the native sort order, you can request it by doing |
I'm looking into it currently. |
File system behavior isn't platform independent, and it doesn't seem worth making the default slower to pretend it is here. When working with the file system, seems more direct to return what the system call gives you, and add a call to |
Cf. #939 (comment) |
How many file names do you expect to be sorting? The cost of sorting an array of short strings is going to be dwarfed by the cost of the
Doing the correct thing by default where it is reasonably fast is the Julian style; if that's not possible, then do the fast thing by default and make the correct thing possible, if slower. |
Sorting in a platform-inappropriate way is arguably less correct behavior for a list of file names than respecting the platform's iteration order by default. |
I did a couple of quick benchmarks, and it seemed to be a 5% slowdown; surprisingly, I got about the same overhead for both a small directory (40 entries) or a large directory (1000 entries)… not sure why. Maybe someone else should try it too? I don't have strong feelings about this, but defaulting to the same documented order on all platforms seems a bit friendlier, and makes it easier to reproduce results across platforms/filesystems (since most programmers are probably unaware that the sort order varies). |
Running a quick benchmark on my machine, I find 9.8% slowdown for 50 entries and 11.1% for 1000 entries. |
Correct by what criterion? 5-10% slowdown for consistent cross-platform behavior seems quite acceptable with an easy opt-out if someone really cares about the slowdown. |
@omus, reasons for that opinion? The pro case is that having consistent cross-platform behavior is desirable. The counterarguments seems to be:
What's clear is that applications "get used to" whatever sort order they get on the platforms they run on. Many applications are not run on multiple platforms, so they can get away with making platform-specific assumptions if we allow them to. There doesn't seem to be any clear case why native order is better (aside from the performance cost of sorting). If we leave native order as the default, then we cannot change this in the future without annoying people and breaking code that hasn't been tested across platforms. If we sort by default, then that's one more thing that works the same everywhere by default when people write programs in Julia. I can't see how that's not a win. |
From my experience when using With that said I would be fine with this being merged. To me though this is mostly a documentation issue. |
I'm mildly in favor of defaulting to the native order (i.e. Also, is it likely that returning an iterator instead of a |
There are lots of cases in which the ordering matters for more than appearance. Imagine a case where you do: for testfile in readdir("tests")
runtest(testfile)
end and it works fine on the developers machine, but on the user's machine it breaks because the order changes. You could say that the developer should not have depended on the order (intentionally or not), but cross-platform changes like this might be annoying to debug. It's hard for me to imagine a case in which a 10% performance penalty would matter for |
Possible applications where order matters:
The real issue here as @stevengj mentions is that we do get guaranteed sorted order on some systems but users can silently get a different and possibly broken experience. The fact that it would only be broken on Windows makes this situation worse since it's often the case that all of a library's developers work on UNIX systems but many users are on Windows. @rfourquet: if we're going to default to |
For a browser or anything user facing, the platform order is much more natural. The function doesn't return things in a completely random order on Windows, it returns things in an order that's consistent with how nearly all other tools on the platform display files - sorted but case insensitive. It would be somewhat unnatural and surprising on Windows to group file names that start with an upper case letter first. |
True, but functions like
I think I would indeed prefer the name
I myself wrote such a script (with caching) to compare efficiently tree duplicates! And I can't imagine forgetting to sort the lists explicitly; but I'm using unix, so admitedly I would not expect to have a reliable order. |
But this is precisely the problem: on UNIX you do get a reliable order. So it's easy to not realize that you may not get the same order elsewhere.
That's debatable. If someone really wants native ordering, they can just pass |
Ah OK... but not reproducible, in that if a directory A is copied into B, then the list of files obtained in A won't match that of B, so for getting reliable "hash signature" of a directory tree (so that A and B have the same signature), I need to sort the entries. |
If this were a serious problem that really warranted platform abstraction, then other languages would be sorting rather than following the platform conventions. This isn't just any arbitrary list of strings that get returned, it's a list of file names reflecting file system contents. You're not going to get case collisions on Windows where the Julia sorting conventions for strings are necessary or appropriate - the appropriate sorting function for a list of file names on Windows is one that ignores case.
Does anyone else on this thread actually use windows? What's debatable about following platform conventions? UI toolkits that follow platform look and feel have proven far more popular and useful than ones that attempt to abstract everything into their own opinion about the single correct way to do things regardless of where they're run. |
There’s nothing preventing UI toolkits from following the native sorting. You’ve homed in on the one use case that can be argued either way here. That doesn’t address testing issues, hashing, reproducibility, etc. The argument that other languages don’t abstract this platform behavior so we shouldn’t is pretty thin. Open source scripting languages like Python have traditionally not supported Windows well at all and made decisions like this long before it was a priority. |
Other languages have been around a very long time, and have had ample opportunity to make a change corresponding to this if they had ever deemed it worth doing. Python 2 to Python 3 made far more consequential changes than this one, but this issue didn't end up included as something they wanted to change during that, or any previous, transitions. This isn't a matter of "supporting windows better" - it would be overriding the behavior that's most appropriate and conventional for that platform. I don't think that's a desirable default. |
I guess I would argue that most users might not really care or properly know about platform specific order. On windows, the only case I can think of where the average user might really notice windows order is in the file explorer. But, I don't think that most users would really think of this as "windows order" but more (weird) "windows explorer order". In this sense, people won't expect windows order in Julia. They expect the output to be ordered or not - because the concept of windows and linux order probably does not exists in their heads. However, this PR wasn't really my "idea". I just wanted to contribute and looked for "good first issue" tags and created 2 PRs. BTW: @tkelman I do use windows. |
It's not a weird ordering specific to explorer, it's a case insensitive order for a case insensitive file system, in every program and essentially every programming language that does this operation of listing files. |
The PR seems good to me (with sort by default), but it could be skipped on platforms that actually do give you sorted order (does that for sure then happen for all filesystems?). |
One thing I would like to clarify is that the operating system doesn't have an order. Talking about "windows order" and "linux order" aren't logical, since they don't have a canonical ordering. Even "Window's explorer order" is ambiguous, since I don't know if you're talking about pre or post Windows XP or what your group policy setting is (nor are the rules well documented: specifically, they say that it may change). You could talk about the filesystem listing order, but in many cases, that's effectively the same as unordered. |
Doing Wouldn't hurt to add this optimization, at least on non-Windows systems where |
Bump. I still think this is a good idea and the overhead is negligible and you can opt out of it. Yes, the user can explicitly sort the output of |
It's still not clear to me why readdir needs to give names in the same order on all platforms, when for example Dict iteration order isn't specified even on the same platform at different times. In other words, it's fine to have constructs with this kind of ordering behavior. There also isn't a canonical "correct" sort order. We would be inserting our own opinion about how to sort strings, instead of just doing what other tools on the same platform do. |
Well, I want Dicts to preserve order too 😁 |
Currently, we don't make any guarantees about what the order is. Adding such a guarantee is non-breaking, and therefore we can put this aside for now. |
Given the discussion above, it's probably best to just add the |
@crstnbr, if |
Of course, 🤦. Maybe this should be closed then. |
Reopen? https://mobile.twitter.com/bmarwell/status/1181821143835201536. Preventing even one scientific reproducibility failure like this one seems like a worthwhile reason to sort by default. Compared to the cost of reading the contents of a directory, the cost of sorting the names is never going to matter. The scenario where someone needs the names in native order seems pretty contrived to me and like something it's fine to have people opt into. |
See #24626.