Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make require case-sensitive on all platforms #13542

Merged
merged 1 commit into from
Oct 13, 2015

Conversation

malmaud
Copy link
Contributor

@malmaud malmaud commented Oct 11, 2015

Resolves #5789.

@tkelman
Copy link
Contributor

tkelman commented Oct 11, 2015

I'd expect readdir to be really slow here.

@malmaud
Copy link
Contributor Author

malmaud commented Oct 11, 2015

I don't think there's any other reliable solution. I wonder how much of a concern it is in practice - the check can be disabled on linux, which is the OS where it's most likely there will remote volumes where readdir's performance could be particularly problematic.

If the folders are local, then I'm sceptical the cost of ~20 readdir's on package src directories is meaningful compared the total runtime of importing a package.

@malmaud
Copy link
Contributor Author

malmaud commented Oct 11, 2015

@simonster suggested using realpath, which will work on Linux on OS X. Still not sure there is a great solution for windows - the best I could find is http://stackoverflow.com/questions/4763117/how-can-i-obtain-the-case-sensitive-path-on-windows, but that doesn't seem great since it will randomly fail if the underlying volume has short path names disabled.

@malmaud
Copy link
Contributor Author

malmaud commented Oct 12, 2015

Simon correctly points out that the realpath solution won't work with symbolic links.

@simonster
Copy link
Member

But you could still use it as long as islink returns false, and fall back to the slow path if islink returns true (shouldn't be very common).

@malmaud
Copy link
Contributor Author

malmaud commented Oct 12, 2015

OK, the only case where the pr falls back to listing a directory is if the file is a symbolic link on OS X. I think I have a working solution for Windows - at least it works on my Windows 7 x64 box.

https://developer.apple.com/library/mac/documentation/Darwin/Reference/ManPages/man2/getattrlist.2.html might help with OS X.

@osx_only function isfile_casesensitive(path)
isfile(path) || return false
islink(path) && return isfile_casesensitive_slow(path)
realpath(path) == path
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless you can find a better solution, I think this should check basename(realpath(path)) == basename(path) in case path is not a symlink but is in a symlinked directory

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

@StefanKarpinski
Copy link
Member

Excellent – thanks for tackling this.

@tkelman
Copy link
Contributor

tkelman commented Oct 12, 2015

needs a test

@malmaud
Copy link
Contributor Author

malmaud commented Oct 12, 2015

Good news - I've got working C code for getting the canonical capitalization in OS X. Just a matter of wrapping it in Julia now.

@malmaud
Copy link
Contributor Author

malmaud commented Oct 12, 2015

OK, all the functionality is here. I'll clean it up and tests and then it should be g2g.

@@ -2,6 +2,52 @@

# Base.require is the implementation for the `import` statement

# Generic case-sensitive version of isfile
function isfile_casesensitive_slow(path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like it would be cleaner to do

if OS_NAME == :Windows
   function isfile_casesensitive(path)
       ...
   end
elseif OS_NAME == :Linux
    ...
else
   ...
end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed

@stevengj
Copy link
Member

Also fixes #9007.

@stevengj
Copy link
Member

See also how Python 2.7 does it.

@malmaud
Copy link
Contributor Author

malmaud commented Oct 12, 2015

Looks like it's walking the directory tree.

@stevengj
Copy link
Member

Yes, looks like it. Of course, they had to support OS/2 and other ancient systems, so maybe the things you are doing were not available at some point.

buf_size *= 2
continue
end
canon_path = bytestring(pointer(buf, 13))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't you just do canon_path = UTF8String(buf) in order to avoid making a copy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The path is stored at an offset from the beginning of the buffer, so I had to use a somewhat inelegant workaround.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see.

@stevengj
Copy link
Member

I'm not sure this is handling Unicode normalization properly.

Suppose you have a file named "ô". There are two logically equivalent ways to encode this string in Unicode, given by normalize_string("ô", :NFC) (= one codepoint U+00f4) and normalize_string("ô", :NFD) (= two codepoints U+006f and U+0302).

On MacOS, isfile returns true for either encoding. Internally, HFS+ converts all names to NFD, and presumably this is what getattrlist returns. That means that isfile_casesensitive(normalize_string("ô", :NFC)) will fail even though the file exists. Solution: call nfd(path) before comparing, where nfd(s::AbstractString) = isascii(s) ? bytestring(s) : normalize_string(s, :NFD).

On Windows, I'm not 100% sure of the situation. The documentation says There is no need to perform any Unicode normalization on path and file name strings for use by the Windows file I/O API functions because the file system treats path and file names as an opaque sequence of WCHARs. This makes me think that normalize_string("ô", :NFC) and normalize_string("ô", :NFD) are distinct filenames on Windows, but it would be good to verify this (i.e. just try creating two files with these names in the same directory and see whether they co-exist).

(On Linux and probably most other Unix-like systems, the NFC and NFD versions are distinct filenames.)

@johansigfrids
Copy link

Yes, I can confirm normalize_string("ô", :NFC) and normalize_string("ô", :NFD) are distinct files on Windows. Windows file system is Unicode (and UTF-16) mostly by convention, in practice its is just a Vector{UInt16} for which a best effort is made to display as UTF-16.

@malmaud
Copy link
Contributor Author

malmaud commented Oct 12, 2015

Great catch @stevengj .

@nalimilan
Copy link
Member

And what happens on OS X when a different filesystem from HFS+ is used? Should we even care about that scenario?

@malmaud
Copy link
Contributor Author

malmaud commented Oct 12, 2015

You're right, that's a real concern. If a file is stored on a mounted non-hfs+ directory, has a unicode character, and isn't NFD-normalized, it would be invisible to require. Maybe we should stat the file to get the filesystem its mounted on and only do the utf normalization if its on hfs+.

@malmaud
Copy link
Contributor Author

malmaud commented Oct 12, 2015

The same issue applies to linux: What if the file happens to be on a case-insensitive filesystem? Then loading.jl:9 is wrong. Although at least it's not as serious, since it would be no more wrong than the current behavior of require.

@malmaud
Copy link
Contributor Author

malmaud commented Oct 13, 2015

Alright, I think I worked around the OS X non-HFS+ concern wrt utf normalization.

if normed_path == path
false
else
isfile_casesensitive(normed_path)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just:

isascii(path_basename) && return false # shortcut for common case of ASCII paths
normed_basename = normalize_string(path_basename, :NFD)
return normed_basename.data == casepreserved_basename

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed.

@stevengj
Copy link
Member

LGTM once Travis is green and the commits are squashed.

@@ -10,3 +10,17 @@ thefname = "the fname!//\\&\0\1*"
@test include_string("Base.source_path()", thefname) == Base.source_path()
@test basename(@__FILE__) == "loading.jl"
@test isabspath(@__FILE__)

let true_filename = "cAsEtEsT.jl", lowered_filename="casetest.jl"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a comment here:

# Issue #5789 and PR #13542:

@malmaud malmaud force-pushed the jmm/case_sensitive_loading branch from d61579f to b78f560 Compare October 13, 2015 04:06
@malmaud
Copy link
Contributor Author

malmaud commented Oct 13, 2015

OK, squashed and tests are green.

stevengj added a commit that referenced this pull request Oct 13, 2015
Make require case-sensitive on all platforms
@stevengj stevengj merged commit b6c0d95 into master Oct 13, 2015
@stevengj stevengj deleted the jmm/case_sensitive_loading branch October 13, 2015 14:52
@stevengj
Copy link
Member

Thanks, @malmaud, for pushing this through!

@malmaud
Copy link
Contributor Author

malmaud commented Oct 13, 2015

Thanks for the reviewing!
On Tue, Oct 13, 2015 at 10:53 AM Steven G. Johnson [email protected]
wrote:

Thanks, @malmaud https://github.com/malmaud, for pushing this through!


Reply to this email directly or view it on GitHub
#13542 (comment).

@stevengj
Copy link
Member

Can you submit a quick PR for realpath before we forget?

# GetLongPathName Win32 function returns the case-preserved filename on NTFS.
function isfile_casesensitive(path)
isfile(path) || return false # Fail fast
longpath(path) == path
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly this should be basename(longpath(path)) == basename(path)? Two reasons:

  • On MacOS, we only check that the basename matches, so this is inconsistent.
  • The intent of this was that import Foo would check for Foo in a case-sensitive way. But this does more than that, and checks the whole path. This means that users LOAD_PATH and HOME variables becomes case-sensitive, for example, which is unexpected on case-insensitive systems, and has led to at least one confused user on discourse.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants