-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve NormalizedFilePath #453
Conversation
|
||
We always store UTF-8 encoded file path in 'NormalizedFilePath'. This function first convert 'OsPath' | ||
to 'FilePath' using current system encoding (see 'decodeFS'), then convert 'FilePath' to an UTF-8 | ||
encoded 'ShortByteString'. Due to the encoding conversion, this function can fail. But DO NOTE THAT | ||
encoding mismatch doesn't always mean an exception will be thrown. [Possibly your encoding simply won't | ||
throw exception on failure](https://hackage.haskell.org/package/base-4.17.0.0/docs/src/GHC.IO.Encoding.html#initFileSystemEncoding). | ||
Possibly the conversion function can't find any invalid byte sequence, giving a sucessful but wrong result. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need OsPath
at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we not store an OsPath
directly in NormalizedFilePath
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Storing OsPath
in NormalizedFilePath
will make the conversion functions from/to FilePath
having a MonadThrow
constraint. Then we need to modify plenty of code in HLS to adapt it. And we can not avoid encoding/decoding completely before we adopt OsPath
everywhere in HLS, so using OsPath
has no significant benefit.
Since conversion functions to/from OsPath
in lsp-types
is provided now, we can make the migration to OsPath
in HLS gradually when GHC 9.6 is released (or maybe when 9.6 becomes the oldest version of GHC we support? then we can avoid plenty of CPP)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Storing OsPath in NormalizedFilePath will make the conversion functions from/to FilePath having a MonadThrow constraint.
I don't understand, why can't you just write something like the below that instantiates the MonadThrow
constraint and immediately fails:
fromNormalizedFilePath (NormalizedFilePath _ osPath) = either (error . show) id (OsPath.decodeWith localEncoding osPath)
I don't see how this can ever error
assuming that decoding is the inverse of encoding.
And we can not avoid encoding/decoding completely before we adopt OsPath everywhere in HLS, so using OsPath has no significant benefit.
Maybe it would help clarify what is the adoption plan for HLS. Are you saying that this is an intermediate state? Then lay out the plan in the comment so that we can ensure it gets executed when the time comes!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why can't you just write something like the below that instantiates the
MonadThrow
constraint and immediately fails:
The idea is to avoid partial functions, and report errors properly.
I don't see how this can ever error assuming that decoding is the inverse of encoding.
Yes, but toNormalizedFilePath
can fail if a given Char
is not representable in the system encoding.
Then lay out the plan in the comment so that we can ensure it gets executed when the time comes!
Sure.
And note that OsPath
is system-dependent. On Windows, it uses UTF16, which consumes more memory. On Unix, it depends on the system locale. Using OsPath
inside NormalizedFilePath
will lead to significantly different memory footprints on different systems.
In summary, OsPath
doesn't provide enough benefits compared to UTF-8
encoded ShortByteString
now for these reasons:
- can't avoid encoding/decoding, just like
ShortByteString
- consume more memory on Windows and other non-UTF-8 systems.
- incurs partial functions
1 and 3 are solvable when OsPath
becomes mainstream. 2 is worth discussing when we decide to use OsPath
. So, I'm really not so sure about whether this is an intermediate situation or not. It will be more clear in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In summary, OsPath doesn't provide enough benefits compared to UTF-8 encoded ShortByteString now for these reasons.
Thanks, that clarifies the current choices a little bit for me. So this change is not about migrating to OsPath
, but to a compact encoding.
Then I would vouch to simplify and use just the text
library. To encode, use Text.pack
and then extract the Array
from inside, and to decode wrap in a Text
constructor and call. Or simplify and just store a Text
value.
This is compact only when text-2.0
is used, and semi-compact in earlier versions of the text library. I think that's fine, still a massive improvement over String
s.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drive-by comments :)
7b38c20
to
4a009b3
Compare
4a009b3
to
77a4340
Compare
encodeFilePath = BS.toShort . T.encodeUtf8 . T.pack | ||
-- | Convert 'FilePath' to a UTF-8 encoded 'ShortByteString' | ||
encodeFilePath :: FilePath -> ShortByteString | ||
encodeFilePath = BS.pack . UTF8.encode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My local benchmark results show that this takes more time than BS.toShort . T.encodeUtf8 . T.pack
in most cases. Quite surprising. Waiting for benchmark results on CI. haskell/haskell-language-server#3067
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, maybe not that surprising given that text
is likely much more optimised than utf8-string
.
The best option here would be to leverage the fact that text-2.0
is internally isomorphic to a ShortByteString
plus a pair of ints for slicing, and encoded in UTF-8. So you could just do:
#if min_version_text(2,0)
import Data.ByteString.Short
import qualified Data.Text as Text
import Data.Text.Array(Array(..))
import Data.Text.Internal (Text(Text))
...
encodeFilePath :: FilePath -> ShortByteString
encodeFilePath fp = case Text.pack fp of
Text (ByteArray sbs) _ _ -> SBS sbs
decodeFilePath :: ShortByteString -> FilePath
decodeFilePath (SBS fp) = Text.unpack $ Text (ByteArray fp) 0 (length fp)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks everyone for the great discussion 👍
* improve NormalizedFilePath * add adoption plan * try Text * finalize
No description provided.