Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve NormalizedFilePath #453

Merged
merged 4 commits into from
Sep 4, 2022
Merged

Conversation

kokobd
Copy link
Collaborator

@kokobd kokobd commented Sep 2, 2022

No description provided.

Comment on lines 27 to 36

We always store UTF-8 encoded file path in 'NormalizedFilePath'. This function first convert 'OsPath'
to 'FilePath' using current system encoding (see 'decodeFS'), then convert 'FilePath' to an UTF-8
encoded 'ShortByteString'. Due to the encoding conversion, this function can fail. But DO NOTE THAT
encoding mismatch doesn't always mean an exception will be thrown. [Possibly your encoding simply won't
throw exception on failure](https://hackage.haskell.org/package/base-4.17.0.0/docs/src/GHC.IO.Encoding.html#initFileSystemEncoding).
Possibly the conversion function can't find any invalid byte sequence, giving a sucessful but wrong result.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need OsPath at all?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we not store an OsPath directly in NormalizedFilePath?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Storing OsPath in NormalizedFilePath will make the conversion functions from/to FilePath having a MonadThrow constraint. Then we need to modify plenty of code in HLS to adapt it. And we can not avoid encoding/decoding completely before we adopt OsPath everywhere in HLS, so using OsPath has no significant benefit.

Since conversion functions to/from OsPath in lsp-types is provided now, we can make the migration to OsPath in HLS gradually when GHC 9.6 is released (or maybe when 9.6 becomes the oldest version of GHC we support? then we can avoid plenty of CPP)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Storing OsPath in NormalizedFilePath will make the conversion functions from/to FilePath having a MonadThrow constraint.

I don't understand, why can't you just write something like the below that instantiates the MonadThrow constraint and immediately fails:

fromNormalizedFilePath (NormalizedFilePath _ osPath) = either (error . show) id (OsPath.decodeWith localEncoding osPath)

I don't see how this can ever error assuming that decoding is the inverse of encoding.

And we can not avoid encoding/decoding completely before we adopt OsPath everywhere in HLS, so using OsPath has no significant benefit.

Maybe it would help clarify what is the adoption plan for HLS. Are you saying that this is an intermediate state? Then lay out the plan in the comment so that we can ensure it gets executed when the time comes!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't you just write something like the below that instantiates the MonadThrow constraint and immediately fails:

The idea is to avoid partial functions, and report errors properly.

I don't see how this can ever error assuming that decoding is the inverse of encoding.

Yes, but toNormalizedFilePath can fail if a given Char is not representable in the system encoding.

Then lay out the plan in the comment so that we can ensure it gets executed when the time comes!

Sure.

And note that OsPath is system-dependent. On Windows, it uses UTF16, which consumes more memory. On Unix, it depends on the system locale. Using OsPath inside NormalizedFilePath will lead to significantly different memory footprints on different systems.

In summary, OsPath doesn't provide enough benefits compared to UTF-8 encoded ShortByteString now for these reasons:

  1. can't avoid encoding/decoding, just like ShortByteString
  2. consume more memory on Windows and other non-UTF-8 systems.
  3. incurs partial functions

1 and 3 are solvable when OsPath becomes mainstream. 2 is worth discussing when we decide to use OsPath. So, I'm really not so sure about whether this is an intermediate situation or not. It will be more clear in the future.

Copy link
Collaborator

@pepeiborra pepeiborra Sep 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In summary, OsPath doesn't provide enough benefits compared to UTF-8 encoded ShortByteString now for these reasons.

Thanks, that clarifies the current choices a little bit for me. So this change is not about migrating to OsPath, but to a compact encoding.

Then I would vouch to simplify and use just the text library. To encode, use Text.pack and then extract the Array from inside, and to decode wrap in a Text constructor and call. Or simplify and just store a Text value.

This is compact only when text-2.0 is used, and semi-compact in earlier versions of the text library. I think that's fine, still a massive improvement over Strings.

Copy link
Collaborator Author

@kokobd kokobd Sep 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

image

Interesting, the results are quite close. (SBS stands for ShortByteString). This is using `text` prior to 2.0 because some dependencies don't support the latest `text`, including stylish-haskell and floskell.

Copy link
Collaborator

@fendor fendor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drive-by comments :)

lsp-types/src/Language/LSP/Types/Uri/OsPath.hs Outdated Show resolved Hide resolved
lsp-types/src/Language/LSP/Types/Uri/OsPath.hs Outdated Show resolved Hide resolved
@kokobd kokobd force-pushed the improve-os-path-doc branch from 7b38c20 to 4a009b3 Compare September 2, 2022 08:36
@kokobd kokobd changed the title add more docs to NormalizedFilePath Improve NormalizedFilePath Sep 2, 2022
@kokobd kokobd force-pushed the improve-os-path-doc branch from 4a009b3 to 77a4340 Compare September 2, 2022 08:44
encodeFilePath = BS.toShort . T.encodeUtf8 . T.pack
-- | Convert 'FilePath' to a UTF-8 encoded 'ShortByteString'
encodeFilePath :: FilePath -> ShortByteString
encodeFilePath = BS.pack . UTF8.encode
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My local benchmark results show that this takes more time than BS.toShort . T.encodeUtf8 . T.pack in most cases. Quite surprising. Waiting for benchmark results on CI. haskell/haskell-language-server#3067

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, maybe not that surprising given that text is likely much more optimised than utf8-string.

The best option here would be to leverage the fact that text-2.0 is internally isomorphic to a ShortByteString plus a pair of ints for slicing, and encoded in UTF-8. So you could just do:

#if min_version_text(2,0)
import Data.ByteString.Short
import qualified Data.Text as Text
import Data.Text.Array(Array(..))
import Data.Text.Internal (Text(Text))
...

encodeFilePath :: FilePath -> ShortByteString 
encodeFilePath fp = case Text.pack fp of
  Text (ByteArray sbs) _ _ -> SBS sbs

decodeFilePath :: ShortByteString -> FilePath
decodeFilePath (SBS fp) = Text.unpack $ Text (ByteArray fp) 0 (length fp) 

@kokobd kokobd added merge me and removed merge me labels Sep 4, 2022
Copy link
Collaborator

@michaelpj michaelpj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks everyone for the great discussion 👍

@kokobd kokobd added the merge me label Sep 4, 2022
@mergify mergify bot merged commit b0f8596 into haskell:master Sep 4, 2022
thomasjm pushed a commit to codedownio/lsp that referenced this pull request Nov 3, 2022
* improve NormalizedFilePath

* add adoption plan

* try Text

* finalize
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants