Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix lexing for unicode escape sequences #348

Closed
wants to merge 3 commits into from

Conversation

latkin
Copy link
Contributor

@latkin latkin commented Apr 8, 2015

fixes #338

Changes lexing of unicode escape sequences to match the F# spec (which says things should work the same as C#).

  • For short escape sequences, directly encode the hex value into a char, even if this is not valid by Unicode convention
  • For long escape sequences, validate that the total codepoint is <= 0x0010FFFF
    • If it is, follow same logic as before (which was correct)
    • If it isn't, issue an error (same as C#)

fixes dotnet#338

Changes lexing of unicode escape sequences to match the F# spec (which says things should work the same as C#).

- For short escape sequences, directly encode the hex value into a char
- For long escape sequences, validate that the total codepoint is <= 0x0010FFFF
  - If it is, follow same logic as before (which was correct)
  - If it isn't, issue an error (same as C#)
@@ -124,7 +124,7 @@ let startString args (lexbuf: UnicodeLexing.Lexbuf) =
BYTEARRAY (Lexhelp.stringBufferAsBytes buf)
)
else
STRING (System.Text.Encoding.Unicode.GetString(s,0,s.Length)))
STRING (Lexhelp.stringBufferAsString s))
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In initial testing, I computed the string both ways here and compared, to make sure there were no unexpected inconsistencies.

After running all of the tests, the only differences were cases where the old method was doing the wrong thing. So regression risk should be low.

@@ -51,7 +52,7 @@ val internal digit : char -> int32
val internal hexdigit : char -> int32
val internal unicodeGraphShort : string -> uint16
val internal hexGraphShort : string -> uint16
val internal unicodeGraphLong : string -> uint16 option * uint16
val internal unicodeGraphLong : string -> (uint16 option * uint16) option
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of option of option we might want to create a new union type which represents the semantics better.

@latkin
Copy link
Contributor Author

latkin commented Apr 8, 2015

@forki thanks for the feedback, good suggestions. I'd like to keep the tests where they are, to avoid fragmentation. Using a dedicated DU type makes sense. I'll look into broader change to char buffer form byte buffer, if it's not too invasive I think that also makes sense.

@KevinRansom
Copy link
Member

Looks good ... ship it :-)

@latkin
Copy link
Contributor Author

latkin commented Apr 15, 2015

@forki I've updated to use a dedicated type, but it looked like too much effort to change the buffer design just for this bug

@latkin
Copy link
Contributor Author

latkin commented Apr 15, 2015

/cc @agocke per request

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants