Fix lexing for unicode escape sequences #348

latkin · 2015-04-08T01:52:16Z

fixes #338

Changes lexing of unicode escape sequences to match the F# spec (which says things should work the same as C#).

- For short escape sequences, directly encode the hex value into a char, even if this is not valid by Unicode convention
- For long escape sequences, validate that the total codepoint is <= 0x0010FFFF
  - If it is, follow same logic as before (which was correct)
  - If it isn't, issue an error (same as C#)
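
The rules above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `parseHex`, `decodeShortEscape` and `decodeLongEscape` are hypothetical names, and the surrogate-pair arithmetic is the standard UTF-16 encoding, assumed to correspond to the "same logic as before".

```fsharp
// Hypothetical helpers for illustration; not the lexer code changed in this PR.
let parseHex (digits: string) : uint32 =
    System.Convert.ToUInt32(digits, 16)

/// Short form \uXXXX: the 16-bit value becomes the char as-is,
/// even when it is a lone surrogate such as 0xD800.
let decodeShortEscape (digits: string) : char =
    char (parseHex digits)

/// Long form \UXXXXXXXX: reject anything above 0x0010FFFF, otherwise
/// emit one UTF-16 code unit or a surrogate pair.
let decodeLongEscape (digits: string) : Result<char list, string> =
    let cp = parseHex digits
    if cp > 0x0010FFFFu then
        Error (sprintf "\\U%s is not a valid Unicode code point" digits)
    elif cp < 0x10000u then
        Ok [ char cp ]
    else
        // Standard UTF-16 encoding of a supplementary code point.
        let v = cp - 0x10000u
        let high = 0xD800u + (v >>> 10)
        let low = 0xDC00u + (v &&& 0x3FFu)
        Ok [ char high; char low ]
```

For example, `decodeLongEscape "0001F600"` yields the pair 0xD83D, 0xDE00, while `decodeLongEscape "00110000"` yields an `Error`.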


latkin · 2015-04-08T01:55:00Z

src/fsharp/lex.fsl

@@ -124,7 +124,7 @@ let startString args (lexbuf: UnicodeLexing.Lexbuf) =
                            BYTEARRAY (Lexhelp.stringBufferAsBytes buf)
                        )
                     else
-                        STRING (System.Text.Encoding.Unicode.GetString(s,0,s.Length)))  
+                        STRING (Lexhelp.stringBufferAsString s))  


In initial testing, I computed the string both ways here and compared, to make sure there were no unexpected inconsistencies.

After running all of the tests, the only differences were cases where the old method was doing the wrong thing. So regression risk should be low.
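
A comparison of this kind might look like the sketch below. `bytesToStringDirect` is a hypothetical stand-in for a direct conversion (the actual `Lexhelp.stringBufferAsString` is not shown in this thread), and it assumes the buffer holds little-endian UTF-16 code units. The sample input is a lone high surrogate, which `Encoding.Unicode` does not round-trip: by default it substitutes a replacement character for invalid sequences.

```fsharp
open System.Text

// Hypothetical direct conversion: each little-endian byte pair becomes one char,
// with no validation of surrogates.
let bytesToStringDirect (bytes: byte[]) : string =
    if bytes.Length % 2 <> 0 then invalidArg "bytes" "expected an even number of bytes"
    let chars =
        Array.init (bytes.Length / 2) (fun i ->
            char ((int bytes.[2 * i + 1] <<< 8) ||| int bytes.[2 * i]))
    System.String(chars)

// The old approach from the diff: decode through the UTF-16 encoding.
let bytesToStringViaEncoding (bytes: byte[]) : string =
    Encoding.Unicode.GetString(bytes, 0, bytes.Length)

// "A" followed by a lone high surrogate 0xD800.
let sample = [| 0x41uy; 0x00uy; 0x00uy; 0xD8uy |]
let direct = bytesToStringDirect sample
let viaEncoding = bytesToStringViaEncoding sample

// Report whether the two methods agree on this input.
printfn "same result: %b" (direct = viaEncoding)
```

On well-formed input the two functions should agree; they are expected to diverge only on inputs such as the lone surrogate above.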

forki · 2015-04-08T07:42:35Z

src/fsharp/lexhelp.fsi

@@ -51,7 +52,7 @@ val internal digit : char -> int32
 val internal hexdigit : char -> int32
 val internal unicodeGraphShort : string -> uint16
 val internal hexGraphShort : string -> uint16
-val internal unicodeGraphLong : string -> uint16 option * uint16
+val internal unicodeGraphLong : string -> (uint16 option * uint16) option


Instead of option of option we might want to create a new union type which represents the semantics better.
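
As a rough sketch of that suggestion, a union type could name the three outcomes explicitly. The type and case names here are illustrative assumptions, and the reading of the nested option (outer `None` means invalid, inner `Some` carries the high surrogate) is inferred from the signature rather than confirmed by the diff.

```fsharp
// Illustrative names only; the type actually used in the PR is not shown in this thread.
type LongEscapeResult =
    | SurrogatePair of uint16 * uint16   // code point above 0xFFFF: high and low surrogate
    | SingleChar of uint16               // code point fits in one UTF-16 code unit
    | Invalid                            // code point above 0x0010FFFF

/// Map the nested-option shape onto the union (assumed interpretation).
let ofNestedOption (r: (uint16 option * uint16) option) : LongEscapeResult =
    match r with
    | Some (Some hi, lo) -> SurrogatePair (hi, lo)
    | Some (None, c) -> SingleChar c
    | None -> Invalid

/// Callers then match on named cases instead of unpacking options.
let describe (r: LongEscapeResult) : string =
    match r with
    | SurrogatePair (hi, lo) -> sprintf "pair 0x%04X 0x%04X" hi lo
    | SingleChar c -> sprintf "single 0x%04X" c
    | Invalid -> "invalid"
```

With a union like this, pattern matches at the call site document each case by name, and the compiler can warn when a case is left unhandled.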

latkin · 2015-04-08T16:10:49Z

@forki thanks for the feedback, good suggestions. I'd like to keep the tests where they are, to avoid fragmentation. Using a dedicated DU type makes sense. I'll look into the broader change from a byte buffer to a char buffer; if it's not too invasive, I think that also makes sense.

KevinRansom · 2015-04-08T21:10:31Z

Looks good ... ship it :-)

latkin · 2015-04-15T02:09:13Z

@forki I've updated to use a dedicated type, but it looked like too much effort to change the buffer design just for this bug.

latkin · 2015-04-15T02:10:13Z

/cc @agocke per request

msftclas added the cla-not-required label Apr 8, 2015

latkin reviewed Apr 8, 2015

forki mentioned this pull request Apr 8, 2015

Throw error if unicode is in reserved space #339

Closed

forki reviewed Apr 8, 2015

Use dedicated type for lex result

55b811d

Use func, not lazy

9b68fc5

latkin closed this in 52e6e03 Apr 15, 2015

latkin added the fixed label Apr 15, 2015

dipenpdev mentioned this pull request Jul 30, 2018

Invalid chars in string treated as equal in C# and unequal in F# #5371

Closed
