StringInterner: add support for UCN identifiers #838

ehaas · 2025-01-12T23:31:33Z

Closes #823

IMO C should not have added this feature :)

I renamed StringInterner to IdentifierInterner to better reflect how it's meant to be used. I switched from Wyhash to XxHash3 because Wyhash seems to need special handling for short versus long keys, which looked tricky to do without completely decoding the identifier first (and that would require allocations).
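To illustrate the idea of hashing an identifier without first building a decoded copy, here is a minimal Python sketch. It is not the PR's Zig code: the names `code_points`, `ident_hash`, and `ident_eql` are hypothetical, and it swaps in a byte-at-a-time FNV-1a hash in place of XxHash3 purely because FNV-1a is trivial to feed incrementally. It also assumes every UCN in the spelling has already been validated.

```python
from itertools import zip_longest


def code_points(spelling):
    """Lazily yield code points, decoding \\uXXXX and \\UXXXXXXXX on the fly."""
    i = 0
    while i < len(spelling):
        if spelling[i] == "\\" and i + 1 < len(spelling) and spelling[i + 1] in "uU":
            width = 4 if spelling[i + 1] == "u" else 8
            # Assumption: the escape was validated earlier, so the digits are present.
            yield int(spelling[i + 2 : i + 2 + width], 16)
            i += 2 + width
        else:
            yield ord(spelling[i])
            i += 1


def ident_hash(spelling):
    """64-bit FNV-1a over the UTF-8 bytes of the decoded identifier."""
    h = 0xCBF29CE484222325  # FNV-1a 64-bit offset basis
    for cp in code_points(spelling):
        for byte in chr(cp).encode("utf-8"):
            h ^= byte
            h = (h * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF  # FNV prime, wrap to 64 bits
    return h


def ident_eql(a, b):
    """Compare two spellings code point by code point, without decoding either fully."""
    return all(x == y for x, y in zip_longest(code_points(a), code_points(b)))
```

With this shape, `\u4F60\u597D` and `你好` hash and compare as the same identifier, which is the property a hash-map context for the interner would need.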

Overall I'm very happy with the design of the Zig stdlib hashmap API in that adding this was very straightforward and didn't add any complexity beyond the inherent complexity of the problem itself.

TODO:

- UCN identifiers require C99 or later
- Profile performance vs baseline
- Fuzzing
- Handle edge cases instead of panicking
- General code cleanup and review
- Validate UCN in tokenizer?
- Basic character set characters cannot be specified with UCNs
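For the last TODO item, the constraint as worded in C99 through C17 (6.4.3) is that a UCN must not name a code point below 00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the surrogate range D800 through DFFF. A rough Python sketch of that check follows. The function name `ucn_value_is_valid` is hypothetical, the upper bound of 10FFFF is an added assumption, and the check does not cover whether the character is allowed in an identifier at all.

```python
# Code points below 0xA0 that C99-C17 still allow a UCN to name: '$', '@', '`'.
ALLOWED_BELOW_A0 = {0x24, 0x40, 0x60}


def ucn_value_is_valid(cp):
    """Return True if a UCN may designate code point `cp` (C99-C17 6.4.3 wording)."""
    if 0xD800 <= cp <= 0xDFFF:
        return False  # surrogates can never be named by a UCN
    if cp > 0x10FFFF:
        return False  # assumption: reject values beyond the Unicode range
    if cp < 0xA0 and cp not in ALLOWED_BELOW_A0:
        return False  # covers the basic character set, e.g. \u0041 for 'A'
    return True
```

So `\u0041` would be rejected while `\u4F60` passes. Whether an accepted code point may appear in an identifier is a separate lookup that this sketch leaves out.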

Vexu · 2025-01-13T09:41:25Z

Should these be unescaped in the preprocessor? Commenting out the 你好 field in the test prints out the unescaped version in clang & gcc and the UCN identifier with this PR.

$ gcc a.c
a.c: In function ‘bar’:
a.c:251:6: error: ‘struct S’ has no member named ‘你好’
  251 |     s.\u4F60\u597D = x;
      |      ^
a.c:252:13: error: ‘struct S’ has no member named ‘你好’
  252 |     return s.FOO;
      |             ^
$ clang a.c
a.c:251:7: error: no member named '你好' in 'struct S'
  251 |     s.\u4F60\u597D = x;
      |     ~ ^
a.c:252:14: error: no member named '\u4F60\u597D' in 'struct S'
  252 |     return s.FOO;
      |            ~ ^
a.c:247:13: note: expanded from macro 'FOO'
  247 | #define FOO \u4F60 ## \u597D
      |             ^
<scratch space>:4:1: note: expanded from here
    4 | \u4F60\u597D
      | ^
2 errors generated.
$ arocc a.c -Wno-return-type
a.c:251:7: error: no member named '\u4F60\u597D' in 'struct S'
    s.\u4F60\u597D = x;
      ^
a.c:252:14: error: no member named '\u4F60\u597D' in 'struct S'
    return s.FOO;
             ^
a.c:247:13: note: expanded from here
#define FOO \u4F60 ## \u597D
            ^
<scratch space>:1:1: note: expanded from here
\u4F60\u597D
^
2 errors generated.

gcc definitely handles this the best.

> IMO C should not have added this feature :)

Yeah :)

ehaas · 2025-01-13T19:07:09Z

Good point about the preprocessor; the current implementation won't handle them correctly in preprocessor expressions. I'll try setting it up to create unescaped generated tokens.

ehaas · 2025-01-22T06:49:31Z

Updated with a proof-of-concept of unescaping in the preprocessor. I still need to review expansion locations in the generated tokens and fix pasted tokens, but it seems like this approach will work.
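As a rough picture of what unescaping an identifier spelling involves, the Python sketch below rewrites each complete UCN into the character it names, so that diagnostics can print `你好` instead of `\u4F60\u597D`. It is only an illustration, not the PR's Zig implementation; `unescape_identifier` is a hypothetical name, and the sketch assumes the escapes were validated beforehand.

```python
import re

# A complete UCN is \u plus exactly 4 hex digits, or \U plus exactly 8.
_UCN = re.compile(r"\\u([0-9A-Fa-f]{4})|\\U([0-9A-Fa-f]{8})")


def unescape_identifier(spelling):
    """Replace every complete UCN in `spelling` with the character it designates."""
    return _UCN.sub(lambda m: chr(int(m.group(1) or m.group(2), 16)), spelling)
```

Incomplete escapes such as `\u4F` do not match the pattern and are left untouched.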

Closes Vexu#823

ehaas · 2025-02-10T07:42:09Z

This is in better shape now, but I still think it could be improved. Right now I'm validating UCN identifiers in the preprocessor but it should probably be done in the tokenizer, to prevent pasting from creating UCN tokens (e.g. \u4F ## 60 should not produce a UCN token). Also, when I was about 95% done I realized I had implemented the same Unicode escape parsing before, in text_literal.zig, so there's some code duplication that could be removed.
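One possible way to keep `\u4F ## 60` from yielding a UCN token is to check each paste operand on its own before concatenating. The Python sketch below takes that approach. It is an assumption about how such a check could look, not what the PR does, and both function names are hypothetical.

```python
import re

# A complete UCN is \u plus exactly 4 hex digits, or \U plus exactly 8.
_COMPLETE_UCN = re.compile(r"\\u[0-9A-Fa-f]{4}|\\U[0-9A-Fa-f]{8}")


def has_dangling_escape(fragment):
    """True if a backslash remains after removing every complete UCN."""
    return "\\" in _COMPLETE_UCN.sub("", fragment)


def paste_may_form_ucn_identifier(lhs, rhs):
    """Allow the pasted result to be treated as a UCN identifier only if
    each operand was already well formed by itself."""
    return not has_dangling_escape(lhs) and not has_dangling_escape(rhs)
```

Here `\u4F` fails the check by itself, so the pasted spelling `\u4F60` would not be promoted to a UCN identifier, while pasting `\u4F60` with `\u597D` remains fine.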

Vexu · 2025-02-10T20:00:10Z

> but it should probably be done in the tokenizer, to prevent pasting from creating UCN tokens (e.g. \u4F ## 60 should not produce a UCN token).

Why couldn't that be done in the preprocessor?

Merge at will but maybe leave the issue open (or open a new one) if this doesn't solve it adequately in your opinion.

ehaas force-pushed the ucn branch 2 times, most recently from 22c237c to cd82381 on January 12, 2025 23:36

ehaas force-pushed the ucn branch from cd82381 to 3b12e78 on January 22, 2025 06:48

ehaas force-pushed the ucn branch from 3b12e78 to 285b80d on February 9, 2025 20:59

Preprocessor: add support for UCN identifiers

1952342

Closes Vexu#823

ehaas force-pushed the ucn branch from 285b80d to 1952342 on February 10, 2025 04:06

ehaas marked this pull request as ready for review February 10, 2025 07:39
