Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

StringInterner: add support for UCN identifiers #838

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

ehaas
Copy link
Collaborator

@ehaas ehaas commented Jan 12, 2025

Closes #823

IMO C should not have added this feature :)

I renamed StringInterner to IdentifierInterner to better reflect how it's meant to be used. I switched to XxHash3 instead of Wyhash because it seems like Wyhash requires more care / special handling around short vs long keys in a way that seemed tricky to handle without completely decoding the identifier first (which would require allocations).

Overall I'm very happy with the design of the Zig stdlib hashmap API in that adding this was very straightforward and didn't add any complexity beyond the inherent complexity of the problem itself.

TODO:

  • UCN identifiers require C99 or later
  • Profile performance vs baseline
  • Fuzzing
  • Handle edge cases instead of panicking
  • General code cleanup and review
  • Validate UCN in tokenizer?
  • Basic character set characters cannot be specified with UCNs

@ehaas ehaas force-pushed the ucn branch 2 times, most recently from 22c237c to cd82381 Compare January 12, 2025 23:36
@Vexu
Copy link
Owner

Vexu commented Jan 13, 2025

Should these be unescaped in the preprocessor? Commenting out the 你好 field in the test prints out the unescaped version in clang & gcc and the UCN identifier with this PR.

$ gcc a.c
a.c: In function ‘bar’:
a.c:251:6: error: ‘struct S’ has no member named ‘你好’
  251 |     s.\u4F60\u597D = x;
      |      ^
a.c:252:13: error: ‘struct S’ has no member named ‘你好’
  252 |     return s.FOO;
      |             ^
$ clang a.c
a.c:251:7: error: no member named '你好' in 'struct S'
  251 |     s.\u4F60\u597D = x;
      |     ~ ^
a.c:252:14: error: no member named '\u4F60\u597D' in 'struct S'
  252 |     return s.FOO;
      |            ~ ^
a.c:247:13: note: expanded from macro 'FOO'
  247 | #define FOO \u4F60 ## \u597D
      |             ^
<scratch space>:4:1: note: expanded from here
    4 | \u4F60\u597D
      | ^
2 errors generated.
$ arocc a.c -Wno-return-type
a.c:251:7: error: no member named '\u4F60\u597D' in 'struct S'
    s.\u4F60\u597D = x;
      ^
a.c:252:14: error: no member named '\u4F60\u597D' in 'struct S'
    return s.FOO;
             ^
a.c:247:13: note: expanded from here
#define FOO \u4F60 ## \u597D
            ^
<scratch space>:1:1: note: expanded from here
\u4F60\u597D
^
2 errors generated.

gcc definitely handles this the best.

IMO C should not have added this feature :)

Yeah :)

@ehaas
Copy link
Collaborator Author

ehaas commented Jan 13, 2025

Good point about the preprocessor; the current implementation won't handle them correctly in preprocessor expressions. I'll try setting it up to create unescaped generated tokens

@ehaas
Copy link
Collaborator Author

ehaas commented Jan 22, 2025

Updated with a proof-of-concept of unescaping in the preprocessor. Still need to review expansion locations in the generated tokens and fix pasted tokens but it seems like this approach will work.

@ehaas ehaas marked this pull request as ready for review February 10, 2025 07:39
@ehaas
Copy link
Collaborator Author

ehaas commented Feb 10, 2025

This is in better shape now but I still think it could be improved - right now I'm validating UCN identifiers in the preprocessor but it should probably be done in the tokenizer, to prevent pasting from creating UCN tokens (e.g. \u4F ## 60 should not produce a UCN token). Also when I was about 95% done I realized I had implemented the same unicode escape parsing before, in text_literal.zig so there's some code duplication that could be removed.

@Vexu
Copy link
Owner

Vexu commented Feb 10, 2025

but it should probably be done in the tokenizer, to prevent pasting from creating UCN tokens (e.g. \u4F ## 60 should not produce a UCN token).

Why couldn't that be done in the preprocessor?

Merge at will but maybe leave the issue open (or open a new one) if this doesn't solve it adequately in your opinion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support UCN identifiers
2 participants