Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide flags for changing encodings #1990

Merged
merged 1 commit into from
Dec 6, 2023
Merged

Provide flags for changing encodings #1990

merged 1 commit into from
Dec 6, 2023

Conversation

kddnewton
Copy link
Collaborator

@kddnewton kddnewton commented Dec 5, 2023

There is a lot of stuff in this PR, and still more that isn't being done. Here's the general idea, including things that had to change:

  • Sometimes strings have the UTF-8 encoding regardless of the source file's encoding. This occurs when there's a unicode escape sequence (e.g., \u{80}) whose value resolves to something whose top bit is set. If there are no other escape sequences in the string, then this works and the string is UTF-8. If there are other non-unicode escape sequences, this will raise a mixed encoding error.
  • If the source files are US-ASCII encoded and you get a byte >= 0x80, then it will change to string to a ASCII-8BIT string when the string is being created.

We add two flags onto StringNode to communicate this information. forced_utf8 and forced_ascii_8bit. These flags indicate that the string part is being forced into that encoding. They are mutually exclusive (they will never show up at the same time).

In order to make comparisons easier, I've changed pm_parser_t to hold onto a reference to a pm_encoding_t instead of the struct itself. This means if someone was previously accessing parser->encoding it would have been a struct, now it's a pointer to a struct. This is a breaking change for the C API.

Previously there was a visible reference to pm_encoding_utf_8, which was an encoding struct. That has been replaced by PM_ENCODING_UTF_8_ENTRY, which is a pointer to an encoding struct but which is calculated at runtime. This is a breaking change for the C API.

I've also changed the encoding test to not test so many encodings by default. Basically each 2-byte encoding adds about a full second to the test suite, which makes it really annoying for quickly testing changes. I'm going to add PRISM_TEST_ALL_ENCODINGS=1 to CI in another PR once I assess how much time that'll add to CI. It might not even be important to test all of these encodings tbh since they're not going to change at all.

This PR does not address symbols, xstrings, heredocs, or regular expressions. All of those will come after. I'm just putting this PR up to get the ball rolling.

UPDATE: I handled xstrings and heredocs, and also forgot that character literals and list literals also need to be updated so I did those as well. Now we have just symbols and regular expressions.

@kddnewton kddnewton force-pushed the encoding branch 2 times, most recently from eacead4 to 65cc3fd Compare December 5, 2023 20:54
@eregon
Copy link
Member

eregon commented Dec 6, 2023

Looks good.

forced_ascii_8bit

I would suggest force_binary (there is Encoding::BINARY, alias of Encoding::ASCII_8BIT).
It reads better and ascii_8bit has confused so many people already (which think it means ASCII, when in fact it means "no idea of the encoding", depends on the context, 0x41 might mean 'A' or might not and be used as e.g. some binary protocol, unknown without context").
"ascii_8bit" is a weird name, ASCII is 7-bits.
FWIW we tried to fix this in CRuby but failed (due to compatibility concerns): https://bugs.ruby-lang.org/issues/18576
But I think for a Prism flag, the improved clarity and less confusion is highly worth it.

Related issue: #1703

@kddnewton kddnewton force-pushed the encoding branch 2 times, most recently from d2cbce7 to b771960 Compare December 6, 2023 17:26
@kddnewton
Copy link
Collaborator Author

@eregon updated the name to be forced_binary_encoding and forced_utf8_encoding

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants