Provide flags for changing encodings #1990

kddnewton · 2023-12-05T20:43:26Z

There is a lot of stuff in this PR, and still more that isn't being done. Here's the general idea, including things that had to change:

Sometimes strings have the UTF-8 encoding regardless of the source file's encoding. This occurs when there's a unicode escape sequence (e.g., \u{80}) whose value resolves to something whose top bit is set. If there are no other escape sequences in the string, then this works and the string is UTF-8. If there are other non-unicode escape sequences, this will raise a mixed encoding error.
If the source files are US-ASCII encoded and you get a byte >= 0x80, then it will change to string to a ASCII-8BIT string when the string is being created.

We add two flags onto StringNode to communicate this information. forced_utf8 and forced_ascii_8bit. These flags indicate that the string part is being forced into that encoding. They are mutually exclusive (they will never show up at the same time).

In order to make comparisons easier, I've changed pm_parser_t to hold onto a reference to a pm_encoding_t instead of the struct itself. This means if someone was previously accessing parser->encoding it would have been a struct, now it's a pointer to a struct. This is a breaking change for the C API.

Previously there was a visible reference to pm_encoding_utf_8, which was an encoding struct. That has been replaced by PM_ENCODING_UTF_8_ENTRY, which is a pointer to an encoding struct but which is calculated at runtime. This is a breaking change for the C API.

I've also changed the encoding test to not test so many encodings by default. Basically each 2-byte encoding adds about a full second to the test suite, which makes it really annoying for quickly testing changes. I'm going to add PRISM_TEST_ALL_ENCODINGS=1 to CI in another PR once I assess how much time that'll add to CI. It might not even be important to test all of these encodings tbh since they're not going to change at all.

This PR does not address symbols, ~~xstrings~~, ~~heredocs~~, or regular expressions. All of those will come after. I'm just putting this PR up to get the ball rolling.

UPDATE: I handled xstrings and heredocs, and also forgot that character literals and list literals also need to be updated so I did those as well. Now we have just symbols and regular expressions.

eregon · 2023-12-06T12:31:21Z

Looks good.

forced_ascii_8bit

I would suggest force_binary (there is Encoding::BINARY, alias of Encoding::ASCII_8BIT).
It reads better and ascii_8bit has confused so many people already (which think it means ASCII, when in fact it means "no idea of the encoding", depends on the context, 0x41 might mean 'A' or might not and be used as e.g. some binary protocol, unknown without context").
"ascii_8bit" is a weird name, ASCII is 7-bits.
FWIW we tried to fix this in CRuby but failed (due to compatibility concerns): https://bugs.ruby-lang.org/issues/18576
But I think for a Prism flag, the improved clarity and less confusion is highly worth it.

Related issue: #1703

kddnewton · 2023-12-06T17:39:03Z

@eregon updated the name to be forced_binary_encoding and forced_utf8_encoding

kddnewton force-pushed the encoding branch 2 times, most recently from eacead4 to 65cc3fd Compare December 5, 2023 20:54

kddnewton force-pushed the encoding branch 2 times, most recently from d2cbce7 to b771960 Compare December 6, 2023 17:26

Provide flags for changing encodings

e838eaf

kddnewton force-pushed the encoding branch from b771960 to e838eaf Compare December 6, 2023 17:30

kddnewton mentioned this pull request Dec 6, 2023

Update string and xstring encodings ruby/ruby#9142

Merged

kddnewton requested a review from jemmaissroff December 6, 2023 17:55

kddnewton merged commit 74d25ac into main Dec 6, 2023

kddnewton deleted the encoding branch December 6, 2023 19:25

seven1m mentioned this pull request Dec 7, 2023

Support magic comment to set encoding natalie-lang/natalie#1563

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Provide flags for changing encodings #1990

Provide flags for changing encodings #1990

kddnewton commented Dec 5, 2023 •

edited

Loading

eregon commented Dec 6, 2023

kddnewton commented Dec 6, 2023

Provide flags for changing encodings #1990

Provide flags for changing encodings #1990

Conversation

kddnewton commented Dec 5, 2023 • edited Loading

eregon commented Dec 6, 2023

kddnewton commented Dec 6, 2023

kddnewton commented Dec 5, 2023 •

edited

Loading