Provide flags for changing encodings #1990
Merged
There is a lot of stuff in this PR, and still more that isn't being done. Here's the general idea, including things that had to change:
- A Unicode escape sequence (e.g. `\u{80}`) whose value resolves to something whose top bit is set. If there are no other escape sequences in the string, then this works and the string is UTF-8. If there are other non-Unicode escape sequences, this will raise a mixed encoding error.
- We add two flags onto `StringNode` to communicate this information: `forced_utf8` and `forced_ascii_8bit`. These flags indicate that the string part is being forced into that encoding. They are mutually exclusive (they will never show up at the same time).
- In order to make comparisons easier, I've changed `pm_parser_t` to hold onto a reference to a `pm_encoding_t` instead of the struct itself. This means that anyone who was previously accessing `parser->encoding` got a struct, and now gets a pointer to a struct. This is a breaking change for the C API.
- Previously there was a visible reference to `pm_encoding_utf_8`, which was an encoding struct. That has been replaced by `PM_ENCODING_UTF_8_ENTRY`, which is a pointer to an encoding struct but which is calculated at runtime. This is a breaking change for the C API.
- I've also changed the encoding test to not test so many encodings by default. Each 2-byte encoding adds about a full second to the test suite, which makes it really annoying for quickly testing changes. I'm going to add `PRISM_TEST_ALL_ENCODINGS=1` to CI in another PR once I assess how much time that'll add to CI. It might not even be important to test all of these encodings tbh, since they're not going to change at all.
- This PR does not address symbols, xstrings, heredocs, or regular expressions. All of those will come after. I'm just putting this PR up to get the ball rolling.

UPDATE: I handled xstrings and heredocs. I had also forgotten that character literals and list literals need to be updated, so I did those as well. That leaves just symbols and regular expressions.
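To make the flag rules concrete, here is a minimal sketch in C. Everything in it is an assumption for illustration: the `sketch_` names, the flag bit values, and the error sentinel are hypothetical and are not taken from prism's headers. It models only the case described above (a Unicode escape whose value has its top bit set) and follows that wording literally; the actual check in the parser may be narrower.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for illustration; prism defines its own values. */
enum {
    SKETCH_FORCED_UTF8 = 1 << 0,
    SKETCH_FORCED_ASCII_8BIT = 1 << 1
};

/* Hypothetical sentinel standing in for "raise a mixed encoding error". */
#define SKETCH_MIXED_ENCODING_ERROR (-1)

/* True when the value falls outside the 7-bit ASCII range. */
static bool sketch_top_bit_set(uint32_t value)
{
    return value >= 0x80;
}

/*
 * Flags for a string part that contains one Unicode escape with the given
 * code point. `other_escapes` says whether the string also contains
 * non-Unicode escape sequences.
 */
static int sketch_flags_for_unicode_escape(uint32_t code_point, bool other_escapes)
{
    if (!sketch_top_bit_set(code_point)) return 0;          /* nothing is forced */
    if (other_escapes) return SKETCH_MIXED_ENCODING_ERROR;  /* mixed encodings */
    return SKETCH_FORCED_UTF8;
}

/* The two flags are mutually exclusive, so both bits set is never valid. */
static bool sketch_flags_valid(int flags)
{
    return !((flags & SKETCH_FORCED_UTF8) && (flags & SKETCH_FORCED_ASCII_8BIT));
}
```

A consumer of the flags could use a check like `sketch_flags_valid` as a debug assertion, since the description says the two flags never appear together.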
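For C API users, the struct-to-pointer change might look roughly like the sketch below. The types here are hypothetical stand-ins rather than prism's real definitions: `sketch_encoding_t`, both parser structs, the `name` and `multibyte` fields, the table, and the `SKETCH_ENCODING_UTF_8_ENTRY` macro are all assumptions made for illustration. The point is only that an embedded struct has to be compared by its contents, while a pointer into a shared table can be compared by identity.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for an encoding struct. */
typedef struct {
    const char *name;
    bool multibyte;
} sketch_encoding_t;

/* Hypothetical shared table of encodings. */
static const sketch_encoding_t sketch_encodings[] = {
    { "UTF-8", true },
    { "ASCII-8BIT", false }
};

/* Assumed shape of an "entry" macro: a pointer into the shared table. */
#define SKETCH_ENCODING_UTF_8_ENTRY (&sketch_encodings[0])

/* Before: the parser embeds a copy of the encoding struct. */
typedef struct {
    sketch_encoding_t encoding;
} sketch_parser_old_t;

/* After: the parser holds a pointer to an encoding struct. */
typedef struct {
    const sketch_encoding_t *encoding;
} sketch_parser_new_t;

/* Old style: member access with `.`, comparison by contents. */
static bool sketch_old_is_utf8(const sketch_parser_old_t *parser)
{
    return strcmp(parser->encoding.name, "UTF-8") == 0;
}

/* New style: member access with `->`, comparison by pointer identity. */
static bool sketch_new_is_utf8(const sketch_parser_new_t *parser)
{
    return parser->encoding == SKETCH_ENCODING_UTF_8_ENTRY;
}
```

Under these assumptions, migrating a caller means switching `parser->encoding.field` to `parser->encoding->field` and replacing any reference to the old struct with the entry pointer.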