
Tokenize does not roundtrip {{ after \n #125008

Closed · wyattscarpenter opened this issue Oct 5, 2024 · 7 comments

Labels: 3.12 bugs and security fixes · 3.13 bugs and security fixes · 3.14 new features, bugs and security fixes · topic-parser · type-bug An unexpected behavior, bug, or error

Comments

@wyattscarpenter commented Oct 5, 2024

Bug report

Bug description:

import tokenize, io

source_code = r'''
f"""{80 * '*'}\n{{test}}{{details}}{{test2}}\n{80 * '*'}"""
'''
tokens = tokenize.generate_tokens(io.StringIO(source_code).readline)
# Rebuild the source from 2-tuples (type, string) only:
x = tokenize.untokenize((t, s) for t, s, *_ in tokens)
print(x)

Expected:

f"""{80 *'*'}\n{{test}}{{details}}{{test2}}\n{80 *'*'}"""

Got:

f"""{80 *'*'}\n{test}}{{details}}{{test2}}\n{80 *'*'}"""

Note the missing second { in the {{ that follows the \n; no other position is affected.

Unlike some other roundtrip failures of tokenize, some of which are minor infelicities, this one actually creates a syntactically invalid program on roundtrip, which is quite bad. You get a SyntaxError: f-string: single '}' is not allowed when trying to use the results.
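
For illustration, here is a minimal sketch (assuming an affected interpreter, e.g. 3.12) confirming that the roundtripped text no longer compiles:

import tokenize, io

# Tokenize a small f-string whose {{ follows an escape sequence, then
# rebuild the source from 2-tuples (type, string) only.
source = r'f"\n{{test}}"'
tokens = tokenize.generate_tokens(io.StringIO(source).readline)
bad = tokenize.untokenize((t, s) for t, s, *_ in tokens)
print(bad)  # f"\n{test}}" -- the doubled opening brace lost its escape
compile(bad, "<roundtrip>", "eval")  # SyntaxError: f-string: single '}' is not allowed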

CPython versions tested on:

3.12

Operating systems tested on:

Linux, Windows

Linked PRs

@wyattscarpenter wyattscarpenter added the type-bug An unexpected behavior, bug, or error label Oct 5, 2024
@wyattscarpenter (Author) commented Oct 5, 2024

Furthermore, here is the output of the following code:

import tokenize, io

source_code = r'f"\n{{test}}"'
tokens = tokenize.generate_tokens(io.StringIO(source_code).readline)
for t in tokens:
    print(t)

Output:

TokenInfo(type=61 (FSTRING_START), string='f"', start=(1, 0), end=(1, 2), line='f"\\n{{test}}"')
TokenInfo(type=62 (FSTRING_MIDDLE), string='\\n{', start=(1, 2), end=(1, 5), line='f"\\n{{test}}"')
TokenInfo(type=62 (FSTRING_MIDDLE), string='test}', start=(1, 6), end=(1, 11), line='f"\\n{{test}}"')
TokenInfo(type=63 (FSTRING_END), string='"', start=(1, 12), end=(1, 13), line='f"\\n{{test}}"')
TokenInfo(type=4 (NEWLINE), string='', start=(1, 13), end=(1, 14), line='f"\\n{{test}}"')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')

So, it seems that the line is getting in alright, but the \n{{ is getting turned into a \n{ in the tokenizer somehow.

Same erroneous output for the bytes version (with rb-string, BytesIO and tokenize.tokenize).
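
For completeness, a bytes-mode sketch of the same repro (tokenize.tokenize yields an extra ENCODING token before the f-string tokens):

import tokenize, io

source_code = rb'f"\n{{test}}"'
for t in tokenize.tokenize(io.BytesIO(source_code).readline):
    print(t)  # FSTRING_MIDDLE again comes out as '\\n{' on affected versions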

@AlexWaygood (Member) commented

It looks like this was a regression in Python 3.12; I can't reproduce the behaviour with Python 3.11. I'm guessing it was caused by the PEP-701 changes.

@AlexWaygood AlexWaygood added 3.12 bugs and security fixes 3.13 bugs and security fixes 3.14 new features, bugs and security fixes labels Oct 5, 2024
@AlexWaygood (Member) commented

Reproduced on the main branch as well.

@tomasr8 (Member) commented Oct 5, 2024

This seems to happen with other escape characters as well:

import tokenize, io

source_code = r'f"""\t{{test}}"""'
tokens = tokenize.generate_tokens(io.StringIO(source_code).readline)
x = tokenize.untokenize((t, s) for t, s, *_ in tokens)
print(x)  # f"""\t{test}}"""

source_code = r'f"""\r{{test}}"""'
tokens = tokenize.generate_tokens(io.StringIO(source_code).readline)
x = tokenize.untokenize((t, s) for t, s, *_ in tokens)
print(x)  # f"""\r{test}}"""

@tomasr8 (Member) commented Oct 5, 2024

I think the issue is in this method (cpython/Lib/tokenize.py, lines 187 to 208 at commit 16cd6cc):

def escape_brackets(self, token):
    characters = []
    consume_until_next_bracket = False
    for character in token:
        if character == "}":
            if consume_until_next_bracket:
                consume_until_next_bracket = False
            else:
                characters.append(character)
        if character == "{":
            n_backslashes = sum(
                1 for char in _itertools.takewhile(
                    "\\".__eq__,
                    characters[-2::-1]
                )
            )
            if n_backslashes % 2 == 0:
                characters.append(character)
            else:
                consume_until_next_bracket = True
        characters.append(character)
    return "".join(characters)

This PR fixed the handling of Unicode literals (e.g. \N{foo}), but it seems to only check for the presence of backslashes, without checking whether they are followed by N. This appears to fix that:

            if character == "{":
                n_backslashes = sum(
                    1 for char in _itertools.takewhile(
                        "\\".__eq__,
                        characters[-2::-1]
                    )
                )
-               if n_backslashes % 2 == 0:
+               if n_backslashes % 2 == 0 or characters[-1] != "N":
                    characters.append(character)
                else:
                    consume_until_next_bracket = True
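
To see why the proposed condition works, here is a standalone sketch of just that check; the helper name needs_escaping is illustrative, not part of the patch:

import itertools

def needs_escaping(characters):
    # True if a '{' following `characters` should be doubled on output.
    # Mirrors the proposed condition: leave the brace single only when it
    # opens a \N{...} unicode-name escape, i.e. when it is preceded by
    # 'N' and an odd number of backslashes.
    n_backslashes = sum(
        1 for _ in itertools.takewhile("\\".__eq__, characters[-2::-1])
    )
    return n_backslashes % 2 == 0 or characters[-1] != "N"

print(needs_escaping(list(r"\n")))  # True:  '\n{{' keeps its doubled brace
print(needs_escaping(list(r"\N")))  # False: '\N{name}' keeps a single brace
print(needs_escaping(list("ab")))   # True:  ordinary text, brace is doubled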

miss-islington pushed a commit to miss-islington/cpython that referenced this issue Oct 6, 2024
wyattscarpenter added a commit to wyattscarpenter/mypy that referenced this issue Oct 7, 2024
@wyattscarpenter (Author) commented

Extremely late for me to say this, but I thought I'd add: I only unearthed this bug because I'm working on code that incidentally tries to tokenizer-roundtrip everything in https://github.com/hauntsaninja/mypy_primer. So, doing something like that (possibly literally just that) as a test case for tokenize could perhaps be a good idea to prevent future regressions — although setting that up sounds like a hassle!
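
A rough sketch of that kind of corpus sweep (the directory name is hypothetical; any large tree of Python files would do):

import io, pathlib, tokenize

# Hypothetical regression sweep: roundtrip every .py file in a corpus and
# flag any file whose regenerated source no longer compiles.
for path in pathlib.Path("corpus").rglob("*.py"):
    source = path.read_text(encoding="utf-8")
    try:
        tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
        regenerated = tokenize.untokenize((t, s) for t, s, *_ in tokens)
        compile(regenerated, str(path), "exec")
    except (SyntaxError, tokenize.TokenError) as exc:
        print(f"{path}: roundtrip broke the file: {exc}")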

@tomasr8 (Member) commented Oct 26, 2024

We actually already do that here for some random files in the test folder:

class TestRoundtrip(TestCase):

    def check_roundtrip(self, f):
        """
        Test roundtrip for `untokenize`. `f` is an open file or a string.
        The source code in f is tokenized to both 5- and 2-tuples.
        Both sequences are converted back to source code via
        tokenize.untokenize(), and the latter tokenized again to 2-tuples.
        The test fails if the 3 pair tokenizations do not match.

        When untokenize bugs are fixed, untokenize with 5-tuples should
        reproduce code that does not contain a backslash continuation
        following spaces. A proper test should test this.
        """

Though, so far it only compares the tokens, not the actual source code. I'd like to extend this check to compare the source code as well here: #126010
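
A minimal sketch of what such a source-level check could look like (a hypothetical helper, not the actual PR; untokenize with full 5-tuples is documented to reproduce the input exactly):

import io, tokenize

def check_source_roundtrip(source):
    # Untokenize with full 5-tuples, which preserve positions and hence
    # spacing, then require the regenerated text to match exactly.
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    regenerated = tokenize.untokenize(tokens)
    assert regenerated == source, f"roundtrip changed the source: {regenerated!r}"

check_source_roundtrip('x = 1\nf"""{{ok}}"""\n')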
