
Tokenize does not roundtrip {{ after \n #125008

Closed · wyattscarpenter opened this issue Oct 5, 2024 · 7 comments

Labels: 3.12 bugs and security fixes · 3.13 bugs and security fixes · 3.14 new features, bugs and security fixes · topic-parser · type-bug An unexpected behavior, bug, or error

Comments

@wyattscarpenter commented Oct 5, 2024

Bug report

Bug description:

import tokenize, io

source_code = r'''
f"""{80 * '*'}\n{{test}}{{details}}{{test2}}\n{80 * '*'}"""
'''
tokens = tokenize.generate_tokens(io.StringIO(source_code).readline)
# Rebuild the source from 2-tuples (type, string) only:
x = tokenize.untokenize((t, s) for t, s, *_ in tokens)
print(x)

Expected:

f"""{80 *'*'}\n{{test}}{{details}}{{test2}}\n{80 *'*'}"""

Got:

f"""{80 *'*'}\n{test}}{{details}}{{test2}}\n{80 *'*'}"""

Note the missing second { in the {{ that follows the \n; no other position is affected.

Unlike some other roundtrip failures of tokenize, some of which are minor infelicities, this one actually creates a syntactically invalid program on roundtrip, which is quite bad. You get a SyntaxError: f-string: single '}' is not allowed when trying to use the results.
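
For illustration, here is a minimal sketch (assuming an affected interpreter, e.g. 3.12) confirming that the roundtripped text no longer compiles:

import tokenize, io

# Tokenize a small f-string whose {{ follows an escape sequence, then
# rebuild the source from 2-tuples (type, string) only.
source = r'f"\n{{test}}"'
tokens = tokenize.generate_tokens(io.StringIO(source).readline)
bad = tokenize.untokenize((t, s) for t, s, *_ in tokens)
print(bad)  # f"\n{test}}" -- the doubled opening brace lost its escape
compile(bad, "<roundtrip>", "eval")  # SyntaxError: f-string: single '}' is not allowed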

CPython versions tested on:

3.12

Operating systems tested on:

Linux, Windows

Linked PRs

@wyattscarpenter wyattscarpenter added the type-bug An unexpected behavior, bug, or error label Oct 5, 2024
@wyattscarpenter (Author) commented Oct 5, 2024

Furthermore, here is the output of the following code:

import tokenize, io

source_code = r'f"\n{{test}}"'
tokens = tokenize.generate_tokens(io.StringIO(source_code).readline)
for t in tokens:
    print(t)

Output:

TokenInfo(type=61 (FSTRING_START), string='f"', start=(1, 0), end=(1, 2), line='f"\\n{{test}}"')
TokenInfo(type=62 (FSTRING_MIDDLE), string='\\n{', start=(1, 2), end=(1, 5), line='f"\\n{{test}}"')
TokenInfo(type=62 (FSTRING_MIDDLE), string='test}', start=(1, 6), end=(1, 11), line='f"\\n{{test}}"')
TokenInfo(type=63 (FSTRING_END), string='"', start=(1, 12), end=(1, 13), line='f"\\n{{test}}"')
TokenInfo(type=4 (NEWLINE), string='', start=(1, 13), end=(1, 14), line='f"\\n{{test}}"')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')

So, it seems that the line is getting in alright, but the \n{{ is getting turned into a \n{ in the tokenizer somehow.

Same erroneous output for the bytes version (with rb-string, BytesIO and tokenize.tokenize).
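
For completeness, a bytes-mode sketch of the same repro (tokenize.tokenize yields an extra ENCODING token before the f-string tokens):

import tokenize, io

source_code = rb'f"\n{{test}}"'
for t in tokenize.tokenize(io.BytesIO(source_code).readline):
    print(t)  # FSTRING_MIDDLE again comes out as '\\n{' on affected versions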

@AlexWaygood (Member) commented

It looks like this was a regression in Python 3.12; I can't reproduce the behaviour with Python 3.11. I'm guessing it was caused by the PEP-701 changes.

@AlexWaygood AlexWaygood added 3.12 bugs and security fixes 3.13 bugs and security fixes 3.14 new features, bugs and security fixes labels Oct 5, 2024
@AlexWaygood (Member) commented

Reproduced on the main branch as well.

@tomasr8 (Member) commented Oct 5, 2024

This seems to happen with other escape characters as well:

import tokenize, io

source_code = r'f"""\t{{test}}"""'
tokens = tokenize.generate_tokens(io.StringIO(source_code).readline)
x = tokenize.untokenize((t, s) for t, s, *_ in tokens)
print(x)  # f"""\t{test}}"""

source_code = r'f"""\r{{test}}"""'
tokens = tokenize.generate_tokens(io.StringIO(source_code).readline)
x = tokenize.untokenize((t, s) for t, s, *_ in tokens)
print(x)  # f"""\r{test}}"""

@tomasr8 (Member) commented Oct 5, 2024

I think the issue is in this method (cpython/Lib/tokenize.py, lines 187 to 208 at commit 16cd6cc):

def escape_brackets(self, token):
    characters = []
    consume_until_next_bracket = False
    for character in token:
        if character == "}":
            if consume_until_next_bracket:
                consume_until_next_bracket = False
            else:
                characters.append(character)
        if character == "{":
            n_backslashes = sum(
                1 for char in _itertools.takewhile(
                    "\\".__eq__,
                    characters[-2::-1]
                )
            )
            if n_backslashes % 2 == 0:
                characters.append(character)
            else:
                consume_until_next_bracket = True
        characters.append(character)
    return "".join(characters)

This PR fixed the handling of Unicode literals (e.g. \N{foo}), but it seems to only check for the presence of backslashes, without checking whether they are followed by N. This appears to fix that:

            if character == "{":
                n_backslashes = sum(
                    1 for char in _itertools.takewhile(
                        "\\".__eq__,
                        characters[-2::-1]
                    )
                )
-               if n_backslashes % 2 == 0:
+               if n_backslashes % 2 == 0 or characters[-1] != "N":
                    characters.append(character)
                else:
                    consume_until_next_bracket = True
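
To see why the proposed condition works, here is a standalone sketch of just that check; the helper name needs_escaping is illustrative, not part of the patch:

import itertools

def needs_escaping(characters):
    # True if a '{' following `characters` should be doubled on output.
    # Mirrors the proposed condition: leave the brace single only when it
    # opens a \N{...} unicode-name escape, i.e. when it is preceded by
    # 'N' and an odd number of backslashes.
    n_backslashes = sum(
        1 for _ in itertools.takewhile("\\".__eq__, characters[-2::-1])
    )
    return n_backslashes % 2 == 0 or characters[-1] != "N"

print(needs_escaping(list(r"\n")))  # True:  '\n{{' keeps its doubled brace
print(needs_escaping(list(r"\N")))  # False: '\N{name}' keeps a single brace
print(needs_escaping(list("ab")))   # True:  ordinary text, brace is doubled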

miss-islington pushed a commit to miss-islington/cpython that referenced this issue Oct 6, 2024
wyattscarpenter added a commit to wyattscarpenter/mypy that referenced this issue Oct 7, 2024
@wyattscarpenter (Author) commented

Extremely late for me to say this, but I thought I'd add: I only unearthed this bug because I'm working on code that incidentally tries to tokenizer-roundtrip everything in https://github.com/hauntsaninja/mypy_primer. So, doing something like that (possibly literally just that) as a test case for tokenize could perhaps be a good idea to prevent future regressions — although setting that up sounds like a hassle!
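
A rough sketch of that kind of corpus sweep (the directory name is hypothetical; any large tree of Python files would do):

import io, pathlib, tokenize

# Hypothetical regression sweep: roundtrip every .py file in a corpus and
# flag any file whose regenerated source no longer compiles.
for path in pathlib.Path("corpus").rglob("*.py"):
    source = path.read_text(encoding="utf-8")
    try:
        tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
        regenerated = tokenize.untokenize((t, s) for t, s, *_ in tokens)
        compile(regenerated, str(path), "exec")
    except (SyntaxError, tokenize.TokenError) as exc:
        print(f"{path}: roundtrip broke the file: {exc}")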

@tomasr8 (Member) commented Oct 26, 2024

We actually already do that here for some random files in the test folder:

class TestRoundtrip(TestCase):

    def check_roundtrip(self, f):
        """
        Test roundtrip for `untokenize`. `f` is an open file or a string.
        The source code in f is tokenized to both 5- and 2-tuples.
        Both sequences are converted back to source code via
        tokenize.untokenize(), and the latter tokenized again to 2-tuples.
        The test fails if the 3 pair tokenizations do not match.

        When untokenize bugs are fixed, untokenize with 5-tuples should
        reproduce code that does not contain a backslash continuation
        following spaces. A proper test should test this.
        """

Though, so far it only compares the tokens, not the actual source code. I'd like to extend this check to compare the source code as well here: #126010
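
A minimal sketch of what such a source-level check could look like (a hypothetical helper, not the actual PR; untokenize with full 5-tuples is documented to reproduce the input exactly):

import io, tokenize

def check_source_roundtrip(source):
    # Untokenize with full 5-tuples, which preserve positions and hence
    # spacing, then require the regenerated text to match exactly.
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    regenerated = tokenize.untokenize(tokens)
    assert regenerated == source, f"roundtrip changed the source: {regenerated!r}"

check_source_roundtrip('x = 1\nf"""{{ok}}"""\n')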
