gh-125553: Fix backslash continuation in `untokenize` #126010

tomasr8 · 2024-10-26T15:28:47Z

This change correctly inserts whitespace before a backslash + newline in most cases.
The exception is when the backslash is on its own line, which makes the untokenization ambiguous. For example:

a
  b
    c
  \
  c

However, these should be pretty rare. In fact, I ran the tokenize -> untokenize round-trip check on the whole repo (excluding files which fail to tokenize in the first place) and all of the files produce the same output.
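The round-trip check described above could be sketched roughly as follows. This is a minimal illustration, not the script that was actually run: `roundtrip_status` is a hypothetical helper name, and it skips details such as BOM handling.

```python
import io
import tokenize


def roundtrip_status(source: bytes) -> str:
    """Classify one file's bytes as 'same', 'different' or 'untokenizable'.

    Hypothetical helper for illustration only.
    """
    try:
        tokens = list(tokenize.tokenize(io.BytesIO(source).readline))
    except (tokenize.TokenError, SyntaxError):
        # Mirrors excluding files which fail to tokenize in the first place.
        return "untokenizable"
    # The stream starts with an ENCODING token, so untokenize() returns bytes.
    rebuilt = tokenize.untokenize(tokens)
    return "same" if rebuilt == source else "different"


print(roundtrip_status(b"x = 1\n"))
print(roundtrip_status(b"x = (\n"))
```

To cover a whole checkout, one could call this on `path.read_bytes()` for every `*.py` file and collect the paths reported as `different`.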

Issue: untokenize() does not round-trip for code containing line breaks (\ + \n) #125553

Lib/test/test_tokenize.py

bedevere-bot · 2024-10-27T00:48:21Z

🤖 New build scheduled with the buildbot fleet by @pablogsal for commit bf56e7d 🤖

If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.

Lib/tokenize.py

pablogsal · 2024-10-27T00:55:32Z

This fails tokenization of the stdlib:

    ======================================================================
    FAIL: test_random_files (test.test_tokenize.TestRoundtrip.test_random_files) (file='/home/buildbot/buildarea/pull_request.cstratak-fedora-stable-aarch64.clang-installed/build/target/lib/python3.14/test/test_traceback.py')
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/home/buildbot/buildarea/pull_request.cstratak-fedora-stable-aarch64.clang-installed/build/target/lib/python3.14/test/test_tokenize.py", line 2014, in test_random_files
        self.check_roundtrip(f)
        ~~~~~~~~~~~~~~~~~~~~^^^
      File "/home/buildbot/buildarea/pull_request.cstratak-fedora-stable-aarch64.clang-installed/build/target/lib/python3.14/test/test_tokenize.py", line 1863, in check_roundtrip
        self.assertEqual(code_without_bom, tokenize.untokenize(tokenize.tokenize(readline)))
        ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    AssertionError: b'"""[34950 chars]         \\\n            )  # test\n          [156429 chars]()\n' != b'"""[34950 chars]     \\\n            )  # test\n            )\[156425 chars]()\n'
    ----------------------------------------------------------------------

Normally test_random_files just runs a subset of them. You need to run with -uall to run all the tests.

pablogsal

See my last comment

bedevere-app · 2024-10-27T00:55:54Z

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

And if you don't make the requested changes, you will be put in the comfy chair!

tomasr8 · 2024-10-27T10:40:26Z

> This fails tokenization of the stdlib:
>
> [...]
>
> Normally test_random_files just runs a subset of them. You need to run with -uall to run all the tests.

Right, I actually explicitly excluded test_traceback.py when testing this change locally because it contains the one case which we currently cannot correctly untokenize:

cpython/Lib/test/test_traceback.py

Lines 856 to 867 in dad3453

    
    def f_with_binary_operator():
        b = 1
        c = ""
        a = (
            (b  # test +
                )  \
            # +
        << (c  # test
            \
        )  # test
        )
        return a

Line 864 has a backslash on a separate line. This is problematic because the tokenizer does not generate any tokens for that line so there is no way to know the indent of the backslash and the untokenization is ambiguous.
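To illustrate the point about missing tokens, here is a small sketch that records which source rows are covered by at least one token. The helper name `rows_with_tokens` and the three-line sample are made up for this example; the sample only mimics the shape of the `test_traceback.py` snippet.

```python
import io
import tokenize


def rows_with_tokens(source: str) -> set[int]:
    """Return the 1-based row numbers covered by at least one token."""
    rows = set()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        rows.update(range(tok.start[0], tok.end[0] + 1))
    return rows


# Row 2 contains only whitespace and a backslash.
source = "a = (1\n    \\\n)\n"
print(sorted(rows_with_tokens(source)))
```

Row 2 does not show up in the result, so nothing in the token stream records how far the backslash was indented.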

We could perhaps skip this file in the test? ~~Or add a fail assertion?~~ Or just compare the tokens?

pablogsal · 2024-10-28T23:13:18Z

> Line 864 has a backslash on a separate line. This is problematic because the tokenizer does not generate any tokens for that line so there is no way to know the indent of the backslash and the untokenization is ambiguous.

But it is currently untokenizing correctly, no? I am a bit concerned that this is going to introduce some changes to round-trip behavior that people are relying on. We also cannot just skip the file or the test; that would be concerning because we would be missing coverage.

tomasr8 · 2024-10-28T23:33:03Z

> But it is currently untokenizing correctly, no?

It's not: both main and this PR insert less whitespace than they should. The behaviour is incorrect but does not change.

The reason the test_traceback.py file is not passing now is that I made the round-trip test more strict. Previously it only checked that the tokens are the same; now it checks the untokenized source code exactly. For that one failing test, we could simply keep comparing just the tokens for now as we already do.

As a side note, with the current untokenizer, when I compare the diff of test_traceback.py there are 86 changed lines, with this PR there's only one (that is breaking the test 😄 )
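One way to produce such a count is sketched below, assuming a "changed line" means an original line that does not survive the round trip verbatim. This is only an illustration; `changed_line_count` is a hypothetical helper and not necessarily how the figures above were obtained.

```python
import difflib
import io
import tokenize


def changed_line_count(source: str) -> int:
    """Count original lines that differ after tokenize -> untokenize."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    rebuilt = tokenize.untokenize(tokens)
    diff = difflib.ndiff(
        source.splitlines(keepends=True),
        rebuilt.splitlines(keepends=True),
    )
    # In ndiff output, "- " marks a line present only in the original.
    return sum(1 for line in diff if line.startswith("- "))


print(changed_line_count("x = 1\n"))
```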

pablogsal · 2024-10-28T23:47:46Z

Ah thanks for explaining it in detail. Ok that makes sense and looks much better 🙂

Now we just need a slightly elegant way to use token-based comparison only on that particular file.

Maybe we can identify if the line falls into that case and then skip the source comparison? Otherwise maybe we can just use tokens for that file and keep a list of "known less strict files".

tomasr8 · 2024-10-29T12:35:32Z

No problem 🙂

> Now we just need a slightly elegant way to use token-based comparison only on that particular file.

I actually added a new boolean parameter compare_tokens_only to the round-trip test which reverts to the old behaviour of only comparing tokens. I updated the tests to use this for test_traceback.py.
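The difference between the two comparison modes can be sketched like this. It is a simplified stand-in, not the code in `Lib/test/test_tokenize.py`: `roundtrip_ok` and `_tokens` are hypothetical helpers, and only the parameter name `compare_tokens_only` is taken from the change.

```python
import io
import tokenize


def _tokens(source: str) -> list:
    return list(tokenize.generate_tokens(io.StringIO(source).readline))


def roundtrip_ok(source: str, compare_tokens_only: bool = False) -> bool:
    """Check a tokenize -> untokenize round trip in one of two modes."""
    rebuilt = tokenize.untokenize(_tokens(source))
    if compare_tokens_only:
        # Lenient mode: ignore positions and whitespace, compare only
        # token types and token strings.
        before = [(t.type, t.string) for t in _tokens(source)]
        after = [(t.type, t.string) for t in _tokens(rebuilt)]
        return before == after
    # Strict mode: the rebuilt source must match character for character.
    return rebuilt == source


print(roundtrip_ok("x = 1\n"))
print(roundtrip_ok("a = (1\n    \\\n)\n", compare_tokens_only=True))
```

The lenient mode still passes for a lone backslash line, because the lost indentation never appears in any token.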

Though, detecting this automatically rather than having a list of files might be relatively simple. We'd just need to check whether the source code contains a line with just a backslash and potentially some whitespace. I'll look into that later today :)

tomasr8 · 2024-10-29T21:42:49Z

I tried adding an automated check for the standalone backslash but it actually produces false positives for multiline strings. For example:

s = r"""\
pass
        \
pass
"""

This is not great because we would mistakenly turn off strict checking for this file. I would say keeping a list of "known less strict files" is preferable to this?
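A sketch of such a check and its false positive is shown below. The regex and the name `has_lone_backslash_line` are assumptions made for illustration and are not necessarily what the test ended up using.

```python
import re

# A line consisting only of optional indentation and a backslash.
_LONE_BACKSLASH = re.compile(r"^[ \t]*\\\r?\n", re.MULTILINE)


def has_lone_backslash_line(source: str) -> bool:
    """Return True if some line holds nothing but a backslash."""
    return _LONE_BACKSLASH.search(source) is not None


ambiguous = "a = (1\n    \\\n)\n"
ordinary = "x = 1 + \\\n    2\n"
# The backslash on the third line is inside a string literal, not a
# continuation, yet the text-based check cannot tell the difference.
in_string = 's = r"""\\\npass\n        \\\npass\n"""\n'

print(has_lone_backslash_line(ambiguous))
print(has_lone_backslash_line(ordinary))
print(has_lone_backslash_line(in_string))
```

Because the check works on raw text rather than tokens, it flags the string literal as well, which is the false positive described above.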

FWIW, test_traceback.py is the only file in the repo with the weird backslash and I don't suppose the list will grow any time soon..

pablogsal · 2024-10-29T23:02:42Z

> FWIW, test_traceback.py is the only file in the repo with the weird backslash and I don't suppose the list will grow any time soon..

It may. If someone suddenly adds backslashes to some random file, tests will start to fail for no reason, and that will be very frustrating to core devs and contributors, so I do think that is a big concern. In fact, it is a deal breaker to check for this in test_random_files. So we either deactivate this new check when testing all files, or we find a way we like to deactivate the check if there are backslashes.

tomasr8 · 2024-10-30T13:05:55Z

> It may. If someone suddenly adds backslashes to some random file, tests will start to fail for no reason, and that will be very frustrating to core devs and contributors, so I do think that is a big concern.

I totally agree that this would not be ideal. In that case, we can use the automated check I suggested before. The upside is that this will never randomly fail if someone adds a backslash to a file.

The downside is that the check has false positives so we might accidentally disable the more strict check for a file (there are currently two files like that). Though I think this is still overall an improvement since the vast majority of files will be compared exactly.

If you think it's acceptable I'll revert the last change which added the explicit list.

pablogsal · 2024-10-30T13:24:37Z

> > It may. If someone suddenly adds backslashes to some random file, tests will start to fail for no reason, and that will be very frustrating to core devs and contributors, so I do think that is a big concern.
>
> I totally agree that this would not be ideal. In that case, we can use the automated check I suggested before. The upside is that this will never randomly fail if someone adds a backslash to a file.
>
> The downside is that the check has false positives so we might accidentally disable the more strict check for a file (there are currently two files like that). Though I think this is still overall an improvement since the vast majority of files will be compared exactly.
>
> If you think it's acceptable I'll revert the last change which added the explicit list.

Yeah, unfortunately I don't think we have a choice, since having files randomly fail when someone changes valid code is a no-go. So any improvement that doesn't have this problem, even if suboptimal, is our only option.

tomasr8 · 2024-10-30T20:39:19Z

Changes made, if you'd like to have another look :)

the magic phrase:
I have made the requested changes; please review again

tomasr8 · 2024-10-30T20:41:33Z

Ok one more time, since the last one didn't work:

I have made the requested changes; please review again

bedevere-app · 2024-10-30T20:41:38Z

Thanks for making the requested changes!

@pablogsal: please review the changes made to this pull request.

tomasr8 · 2024-11-18T13:47:39Z

friendly reminder @pablogsal 🙂

pablogsal · 2025-01-21T19:28:53Z

Closing and opening to retrigger CI

This reverts commit eb2e6f2.

miss-islington-app · 2025-01-21T20:12:21Z

Thanks @tomasr8 for the PR, and @pablogsal for merging it 🌮🎉.. I'm working now to backport this PR to: 3.13.
🐍🍒⛏🤖 I'm not a witch! I'm not a witch!

miss-islington-app · 2025-01-21T20:12:22Z

Thanks @tomasr8 for the PR, and @pablogsal for merging it 🌮🎉.. I'm working now to backport this PR to: 3.12.
🐍🍒⛏🤖

miss-islington-app · 2025-01-21T20:12:30Z

Sorry, @tomasr8 and @pablogsal, I could not cleanly backport this to 3.12 due to a conflict.
Please backport using cherry_picker on command line.

cherry_picker 7ad793e5dbdf07e51a71b70d20f3e6e3ab60244d 3.12

bedevere-app · 2025-01-21T20:12:33Z

GH-129153 is a backport of this pull request to the 3.13 branch.

…) (#129153) gh-125553: Fix backslash continuation in `untokenize` (GH-126010) (cherry picked from commit 7ad793e) Co-authored-by: Tomas R <[email protected]>

tomasr8 requested review from pablogsal and lysnikolaou as code owners October 26, 2024 15:28

bedevere-app bot added the awaiting review label Oct 26, 2024

bedevere-app bot mentioned this pull request Oct 26, 2024

untokenize() does not round-trip for code containing line breaks (\ + \n) #125553

Closed

tomasr8 mentioned this pull request Oct 26, 2024

Tokenize does not roundtrip {{ after \n #125008

Closed

pablogsal added the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Oct 27, 2024

bedevere-bot removed the 🔨 test-with-buildbots Test PR w/ buildbots; report in status section label Oct 27, 2024

pablogsal reviewed Oct 27, 2024

pablogsal requested changes Oct 27, 2024

bedevere-app bot added awaiting changes and removed awaiting review labels Oct 27, 2024

bedevere-app bot added awaiting change review and removed awaiting changes labels Oct 30, 2024

bedevere-app bot requested a review from pablogsal October 30, 2024 20:41

pablogsal closed this Jan 21, 2025

pablogsal reopened this Jan 21, 2025

tomasr8 added 8 commits January 21, 2025 19:33

Fix backslash continuation in untokenize

3bc07b8

Add news entry

cc2fb5e

Fix Windows

ca62935

Be more lenient with test_traceback

a595dde

Check if a file can be compared exactly

6f6a688

Simplify regex

497067a

Use a list for ambiguous files

4b32c8e

Revert "Use a list for ambiguous files"

e2c9bb7

This reverts commit eb2e6f2.

pablogsal force-pushed the tokenize-backslash branch from 3a25da8 to e2c9bb7 Compare January 21, 2025 19:33

pablogsal removed the awaiting change review label Jan 21, 2025

pablogsal enabled auto-merge (squash) January 21, 2025 19:43

pablogsal merged commit 7ad793e into python:main Jan 21, 2025
39 of 40 checks passed

pablogsal added the "needs backport to 3.12" (bug and security fixes) and "needs backport to 3.13" (bugs and security fixes) labels Jan 21, 2025

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Jan 21, 2025

pythongh-125553: Fix backslash continuation in untokenize (pythonGH…

dd6c757

…-126010) (cherry picked from commit 7ad793e) Co-authored-by: Tomas R. <[email protected]>

miss-islington-app bot assigned pablogsal Jan 21, 2025

bedevere-app bot removed the needs backport to 3.13 bugs and security fixes label Jan 21, 2025

tomasr8 deleted the tokenize-backslash branch January 21, 2025 22:06
