gh-125553: Fix backslash continuation in untokenize
#126010
Conversation
🤖 New build scheduled with the buildbot fleet by @pablogsal for commit bf56e7d 🤖 If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again.
This fails tokenization of the stdlib:
Normally
See my last comment
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase: `I have made the requested changes; please review again`. And if you don't make the requested changes, you will be put in the comfy chair!
Right, I actually explicitly excluded `Lib/test/test_traceback.py` (lines 856 to 867 in dad3453).
Line 864 has a backslash on a separate line. This is problematic because the tokenizer does not generate any tokens for that line, so there is no way to know the indent of the backslash and the untokenization is ambiguous. We could perhaps skip this file in the test?
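The ambiguity can be made concrete with a token-level round trip. The sketch below is my own illustration, not code from the PR; it compares token streams rather than raw source precisely because the whitespace around a standalone backslash cannot be recovered from the tokens:

```python
import io
import tokenize

def roundtrip_tokens_equal(source):
    # Tokenize, untokenize, re-tokenize, then compare the two token
    # streams by (type, string) only, ignoring exact positions.
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    regenerated = tokenize.untokenize(tokens)
    tokens2 = list(tokenize.generate_tokens(io.StringIO(regenerated).readline))
    return [(t.type, t.string) for t in tokens] == [(t.type, t.string) for t in tokens2]

# The middle line is just a continuation backslash; the tokenizer emits no
# token for it, so untokenize cannot know how it was indented.  The token
# streams still match even though the exact whitespace may not.
print(roundtrip_tokens_equal("x = 1 + \\\n\\\n2\n"))  # True
```

Because no token carries the lone backslash's column, only this weaker token-level comparison is guaranteed to hold for such files.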
But currently it is un-tokenizing correctly, no? I am a bit concerned that this is going to introduce some changes to roundtrip behavior that people are relying on. We cannot just skip the file or the test either; that would be concerning because we would be missing coverage.
It's not - both main and this PR insert less whitespace than they should. The behaviour is incorrect but does not change. The reason the
As a side note, with the current untokenizer, when I compare the diff of
Ah thanks for explaining it in detail. Ok that makes sense and looks much better 🙂 Now we just need a slightly elegant way to use token-based comparison only on that particular file. Maybe we can identify if the line falls into that case and then skip the source comparison? Otherwise maybe we can just use tokens for that file and keep a list of "known less strict files"
No problem 🙂
I actually added a new boolean parameter
Though, detecting this automatically rather than having a list of files might be relatively simple. We'd just need to check whether the source code contains a line with just a backslash and potentially some whitespace. I'll look into that later today :)
I tried adding an automated check for the standalone backslash but it actually produces false positives for multiline strings. For example:

```python
s = r"""\
pass
\
pass
"""
```

This is not great because we would mistakenly turn off strict checking for this file. I would say keeping a list of "known less strict files" is preferable to this? FWIW,
It may; if someone suddenly adds backslashes to some random file, tests will start to fail for no reason, and that will be very frustrating to core devs and contributors, so I do think that is a big concern. In fact, it is a deal breaker to check for this in
I totally agree that this would not be ideal. In that case, we can use the automated check I suggested before. The upside is that this will never randomly fail if someone adds a backslash to a file. The downside is that the check has false positives so we might accidentally disable the more strict check for a file (there are currently two files like that). Though I think this is still overall an improvement since the vast majority of files will be compared exactly. If you think it's acceptable I'll revert the last change which added the explicit list.
Yeah, unfortunately I don't think we have a choice since having files randomly fail if someone changes valid code is a no-go. So any improvement even if suboptimal that doesn't have this problem is our only option |
Changes made, if you'd like to have another look :) the magic phrase: I have made the requested changes; please review again
Ok one more time, since the last one didn't work: I have made the requested changes; please review again |
Thanks for making the requested changes! @pablogsal: please review the changes made to this pull request. |
friendly reminder @pablogsal 🙂 |
Closing and opening to retrigger CI |
This reverts commit eb2e6f2.
Force-pushed from 3a25da8 to e2c9bb7
Thanks @tomasr8 for the PR, and @pablogsal for merging it 🌮🎉. I'm working now to backport this PR to: 3.13.
Thanks @tomasr8 for the PR, and @pablogsal for merging it 🌮🎉. I'm working now to backport this PR to: 3.12.
…-126010) (cherry picked from commit 7ad793e) Co-authored-by: Tomas R. <[email protected]>
Sorry, @tomasr8 and @pablogsal, I could not cleanly backport this to
GH-129153 is a backport of this pull request to the 3.13 branch. |
…) (#129153) gh-125553: Fix backslash continuation in `untokenize` (GH-126010) (cherry picked from commit 7ad793e) Co-authored-by: Tomas R <[email protected]>
This change correctly inserts whitespace before backslash + newline in most cases.
The exception is cases where the backslash is on its own line which makes the untokenization ambiguous. For example:
However, these should be pretty rare. In fact, I ran the tokenize -> untokenize round-trip check on the whole repo (excluding files which fail to tokenize in the first place) and all of the files produce the same output.
```
\
+\n
```

#125553
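The repo-wide round-trip check described in this PR could be sketched along the following lines. This is my reconstruction, not the actual script or the final test from the PR; `check_file`, the regex heuristic, and the exact exception list are illustrative assumptions:

```python
import io
import re
import tokenize

def check_file(path):
    # Round-trip a file through tokenize -> untokenize.  Compare the exact
    # source by default, but fall back to a token-level comparison when the
    # file contains a line that is just a backslash, where untokenize
    # cannot reproduce the original whitespace.
    source = path.read_text(encoding="utf-8")
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    regenerated = tokenize.untokenize(tokens)
    lenient = any(re.fullmatch(r"\s*\\", line) for line in source.splitlines())
    if lenient:
        readline = io.StringIO(regenerated).readline
        pairs = [(t.type, t.string) for t in tokenize.generate_tokens(readline)]
        return pairs == [(t.type, t.string) for t in tokens]
    return regenerated == source
```

Files that fail to tokenize in the first place (e.g. raising `tokenize.TokenError`) would simply be skipped, matching the exclusion mentioned in the description above.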