Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add a checker for misleading unicode #5311

Merged
merged 10 commits into from
Jan 13, 2022

Conversation

CarliJoy
Copy link
Contributor

Type of Changes

Type
✨ New feature
📜 Docs

Description

Unicode and some other ASCII characters can be used to create programs that run
much different as a human reader would expect it.

PEP 672 lists some examples.

The added checkers are intended to make users aware of these issues.

Some of the test might be too strict for users programming in other languages as
english. As we assume this is minority than can disable the given tests, this should
be not an issue.

Note that the codes try to be prepared for UTF-16 and UTF-32 even so Python files
can't be encoded in them yet.
There is a checker that would detect UTF-16/UTF-32 files and fail, so even the usage
of an old pylint version would not cause a "security hole" when used with new Python
version that allow these encodings.

Closes #5281

@CarliJoy
Copy link
Contributor Author

@Pierre-Sassoulas here my suggestion. Sure some work / decision have to put into them.
Also I wasn't sure about how to use the function tests in opposite the unit tests.

@Pierre-Sassoulas Pierre-Sassoulas added Checkers Related to a checker Enhancement ✨ Improvement to a component labels Nov 15, 2021
@Pierre-Sassoulas Pierre-Sassoulas added this to the 2.13.0 milestone Nov 15, 2021
@Pierre-Sassoulas Pierre-Sassoulas added the Needs review 🔍 Needs to be reviewed by one or multiple more persons label Nov 15, 2021
@Pierre-Sassoulas Pierre-Sassoulas self-requested a review November 15, 2021 17:16
@coveralls
Copy link

coveralls commented Nov 16, 2021

Pull Request Test Coverage Report for Build 1687416478

  • 159 of 159 (100.0%) changed or added relevant lines in 1 file are covered.
  • 40 unchanged lines in 3 files lost coverage.
  • Overall coverage increased (+0.1%) to 93.842%

Files with Coverage Reduction New Missed Lines %
pylint/checkers/variables.py 5 96.2%
pylint/checkers/refactoring/refactoring_checker.py 6 98.18%
pylint/checkers/classes/class_checker.py 29 94.5%
Totals Coverage Status
Change from base Build 1679646293: 0.1%
Covered Lines: 14644
Relevant Lines: 15605

💛 - Coveralls

@CarliJoy
Copy link
Contributor Author

@Pierre-Sassoulas I am a bit undecided on what to do with the "old" non ascii checker.

I wrote a new checker for two reasons:

  • The old checker did not caught all usages of non-ascii chars in Python code
  • The old checker only raised a Convention Warning but as wrote this to check for a security feature (as described in the PEP), I wanted to raise a Error.

Anyhow having similar checks makes not a lot of sense, and also leads to this last failing tests.
How should I proceed?

A suggestion would be to to remove the old non-ascii checker from NameChecker and include it into the new Unicode name checker. Then I could include an option that raises an error instead of an convention for non ascii names, so this isn't a braking change.

Or we could try to build in the new checks into NameChecker class, but his would increase complexity.

Did you have time to look in the MR yet?
We also could have a call if you like.

I will now try to move out the test files to the correct places and write more proper unit tests (documentation about writing an Plugin is quite different from the requested code contribution style btw...).

@CarliJoy
Copy link
Contributor Author

@Pierre-Sassoulas could it be that detecting column and line ending works only with Python 3.8? Or did I do something wrong regarding returning the correct node? confused

Also I am not sure what to make with the failing Windows tests: I don't have Windows to test stuff atm. but I am a bit confused why it should fail due to unicode errors (as written in the other ASCII only test) -> I remember that unicode worked fine in Windows, when I was using it in my old company: Is this a CI issues?

@Pierre-Sassoulas
Copy link
Member

could it be that detecting column and line ending works only with Python 3.8?

Yes, sorry about that we added an indication in #5418 but it's not merged yet.

@Pierre-Sassoulas
Copy link
Member

Sorry I did not have the time to review yet, finally releasing 2.12 made up for a busy week 😄

@CarliJoy
Copy link
Contributor Author

LoL the Windows functional test is really strange.
I added a unit test using the same file but it is not failing -> Say What?

@Pierre-Sassoulas
Copy link
Member

the Windows functional test is really strange.

The windows tests are rather hard to debug because I don't think we have a Windows developers 😄

I added a unit test using the same file but it is not failing

I think the problem is due to the column being calculated differently for Windows, we check the line, column, end_line, end_column of the messages in functional tests which is not the case in your unit test. The peculiar unicode character might prevent python's ast (and thus astroid) to return the proper value on Windows ?

@CarliJoy
Copy link
Contributor Author

the Windows functional test is really strange.

The windows tests are rather hard to debug because I don't think we have a Windows developers smile

I figured it out, was my fault, I didn't check for byte encoding, when handing the "\r\n" line endings.
Now it seems that only Python 3.8 is failing due to the problem with the end columns...

One open questions is still how to proceed with the two different "non-ascii-" checks.

  • Should I replace the old one with mine?
  • Keep it as it is?
  • Remove my new check (hopefully not)
  • Replace the old one with the new one but add an options that escalates the Convention to an error?

Easiest solution would be the replacement. So I would remove the non-ascii-name from the NameChecker and adopt the error message on mine.

What do you think?

Copy link
Member

@Pierre-Sassoulas Pierre-Sassoulas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding the old checker for non ascii character, I think they should both stay. if your new check does the same thing than the old one (or more), you can rename it using old_names (see the checker for missing-docstring for example).

@CarliJoy
Copy link
Contributor Author

CarliJoy commented Nov 29, 2021

Regarding the old checker for non ascii character, I think they should both stay. if your new check does the same thing than the old one (or more), you can rename it using old_names (see the checker for missing-docstring for example).

It actually does the same very same thing but covers much more use cases (i.e. imports, except as variable names)...
I will look into the missing-docstring stuff to see how to rename it.

If I rename it, the functionality should still stay within Names Checker? Because then we would check for the same thing twice??

@Pierre-Sassoulas
Copy link
Member

Pierre-Sassoulas commented Nov 29, 2021

If I rename it, the functionality should still stay within Names Checker? Because then we would check for the same thing twice??

Seeing it's doing the same thing + more, you could register the new check in your new checker by registering the old name in the new checker so we handle retro-compatibility with the old name, and remove the old check and old code from NameChecker so we do not waste time checking the same thing here. It would be nice to have a functional test using the old name so we can check that it still works like before.

@CarliJoy
Copy link
Contributor Author

@Pierre-Sassoulas I replaced the old name checker as agreed and merge the current master version.

Still I am getting problems with the end-col stuff in Python 3.8 but I really have no clue how I could fix this.

Could you support me fixing it?

@Pierre-Sassoulas
Copy link
Member

It look like the proper column is not available with python 3.8 AST, There's not a lot we can do, parsing better than the AST in astroid, is not something that takes a reasonable time and we almost never do it (We're not testing the line/column before python 3.8 because there is a lot better AST since python 3.8). Maybe it would be enough to test for python 3.9 and superior (we could add a .rc configuration file with min_version). Or we could have two functional files with the same content, one for < 3.9 and one for > 3.9 (one with min_version, one with max_version).

@Pierre-Sassoulas
Copy link
Member

To be clearer : python 3.6 and 3.7 do not have line/column information in this case but it's expected, that's why tests for 3.6 and 3.7 are green.

@DanielNoord
Copy link
Collaborator

Still I am getting problems with the end-col stuff in Python 3.8 but I really have no clue how I could fix this.

This problem (sadly) goes beyond the end line and end column. column in 3.8 for the affected tests is 0, while it should be 4. Thus, 3.8 does not only struggle with end line and end column (which could be fixed by supplying a min_pyver_end_position test option), but also with the beginning column.

I'm not sure what the best way to solve this is: I don't think we have any test options that would allow creating one expected output that satisfies 3.6, 3.7 and 3.8 and one for 3.9 and 3.10. We could create tests for 1) 3.6, 3.7, 2) 3.8 and 3) 3.9, 3.10 but that seems like a lot of duplication.. Not sure what to do here..

@CarliJoy
Copy link
Contributor Author

Why does it not occur in the other functional tests? Is it due to an special AST object I am using?

My suggestion for a solution would be the following:
Lets functional test also look for {testname}.{pyversion}.{txt} i.e.
unicode_homoglyph_class_constant.py38.txt and if it exists prefer it before unicode_homoglyph_class_constant.txt

This way we don't have a lot of duplication in the function test and are also able to special AST behavior of special versions.

What do you think? Should I include this behavior in the PR?

@DanielNoord
Copy link
Collaborator

Why does it not occur in the other functional tests? Is it due to an special AST object I am using?

I think it might be due to the special unicode character that the ast parsers doesn't recognise.. Perhaps using another glyph fixes it?

My suggestion for a solution would be the following: Lets functional test also look for {testname}.{pyversion}.{txt} i.e. unicode_homoglyph_class_constant.py38.txt and if it exists prefer it before unicode_homoglyph_class_constant.txt

This way we don't have a lot of duplication in the function test and are also able to special AST behavior of special versions.

This would still requires us to make three different files right? For 3.6, 3.7, for 3.8 and for 3.9, 3.10? If we make three files we could use max_pyversion as well of course.

What do you think? Should I include this behavior in the PR?

If we go ahead with this this should be done in a different PR.

@CarliJoy
Copy link
Contributor Author

Why does it not occur in the other functional tests? Is it due to an special AST object I am using?

I think it might be due to the special unicode character that the ast parsers doesn't recognise.. Perhaps using another glyph fixes it?

I could try this. But even if it is like this, it would be better if we are aware of this issue and have a test for it or not? Maybe I also look into the Astroid to see if there is a fix for this. Or do you think the problem originates already from the Python 3.8 AST parser?

My suggestion for a solution would be the following: Lets functional test also look for {testname}.{pyversion}.{txt} i.e. unicode_homoglyph_class_constant.py38.txt and if it exists prefer it before unicode_homoglyph_class_constant.txt
This way we don't have a lot of duplication in the function test and are also able to special AST behavior of special versions.

This would still requires us to make three different files right? For 3.6, 3.7, for 3.8 and for 3.9, 3.10? If we make three files we could use max_pyversion as well of course.

Nope this would require only one extra .py38 file at the moment. For all other tests it is running fine.

What do you think? Should I include this behavior in the PR?

If we go ahead with this this should be done in a different PR.
I estimated that ;-)
To make testing easier, I would create it in this PR and then cherry pick the change for a new PR

Should I do this?

@Pierre-Sassoulas
Copy link
Member

I agree with @DanielNoord that it's something that need another MR. I like the idea but it's not a small change.

I think this is necessary in order to handle small inconsistencies in the AST for python interpreter > 3.8 and still being able to check the that the column is correct most of the time. This is the first time the line number is different between 3.8 and 3.9 or 3.10 by the way :) !

I don't think we should start testing the value from python 3.6 and 3.7. Especially not 3.6 as it will not be required anymore in 9 days. The issue happens all the time for those interpreters, and we already test the column number on other versions. Also the wrong information come directly from the AST so it's not linked to pylint and we don't control it. The way we would create the functional tests output would be with the current output, it shows that it's not really useful.

Also some "testname" contain a pyversion already so it would be a tad confusing to keep them this way. Another reason this need another MR.

Regarding inferior or superior to a python version what about:

Filename Example Versions
{testname}.{pyversion}.{txt} unicode_homoglyph_class_constant.py38.txt py38
{testname}.{pyversion}-.{txt} unicode_homoglyph_class_constant.py38-.txt < py38
{testname}.{pyversion}+.{txt} unicode_homoglyph_class_constant.py38+.txt >= py38

@DanielNoord
Copy link
Collaborator

What about making a test for 3.8 and a test for 3.9, 3.10? I forgot 3.6 is deprecated in 9 days.

Although I think we could implement it, the proposed change for test filenames is quite large (also because we already use a similar pattern) and I don't want to give you a lot of extra work @CarliJoy. Writing all of these additional testutils just so we can do 1 functional test on 1 single python version does not seem like a correct work/reward balance.

@Pierre-Sassoulas
Copy link
Member

What about making a test for 3.8 and a test for 3.9, 3.10?

Sounds good, {testname}.{pyversion}.{txt} (expected output for a specific version) is the only one we need right now then.

I just thought about the update script that will need to raise an error when your python interpreter cannot upgrade the generic output. Say you have python 3.8 and unicode_homoglyph_class_constant.py38.txt exists, you update this one but you won't be able to update unicode_homoglyph_class_constant.txt, you'd need py39 ou py310

DanielNoord added a commit that referenced this pull request Jan 5, 2022
Ref #5311

Co-authored-by: Carli* Freudenberg <[email protected]>
DanielNoord added a commit to DanielNoord/pylint that referenced this pull request Jan 5, 2022
Copy link
Collaborator

@DanielNoord DanielNoord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another round. I'm having a very busy week so I don't have enough time to do a full review probably.

I've looked at NonAsciiNamesChecker and requested some changes. @CarliJoy Would you perhaps reconsider putting some of these changes in a separate PR? Especially the NonAsciiNamesChecker should be fairly trivial to extract out of this PR and it would make the reviewing a lot easier. (We'll also get a more tangible result quicker, as we'll likely be able to merge a part of this sooner).

@CarliJoy
Copy link
Contributor Author

CarliJoy commented Jan 6, 2022

@DanielNoord I extract the ASCII checker from this PR as on your request.

@CarliJoy CarliJoy mentioned this pull request Jan 6, 2022
4 tasks
@CarliJoy
Copy link
Contributor Author

CarliJoy commented Jan 6, 2022

I combined the changes. So there are not this crazy amount of commits any more. Content is the same.

@DanielNoord DanielNoord self-requested a review January 7, 2022 08:24
areveny pushed a commit to areveny/pylint that referenced this pull request Jan 9, 2022
Copy link
Collaborator

@DanielNoord DanielNoord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Took a first look at the actual checker, which already seems almost good to go.

I'll take another look at the tests as well tomorrow.

@CarliJoy
Copy link
Contributor Author

github says, that I was mentioned here 2h ago but I don't see where... 😿

Copy link
Collaborator

@DanielNoord DanielNoord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor changes, but rest looks good! Thanks @CarliJoy 🎉

@DanielNoord DanielNoord self-requested a review January 12, 2022 11:53
@CarliJoy
Copy link
Contributor Author

@Pierre-Sassoulas anything left to do from my side?

Copy link
Member

@Pierre-Sassoulas Pierre-Sassoulas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for all the work @CarliJoy, without doubt one of the major addition to pylint 2.13.0 !

suffix[0],
id=f"{codec_and_msg[0]}_{line_ending[1]}_{suffix[1]}",
)
for codec_and_msg, line_ending, suffix in itertools.product(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice parametrize !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Checkers Related to a checker Enhancement ✨ Improvement to a component
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unpaired Bidi Control Character Detection
4 participants