add a checker for misleading unicode #5311

CarliJoy · 2021-11-15T16:27:53Z

Type of Changes

	Type
✓	✨ New feature
✓	📜 Docs

Description

Unicode and some other ASCII characters can be used to create programs that run
much different as a human reader would expect it.

PEP 672 lists some examples.

The added checkers are intended to make users aware of these issues.

Some of the test might be too strict for users programming in other languages as
english. As we assume this is minority than can disable the given tests, this should
be not an issue.

Note that the codes try to be prepared for UTF-16 and UTF-32 even so Python files
can't be encoded in them yet.
There is a checker that would detect UTF-16/UTF-32 files and fail, so even the usage
of an old pylint version would not cause a "security hole" when used with new Python
version that allow these encodings.

Closes #5281

CarliJoy · 2021-11-15T16:52:09Z

@Pierre-Sassoulas here my suggestion. Sure some work / decision have to put into them.
Also I wasn't sure about how to use the function tests in opposite the unit tests.

tests/data/unicode/early_return.txt

coveralls · 2021-11-16T07:25:36Z

Pull Request Test Coverage Report for Build 1687416478

159 of 159 (100.0%) changed or added relevant lines in 1 file are covered.
40 unchanged lines in 3 files lost coverage.
Overall coverage increased (+0.1%) to 93.842%

Files with Coverage Reduction	New Missed Lines	%
pylint/checkers/variables.py	5	96.2%
pylint/checkers/refactoring/refactoring_checker.py	6	98.18%
pylint/checkers/classes/class_checker.py	29	94.5%

Totals
Change from base Build 1679646293:	0.1%
Covered Lines:	14644
Relevant Lines:	15605

💛 - Coveralls

CarliJoy · 2021-11-26T10:55:35Z

@Pierre-Sassoulas I am a bit undecided on what to do with the "old" non ascii checker.

I wrote a new checker for two reasons:

The old checker did not caught all usages of non-ascii chars in Python code
The old checker only raised a Convention Warning but as wrote this to check for a security feature (as described in the PEP), I wanted to raise a Error.

Anyhow having similar checks makes not a lot of sense, and also leads to this last failing tests.
How should I proceed?

A suggestion would be to to remove the old non-ascii checker from NameChecker and include it into the new Unicode name checker. Then I could include an option that raises an error instead of an convention for non ascii names, so this isn't a braking change.

Or we could try to build in the new checks into NameChecker class, but his would increase complexity.

Did you have time to look in the MR yet?
We also could have a call if you like.

I will now try to move out the test files to the correct places and write more proper unit tests (documentation about writing an Plugin is quite different from the requested code contribution style btw...).

CarliJoy · 2021-11-28T21:59:36Z

@Pierre-Sassoulas could it be that detecting column and line ending works only with Python 3.8? Or did I do something wrong regarding returning the correct node? confused

Also I am not sure what to make with the failing Windows tests: I don't have Windows to test stuff atm. but I am a bit confused why it should fail due to unicode errors (as written in the other ASCII only test) -> I remember that unicode worked fine in Windows, when I was using it in my old company: Is this a CI issues?

Pierre-Sassoulas · 2021-11-28T22:09:45Z

could it be that detecting column and line ending works only with Python 3.8?

Yes, sorry about that we added an indication in #5418 but it's not merged yet.

Pierre-Sassoulas · 2021-11-28T22:12:19Z

Sorry I did not have the time to review yet, finally releasing 2.12 made up for a busy week 😄

CarliJoy · 2021-11-29T17:01:10Z

LoL the Windows functional test is really strange.
I added a unit test using the same file but it is not failing -> Say What?

Pierre-Sassoulas · 2021-11-29T17:29:39Z

the Windows functional test is really strange.

The windows tests are rather hard to debug because I don't think we have a Windows developers 😄

I added a unit test using the same file but it is not failing

I think the problem is due to the column being calculated differently for Windows, we check the line, column, end_line, end_column of the messages in functional tests which is not the case in your unit test. The peculiar unicode character might prevent python's ast (and thus astroid) to return the proper value on Windows ?

CarliJoy · 2021-11-29T18:31:33Z

the Windows functional test is really strange.

The windows tests are rather hard to debug because I don't think we have a Windows developers smile

I figured it out, was my fault, I didn't check for byte encoding, when handing the "\r\n" line endings.
Now it seems that only Python 3.8 is failing due to the problem with the end columns...

One open questions is still how to proceed with the two different "non-ascii-" checks.

Should I replace the old one with mine?
Keep it as it is?
Remove my new check (hopefully not)
Replace the old one with the new one but add an options that escalates the Convention to an error?

Easiest solution would be the replacement. So I would remove the non-ascii-name from the NameChecker and adopt the error message on mine.

What do you think?

Pierre-Sassoulas

Regarding the old checker for non ascii character, I think they should both stay. if your new check does the same thing than the old one (or more), you can rename it using old_names (see the checker for missing-docstring for example).

doc/whatsnew/2.12.rst

CarliJoy · 2021-11-29T18:57:39Z

Regarding the old checker for non ascii character, I think they should both stay. if your new check does the same thing than the old one (or more), you can rename it using old_names (see the checker for missing-docstring for example).

It actually does the same very same thing but covers much more use cases (i.e. imports, except as variable names)...
I will look into the missing-docstring stuff to see how to rename it.

If I rename it, the functionality should still stay within Names Checker? Because then we would check for the same thing twice??

Pierre-Sassoulas · 2021-11-29T19:12:15Z

If I rename it, the functionality should still stay within Names Checker? Because then we would check for the same thing twice??

Seeing it's doing the same thing + more, you could register the new check in your new checker by registering the old name in the new checker so we handle retro-compatibility with the old name, and remove the old check and old code from NameChecker so we do not waste time checking the same thing here. It would be nice to have a functional test using the old name so we can check that it still works like before.

CarliJoy · 2021-12-13T16:31:14Z

@Pierre-Sassoulas I replaced the old name checker as agreed and merge the current master version.

Still I am getting problems with the end-col stuff in Python 3.8 but I really have no clue how I could fix this.

Could you support me fixing it?

Pierre-Sassoulas · 2021-12-13T16:39:05Z

It look like the proper column is not available with python 3.8 AST, There's not a lot we can do, parsing better than the AST in astroid, is not something that takes a reasonable time and we almost never do it (We're not testing the line/column before python 3.8 because there is a lot better AST since python 3.8). Maybe it would be enough to test for python 3.9 and superior (we could add a .rc configuration file with min_version). Or we could have two functional files with the same content, one for < 3.9 and one for > 3.9 (one with min_version, one with max_version).

Pierre-Sassoulas · 2021-12-13T16:41:07Z

To be clearer : python 3.6 and 3.7 do not have line/column information in this case but it's expected, that's why tests for 3.6 and 3.7 are green.

DanielNoord · 2021-12-13T16:59:14Z

Still I am getting problems with the end-col stuff in Python 3.8 but I really have no clue how I could fix this.

This problem (sadly) goes beyond the end line and end column. column in 3.8 for the affected tests is 0, while it should be 4. Thus, 3.8 does not only struggle with end line and end column (which could be fixed by supplying a min_pyver_end_position test option), but also with the beginning column.

I'm not sure what the best way to solve this is: I don't think we have any test options that would allow creating one expected output that satisfies 3.6, 3.7 and 3.8 and one for 3.9 and 3.10. We could create tests for 1) 3.6, 3.7, 2) 3.8 and 3) 3.9, 3.10 but that seems like a lot of duplication.. Not sure what to do here..

CarliJoy · 2021-12-14T11:32:51Z

Why does it not occur in the other functional tests? Is it due to an special AST object I am using?

My suggestion for a solution would be the following:
Lets functional test also look for {testname}.{pyversion}.{txt} i.e.
unicode_homoglyph_class_constant.py38.txt and if it exists prefer it before unicode_homoglyph_class_constant.txt

This way we don't have a lot of duplication in the function test and are also able to special AST behavior of special versions.

What do you think? Should I include this behavior in the PR?

DanielNoord · 2021-12-14T11:36:35Z

Why does it not occur in the other functional tests? Is it due to an special AST object I am using?

I think it might be due to the special unicode character that the ast parsers doesn't recognise.. Perhaps using another glyph fixes it?

My suggestion for a solution would be the following: Lets functional test also look for {testname}.{pyversion}.{txt} i.e. unicode_homoglyph_class_constant.py38.txt and if it exists prefer it before unicode_homoglyph_class_constant.txt

This way we don't have a lot of duplication in the function test and are also able to special AST behavior of special versions.

This would still requires us to make three different files right? For 3.6, 3.7, for 3.8 and for 3.9, 3.10? If we make three files we could use max_pyversion as well of course.

What do you think? Should I include this behavior in the PR?

If we go ahead with this this should be done in a different PR.

CarliJoy · 2021-12-14T11:50:49Z

Why does it not occur in the other functional tests? Is it due to an special AST object I am using?

I think it might be due to the special unicode character that the ast parsers doesn't recognise.. Perhaps using another glyph fixes it?

I could try this. But even if it is like this, it would be better if we are aware of this issue and have a test for it or not? Maybe I also look into the Astroid to see if there is a fix for this. Or do you think the problem originates already from the Python 3.8 AST parser?

My suggestion for a solution would be the following: Lets functional test also look for {testname}.{pyversion}.{txt} i.e. unicode_homoglyph_class_constant.py38.txt and if it exists prefer it before unicode_homoglyph_class_constant.txt
This way we don't have a lot of duplication in the function test and are also able to special AST behavior of special versions.

This would still requires us to make three different files right? For 3.6, 3.7, for 3.8 and for 3.9, 3.10? If we make three files we could use max_pyversion as well of course.

Nope this would require only one extra .py38 file at the moment. For all other tests it is running fine.

What do you think? Should I include this behavior in the PR?

If we go ahead with this this should be done in a different PR.
I estimated that ;-)
To make testing easier, I would create it in this PR and then cherry pick the change for a new PR

Should I do this?

Pierre-Sassoulas · 2021-12-14T11:59:39Z

I agree with @DanielNoord that it's something that need another MR. I like the idea but it's not a small change.

I think this is necessary in order to handle small inconsistencies in the AST for python interpreter > 3.8 and still being able to check the that the column is correct most of the time. This is the first time the line number is different between 3.8 and 3.9 or 3.10 by the way :) !

I don't think we should start testing the value from python 3.6 and 3.7. Especially not 3.6 as it will not be required anymore in 9 days. The issue happens all the time for those interpreters, and we already test the column number on other versions. Also the wrong information come directly from the AST so it's not linked to pylint and we don't control it. The way we would create the functional tests output would be with the current output, it shows that it's not really useful.

Also some "testname" contain a pyversion already so it would be a tad confusing to keep them this way. Another reason this need another MR.

Regarding inferior or superior to a python version what about:

Filename	Example	Versions
`{testname}.{pyversion}.{txt}`	`unicode_homoglyph_class_constant.py38.txt`	py38
`{testname}.{pyversion}-.{txt}`	`unicode_homoglyph_class_constant.py38-.txt`	< py38
`{testname}.{pyversion}+.{txt}`	`unicode_homoglyph_class_constant.py38+.txt`	>= py38

DanielNoord · 2021-12-14T12:03:47Z

What about making a test for 3.8 and a test for 3.9, 3.10? I forgot 3.6 is deprecated in 9 days.

Although I think we could implement it, the proposed change for test filenames is quite large (also because we already use a similar pattern) and I don't want to give you a lot of extra work @CarliJoy. Writing all of these additional testutils just so we can do 1 functional test on 1 single python version does not seem like a correct work/reward balance.

Pierre-Sassoulas · 2021-12-14T12:33:47Z

What about making a test for 3.8 and a test for 3.9, 3.10?

Sounds good, {testname}.{pyversion}.{txt} (expected output for a specific version) is the only one we need right now then.

I just thought about the update script that will need to raise an error when your python interpreter cannot upgrade the generic output. Say you have python 3.8 and unicode_homoglyph_class_constant.py38.txt exists, you update this one but you won't be able to update unicode_homoglyph_class_constant.txt, you'd need py39 ou py310

Ref #5311 Co-authored-by: Carli* Freudenberg <[email protected]>

Ref pylint-dev#5311 Co-authored-by: Carli* Freudenberg <[email protected]>

DanielNoord

Another round. I'm having a very busy week so I don't have enough time to do a full review probably.

I've looked at NonAsciiNamesChecker and requested some changes. @CarliJoy Would you perhaps reconsider putting some of these changes in a separate PR? Especially the NonAsciiNamesChecker should be fairly trivial to extract out of this PR and it would make the reviewing a lot easier. (We'll also get a more tangible result quicker, as we'll likely be able to merge a part of this sooner).

doc/whatsnew/2.13.rst

pylint/checkers/non_ascii_names.py

CarliJoy · 2022-01-06T10:26:48Z

@DanielNoord I extract the ASCII checker from this PR as on your request.

CarliJoy · 2022-01-06T19:31:38Z

I combined the changes. So there are not this crazy amount of commits any more. Content is the same.

Ref pylint-dev#5311 Co-authored-by: Carli* Freudenberg <[email protected]>

DanielNoord

Took a first look at the actual checker, which already seems almost good to go.

I'll take another look at the tests as well tomorrow.

base_checker.py

pylint/checkers/unicode.py

Co-authored-by: Daniël van Noord <[email protected]>

CarliJoy · 2022-01-11T09:49:54Z

github says, that I was mentioned here 2h ago but I don't see where... 😿

DanielNoord

Minor changes, but rest looks good! Thanks @CarliJoy 🎉

pylint/checkers/unicode.py

tests/checkers/unittest_unicode/__init__.py

tests/checkers/unittest_unicode/bad_chars.py

Co-authored-by: Daniël van Noord <[email protected]>

CarliJoy · 2022-01-13T09:08:58Z

@Pierre-Sassoulas anything left to do from my side?

Pierre-Sassoulas

Thanks a lot for all the work @CarliJoy, without doubt one of the major addition to pylint 2.13.0 !

Pierre-Sassoulas · 2022-01-13T11:20:56Z

tests/checkers/unittest_unicode/unittest_bad_chars.py

+                suffix[0],
+                id=f"{codec_and_msg[0]}_{line_ending[1]}_{suffix[1]}",
+            )
+            for codec_and_msg, line_ending, suffix in itertools.product(


Nice parametrize !

Pierre-Sassoulas added Checkers Related to a checker Enhancement ✨ Improvement to a component labels Nov 15, 2021

Pierre-Sassoulas added this to the 2.13.0 milestone Nov 15, 2021

Pierre-Sassoulas added the Needs review 🔍 Needs to be reviewed by one or multiple more persons label Nov 15, 2021

Pierre-Sassoulas self-requested a review November 15, 2021 17:16

Pierre-Sassoulas reviewed Nov 15, 2021

View reviewed changes

tests/data/unicode/early_return.txt Outdated Show resolved Hide resolved

CarliJoy force-pushed the unicode_checks branch from 235a273 to af44181 Compare November 18, 2021 09:13

DanielNoord mentioned this pull request Nov 25, 2021

Move TestVariablesChecker to functional tests #5400

Merged

2 tasks

Pierre-Sassoulas reviewed Nov 29, 2021

View reviewed changes

doc/whatsnew/2.12.rst Outdated Show resolved Hide resolved

DanielNoord added a commit that referenced this pull request Jan 5, 2022

Move HUMAN_READABLE_TYPES to constants (#5642)

29997f8

Ref #5311 Co-authored-by: Carli* Freudenberg <[email protected]>

DanielNoord added a commit to DanielNoord/pylint that referenced this pull request Jan 5, 2022

Move non-ascii-name to separate file.

f24ac16

Ref pylint-dev#5311 Co-authored-by: Carli* Freudenberg <[email protected]>

DanielNoord requested changes Jan 5, 2022

View reviewed changes

CarliJoy mentioned this pull request Jan 6, 2022

Improve non ascii checker #5643

Merged

4 tasks

add checker for invalid unicode chars

734ca46

CarliJoy force-pushed the unicode_checks branch from dcf244a to 734ca46 Compare January 6, 2022 19:28

DanielNoord self-requested a review January 7, 2022 08:24

areveny pushed a commit to areveny/pylint that referenced this pull request Jan 9, 2022

Move HUMAN_READABLE_TYPES to constants (pylint-dev#5642)

32c6f3a

Ref pylint-dev#5311 Co-authored-by: Carli* Freudenberg <[email protected]>

DanielNoord requested changes Jan 10, 2022

View reviewed changes

Pierre-Sassoulas and others added 5 commits January 10, 2022 22:51

Merge branch 'main' into unicode_checks

a4c45ea

Apply suggestions from code review

c6d46e1

Co-authored-by: Daniël van Noord <[email protected]>

Adopt comment from code review

34df5d9

adopt new checker id, fix functional test

0db5f0e

add missing docstring to unittest

4b553f3

DanielNoord approved these changes Jan 12, 2022

View reviewed changes

CarliJoy and others added 3 commits January 12, 2022 10:59

Apply suggestions from code review

041fecb

Co-authored-by: Daniël van Noord <[email protected]>

apply changes from pr review

b76c64a

improve readability

2d0a2ec

DanielNoord self-requested a review January 12, 2022 11:53

fix test names

36b442a

CarliJoy requested a review from Pierre-Sassoulas January 12, 2022 11:57

Pierre-Sassoulas approved these changes Jan 13, 2022

View reviewed changes

Pierre-Sassoulas removed the Needs review 🔍 Needs to be reviewed by one or multiple more persons label Jan 13, 2022

Pierre-Sassoulas merged commit aaa3770 into pylint-dev:main Jan 13, 2022

Lucas-C mentioned this pull request Feb 7, 2022

New check: B113: TrojanSource - Bidirectional control characters PyCQA/bandit#757

Merged

Pierre-Sassoulas mentioned this pull request Feb 12, 2023

HELP WANTED: Help copy content from pylint-errors over to our own documentation #5953

Closed

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add a checker for misleading unicode #5311

add a checker for misleading unicode #5311

CarliJoy commented Nov 15, 2021

CarliJoy commented Nov 15, 2021

coveralls commented Nov 16, 2021 •

edited

Loading

CarliJoy commented Nov 26, 2021

CarliJoy commented Nov 28, 2021

Pierre-Sassoulas commented Nov 28, 2021

Pierre-Sassoulas commented Nov 28, 2021

CarliJoy commented Nov 29, 2021

Pierre-Sassoulas commented Nov 29, 2021

CarliJoy commented Nov 29, 2021

Pierre-Sassoulas left a comment

CarliJoy commented Nov 29, 2021 •

edited

Loading

Pierre-Sassoulas commented Nov 29, 2021 •

edited

Loading

CarliJoy commented Dec 13, 2021

Pierre-Sassoulas commented Dec 13, 2021

Pierre-Sassoulas commented Dec 13, 2021

DanielNoord commented Dec 13, 2021

CarliJoy commented Dec 14, 2021

DanielNoord commented Dec 14, 2021

CarliJoy commented Dec 14, 2021

Pierre-Sassoulas commented Dec 14, 2021

DanielNoord commented Dec 14, 2021

Pierre-Sassoulas commented Dec 14, 2021

DanielNoord left a comment

CarliJoy commented Jan 6, 2022

CarliJoy commented Jan 6, 2022

DanielNoord left a comment

CarliJoy commented Jan 11, 2022

DanielNoord left a comment

CarliJoy commented Jan 13, 2022

Pierre-Sassoulas left a comment

Pierre-Sassoulas Jan 13, 2022

add a checker for misleading unicode #5311

add a checker for misleading unicode #5311

Conversation

CarliJoy commented Nov 15, 2021

Type of Changes

Description

CarliJoy commented Nov 15, 2021

coveralls commented Nov 16, 2021 • edited Loading

Pull Request Test Coverage Report for Build 1687416478

💛 - Coveralls

CarliJoy commented Nov 26, 2021

CarliJoy commented Nov 28, 2021

Pierre-Sassoulas commented Nov 28, 2021

Pierre-Sassoulas commented Nov 28, 2021

CarliJoy commented Nov 29, 2021

Pierre-Sassoulas commented Nov 29, 2021

CarliJoy commented Nov 29, 2021

Pierre-Sassoulas left a comment

Choose a reason for hiding this comment

CarliJoy commented Nov 29, 2021 • edited Loading

Pierre-Sassoulas commented Nov 29, 2021 • edited Loading

CarliJoy commented Dec 13, 2021

Pierre-Sassoulas commented Dec 13, 2021

Pierre-Sassoulas commented Dec 13, 2021

DanielNoord commented Dec 13, 2021

CarliJoy commented Dec 14, 2021

DanielNoord commented Dec 14, 2021

CarliJoy commented Dec 14, 2021

Pierre-Sassoulas commented Dec 14, 2021

DanielNoord commented Dec 14, 2021

Pierre-Sassoulas commented Dec 14, 2021

DanielNoord left a comment

Choose a reason for hiding this comment

CarliJoy commented Jan 6, 2022

CarliJoy commented Jan 6, 2022

DanielNoord left a comment

Choose a reason for hiding this comment

CarliJoy commented Jan 11, 2022

DanielNoord left a comment

Choose a reason for hiding this comment

CarliJoy commented Jan 13, 2022

Pierre-Sassoulas left a comment

Choose a reason for hiding this comment

Pierre-Sassoulas Jan 13, 2022

Choose a reason for hiding this comment

coveralls commented Nov 16, 2021 •

edited

Loading

CarliJoy commented Nov 29, 2021 •

edited

Loading

Pierre-Sassoulas commented Nov 29, 2021 •

edited

Loading