Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't inject unicodedata2 into sys.modules #57

Merged
merged 1 commit into from
Jul 14, 2021
Merged

Conversation

akx
Copy link
Contributor

@akx akx commented Jul 14, 2021

I noticed charset_normalizer meddles with sys.modules, causing this:

>>> import charset_normalizer
>>> import unicodedata
>>> unicodedata
<module 'unicodedata2' from '.../site-packages/unicodedata2.cpython-39-darwin.so'>

This PR fixes that by using a fairly standard try: except ImportError: guard instead of the sys.modules hook.

>>> import charset_normalizer
>>> import unicodedata
>>> unicodedata
<module 'unicodedata' from '.../python3.9/lib-dynload/unicodedata.cpython-39-darwin.so'>

@Ousret
Copy link
Member

Ousret commented Jul 14, 2021

Hi @akx

Thanks for your report and your PR proposal.

Why do you say that PR "fixes" when this behavior is well documented and is expected?
I would be open to change that, considering recent events.

I have some further questions regarding this patch.
Why did you install and not expect it to be used instead of your cPython unicodedata distribution?

Removing the charset_normalizer.hook and replacing it with a plain import won't be useful unless we create some intermediary compat like solution.

It would be nicer to propose something that both keeps the backward compatibility AND "fix" your concern.

Thanks.

@akx
Copy link
Contributor Author

akx commented Jul 14, 2021

Hi @Ousret.

As far as I can see, it's certainly not documented that simply importing charset_normalizer while having unicodedata2 installed will make it impossible to access unicodedata (unless you had imported unicodedata beforehand).

Sure, it's documented that charset_normalizer will internally use unicodedata2 if it's available, and that's fine, but I don't expect it to mess with the global state of my Python interpreter. If I wanted to patch in unicodedata2 for all unicodedatas in my interpreter, I'd want to be explicit about that. (Explicit is better than implicit.)

My proposal is exactly the same mechanism that patches in charset_normalizer for chardet in requests (which is how I stumbled upon this library in the first place):

https://github.com/psf/requests/blob/a1a6a549a0143d9b32717dbe3d75cd543ae5a4f6/requests/compat.py#L11-L14

@Ousret
Copy link
Member

Ousret commented Jul 14, 2021

I really do not understand why having more up-to-date unicodedata is messing with your global Python environment. Anyway.
Like I said earlier, I am open to change that behavior, but I must be convinced that using your method/patch actually does what is expected.

@akx
Copy link
Contributor Author

akx commented Jul 14, 2021

I really do not understand why having more up-to-date unicodedata is messing with your global Python environment.

It's a matter of principle. If I do import unicodedata, I want unicodedata, not another module just because I had happened to import an unrelated module (charset_normalizer) before importing unicodedata.

It's maybe slightly far-fetched, but without this patch, you can't easily write a program that compares the differences between unicodedata and unicodedata2 unless you're specific about your import order!

I must be convinced that using your method/patch actually does what is expected.

Well, I'm pretty sure the CI suite for charset_normalizer would show that it doesn't break things. Can you enable running the GitHub workflows for this PR?

@Ousret
Copy link
Member

Ousret commented Jul 14, 2021

I need another person's opinion on that. @sethmlarson What do you think of that?
Unfornutally, in the current state the test suite does not prove without a doupt that unicodedata2 is correctly used. I am working on it.

How is that the right thing for us?

import unicodata2 as unicodedata

"a".isalpha()

What I am concern about is methods like .isalpha() from a str instance. How do you ptr them to unicodedata2?

@akx
Copy link
Contributor Author

akx commented Jul 14, 2021

import unicodedata2 as unicodedata

"a".isalpha()

What I am concern about is methods like .isalpha() from a str instance. How do you ptr them to unicodedata2?

As mentioned in #57 (comment) , I don't believe having unicodedata2 installed will help at all with str.isalpha(), etc. calls. In other words, you'd need to have your own def isalpha(s) sort of implementation that'd consult unicodedata2's tables, and doing that in pure Python will likely be slow.

@sethmlarson
Copy link

@Ousret Don't know much about the unicodedata module or whether it even interacts with isalpha, from a brief reading of CPython it doesn't seem like it interacts with this static table: https://github.com/python/cpython/blob/bb3e0c240bc60fe08d332ff5955d54197f79751c/Python/pyctype.c

If you chase down the implementation of str.isalpha it ends up checking flags in this table, at least from my reading.

If this is the case we shouldn't be injecting unicodedata2 as it doesn't modify str.isalpha behavior.

@akx
Copy link
Contributor Author

akx commented Jul 14, 2021

@sethmlarson

whether it even interacts with isalpha, from a brief reading of CPython it doesn't seem like it interacts with this static table: https://github.com/python/cpython/blob/bb3e0c240bc60fe08d332ff5955d54197f79751c/Python/pyctype.c

That's the non-Unicode table. See #57 (comment) for the Unicode isalpha chase.

@codecov-commenter
Copy link

codecov-commenter commented Jul 14, 2021

Codecov Report

Merging #57 (a30cf9f) into master (929f13c) will increase coverage by 0.03%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #57      +/-   ##
==========================================
+ Coverage   84.46%   84.49%   +0.03%     
==========================================
  Files          12       11       -1     
  Lines        1062     1058       -4     
==========================================
- Hits          897      894       -3     
+ Misses        165      164       -1     
Impacted Files Coverage Δ
charset_normalizer/__init__.py 100.00% <ø> (ø)
charset_normalizer/utils.py 76.25% <100.00%> (+0.52%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 929f13c...a30cf9f. Read the comment docs.

@Ousret
Copy link
Member

Ousret commented Jul 14, 2021

Thanks, @akx @sethmlarson for your inputs.
I am okay with merging this. I don't know when a new tag will be available, It would be wise to wait upon any remarks or concerns beforehand.

Do you have any more concerns @akx ?

@akx
Copy link
Contributor Author

akx commented Jul 14, 2021

@Ousret I don't have any further concerns regarding this PR :) Thank you for your consideration!

@Ousret
Copy link
Member

Ousret commented Jul 14, 2021

You are welcome, thanks for your contribution as well.

@Ousret Ousret merged commit 53b2dab into jawah:master Jul 14, 2021
@Ousret Ousret mentioned this pull request Jul 14, 2021
@Ousret
Copy link
Member

Ousret commented Jul 14, 2021

@dopplershift
Copy link

I just found this while trying to debug a problem, and I just wanted to make absolutely clear that the old way causes problems. With charset_normalizer 2.0.0 (and 2.0.1), I saw this behavior:

No import:

❯ python
Python 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 05:59:23) 
[Clang 11.1.0 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> compile(r's="$\N{DEGREE FAHRENHEIT}$"', 'foo.py', 'exec')
<code object <module> at 0x7fde000c4810, file "foo.py", line 1>
>>> 

With import charset_normalizer:

❯ python
Python 3.7.12 | packaged by conda-forge | (default, Oct 26 2021, 05:59:23) 
[Clang 11.1.0 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import charset_normalizer
>>> compile(r's="$\N{DEGREE FAHRENHEIT}$"', 'foo.py', 'exec')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "foo.py", line 1
SyntaxError: (unicode error) \N escapes not supported (can't load unicodedata module)
>>> 

With the fix merged here in 2.0.2 (and the latest 2.0.7), the problem disappears. (I have an entirely different issue to solve where conda-forge is still giving me 2.0.0 🤷‍♂️ )

@Ousret
Copy link
Member

Ousret commented Nov 24, 2021

Thanks for the report. Indeed I saw this weird behavior at conda-forge giving by default 2.0.0 instead of latest. I do not know why and who to reach to explain that. https://anaconda.org/conda-forge/charset-normalizer/files The download tendency is revealing that something seems wrong. Could it be that the requests version requirement ~=2.0.* for this lib is applied in the wrong way?

@dopplershift
Copy link

Incorrect dependencies for requests was exactly the reason: conda-forge/requests-feedstock#48

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

5 participants