Decoding error with faulty Websites encoding #138

Open
baderdean opened this issue Oct 23, 2024 · 2 comments

Comments

@baderdean

baderdean commented Oct 23, 2024

Parsing a website with a faulty encoding, such as https://www.societe.com/societe/ankaboot-832320170.html, raises a decoding error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 2: invalid continuation byte
Exception ignored in: 'selectolax.lexbor.text_callback'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 2: invalid continuation byte

This could be fixed by changing the error policy of the decode call from "strict" (the default) to "replace", i.e. by changing the first line below into the second:

py_str = text.decode(_ENCODING)

py_str = text.decode(_ENCODING, "replace")
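For illustration, here is a minimal standalone sketch of how Python's standard decode error handlers differ on bytes that are not valid UTF-8. The sample bytes are an assumption made up for this example (Latin-1 encoded "déjà"), not the actual content of the page above, and the snippet uses only the built-in `bytes.decode`, not selectolax itself.

```python
# Hypothetical sample: "déjà" encoded as Latin-1, which is not valid UTF-8.
raw = b"d\xe9j\xe0"

# "strict" (the default) raises on the first invalid byte sequence.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# "replace" substitutes U+FFFD (the replacement character) for each bad sequence.
replaced = raw.decode("utf-8", "replace")

# "ignore" silently drops the bad sequences.
ignored = raw.decode("utf-8", "ignore")

print(strict_failed, repr(replaced), repr(ignored))
```

With "replace" the damage stays visible in the output text, whereas "ignore" loses the affected characters without a trace; which one suits better probably depends on the use case.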

@JuroOravec
Collaborator

Hi, I just had a look, and it turns out that the Modest HTMLParser already handles this: it accepts a decode_errors kwarg:

HTMLParser(html, decode_errors="ignore")

For Modest, the default is also "ignore".

I've updated the code so that Lexbor behaves the same way, but I still need to add tests and documentation, so I'll finish that over the weekend :)

@pineapple-pokopo

Hi @JuroOravec, thanks for the awesome package! Is there any ETA on a fix for this issue?
