Decoding error with faulty Websites encoding #138

baderdean · 2024-10-23T17:07:37Z

Decoding websites with faulty encoding, like this one, raises an error: https://www.societe.com/societe/ankaboot-832320170.html

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 2: invalid continuation byte
Exception ignored in: 'selectolax.lexbor.text_callback'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe0 in position 2: invalid continuation byte

This could be fixed by changing the decode error policy from "strict" (the default) to "replace" in selectolax/selectolax/lexbor/node.pxi, line 863 in 19ee5e0.

Current:

py_str = text.decode(_ENCODING)

Proposed:

py_str = text.decode(_ENCODING, "replace")
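To illustrate how the error policies differ, here is a minimal sketch using plain Python `bytes.decode`. It does not touch selectolax itself; the sample bytes and the helper name `decode_text` are made up for illustration.

```python
# Illustration only: RAW and decode_text are hypothetical,
# not part of selectolax.
_ENCODING = "utf-8"

# "voilà tout" encoded as Latin-1: the byte 0xe0 is not valid UTF-8 here.
RAW = b"voil\xe0 tout"


def decode_text(raw: bytes, errors: str = "strict") -> str:
    """Decode raw bytes as UTF-8 using the given error policy."""
    return raw.decode(_ENCODING, errors)


if __name__ == "__main__":
    try:
        decode_text(RAW)  # "strict" raises on the stray 0xe0 byte
    except UnicodeDecodeError as exc:
        print("strict:", exc.reason)
    print("replace:", decode_text(RAW, "replace"))
    print("ignore:", decode_text(RAW, "ignore"))
```

With "replace" the invalid byte becomes U+FFFD (the replacement character), while "ignore" silently drops it.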


JuroOravec · 2024-10-25T07:30:23Z

Hi, I just had a look, and it turns out that the Modest HTMLParser already handles this by accepting a decode_errors kwarg:

HTMLParser(html, decode_errors="ignore")

Also, for Modest the default is "ignore".

I updated the code to have the same behavior for Lexbor, but I still need to add tests and documentation, so I'll finish that over the weekend :)

pineapple-pokopo · 2024-12-30T10:04:50Z

Hi @JuroOravec, thanks for the awesome package! Is there any ETA on a fix for this issue?

JuroOravec mentioned this issue Oct 25, 2024

Align API for Modest / Lexbor implementations #140

Open
