
LookupError: unknown encoding: 'b'utf8'' #307

Closed
jh-rheinhardt opened this issue Nov 20, 2024 · 12 comments · Fixed by #308

@jh-rheinhardt

Installed Scrapy using conda on a Windows 11 machine. While working through the tutorial,

response.css("title")

gives this error:

File "src\lxml\parser.pxi", line 1806, in lxml.etree.HTMLParser.init
File "src\lxml\parser.pxi", line 858, in lxml.etree._BaseParser.init
LookupError: unknown encoding: 'b'utf8''

scrapy 2.12.0 py311h1ea47a8_1 conda-forge
twisted 23.10.0 py311haa95532_0
lxml 5.2.1 py311h395c83e_1

@Gallaecio
Member

Could you provide the entire spider code that produces the error, and the full traceback? (You shared only its last 3 lines.)

@jh-rheinhardt
Author

jh-rheinhardt commented Nov 20, 2024

from pathlib import Path

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
        "https://quotes.toscrape.com/page/2/",
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        Path(filename).write_bytes(response.body)

> scrapy shell "https://quotes.toscrape.com/page/1/"
>>> response.css("title")


Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "C:\...\anaconda3\envs\WebScraping\Lib\site-packages\scrapy\http\response\text.py", line 168, in css
    return cast(SelectorList, self.selector.css(query))
                              ^^^^^^^^^^^^^
  File "C:\...\anaconda3\envs\WebScraping\Lib\site-packages\scrapy\http\response\text.py", line 147, in selector
    self._cached_selector = Selector(self)
                            ^^^^^^^^^^^^^^
  File "C:\...\anaconda3\envs\WebScraping\Lib\site-packages\scrapy\selector\unified.py", line 102, in __init__
    super().__init__(text=text, type=st, **kwargs)
  File "C:\...\anaconda3\envs\WebScraping\Lib\site-packages\parsel\selector.py", line 496, in __init__
    root, type = _get_root_and_type_from_text(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\...\anaconda3\envs\WebScraping\Lib\site-packages\parsel\selector.py", line 377, in _get_root_and_type_from_text
    root = _get_root_from_text(text, type=type, **lxml_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\...\anaconda3\envs\WebScraping\Lib\site-packages\parsel\selector.py", line 329, in _get_root_from_text
    return create_root_node(text, _ctgroup[type]["_parser"], **lxml_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\...\anaconda3\envs\WebScraping\Lib\site-packages\parsel\selector.py", line 110, in create_root_node
    parser = parser_cls(recover=True, encoding=encoding, huge_tree=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\...\anaconda3\envs\WebScraping\Lib\site-packages\lxml\html\__init__.py", line 1887, in __init__
    super().__init__(**kwargs)
  File "src\lxml\parser.pxi", line 1806, in lxml.etree.HTMLParser.__init__
  File "src\lxml\parser.pxi", line 858, in lxml.etree._BaseParser.__init__
LookupError: unknown encoding: 'b'utf8''

@Gallaecio
Member

Gallaecio commented Nov 21, 2024

I cannot reproduce the issue with Scrapy 2.12:

$ scrapy shell "https://quotes.toscrape.com/page/1/"
2024-11-21 12:36:57 [scrapy.utils.log] INFO: Scrapy 2.12.0 started (bot: scrapybot)
2024-11-21 12:36:57 [scrapy.utils.log] INFO: Versions: lxml 5.3.0.0, libxml2 2.12.9, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.10.0, Python 3.12.7 (main, Oct  1 2024, 11:15:50) [GCC 14.2.1 20240910], pyOpenSSL 24.2.1 (OpenSSL 3.3.2 3 Sep 2024), cryptography 43.0.3, Platform Linux-6.11.8-zen1-2-zen-x86_64-with-glibc2.40
2024-11-21 12:36:57 [scrapy.addons] INFO: Enabled addons:
[]
2024-11-21 12:36:57 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2024-11-21 12:36:57 [scrapy.extensions.telnet] INFO: Telnet Password: 7f709a2a4ad7bf38
2024-11-21 12:36:57 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2024-11-21 12:36:57 [scrapy.crawler] INFO: Overridden settings:
{'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
 'LOGSTATS_INTERVAL': 0}
2024-11-21 12:36:57 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-11-21 12:36:57 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-11-21 12:36:57 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-11-21 12:36:57 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-11-21 12:36:57 [scrapy.core.engine] INFO: Spider opened
2024-11-21 12:37:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fc7dbb220c0>
[s]   item       {}
[s]   request    <GET https://quotes.toscrape.com/page/1/>
[s]   response   <200 https://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fc7dbad0290>
[s]   spider     <DefaultSpider 'default' at 0x7fc7db60dbb0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>> response.css("title")
[<Selector query='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

When using scrapy shell, your spider code should be irrelevant. I’m using Linux, but the traceback does not look OS-specific. It looks like something like str(b'utf8') is being passed as the encoding somewhere (that is how you would end up with 'b'utf8''). But I find it hard to believe that this happens with the latest versions of Scrapy and parsel without any encoding-related user input.
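A minimal sketch of how that garbled encoding name can arise: calling str() on a bytes object does not decode it, it produces its repr, complete with the b'...' wrapper. If that string is then used as an encoding name, the lookup fails exactly like in the traceback above.

```python
import codecs

# str() on bytes is a repr conversion, not a decode.
bad_name = str(b"utf8")
print(bad_name)  # b'utf8'

# Feeding that string to an encoding lookup reproduces the error shape.
try:
    codecs.lookup(bad_name)
except LookupError as e:
    print(e)  # unknown encoding: b'utf8'
```

The fix for that half of the problem is `b"utf8".decode()` (or not converting at all), not `str()`.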

In any case, I cannot reproduce the issue with the provided code.

Please upgrade parsel and Scrapy and try again.

If the problem persists, see if you can reproduce the issue by installing Scrapy in a Python virtual environment instead of using conda.

(Several comments by @wRAR and @kourosh-amiri-sani were marked as off-topic and hidden.)

@ChrMosau

ChrMosau commented Dec 2, 2024

I have the same situation as @jh-rheinhardt: I installed scrapy via conda-forge on a Windows 11 system and got the same error.

I tested a second environment (inspired by @kourosh-amiri-sani) where I did not install scrapy when creating the environment (i.e. not from conda-forge), but via pip after activating the environment, and it worked.

@wRAR
Member

wRAR commented Dec 2, 2024

There are two problems here. First, the incorrect bytes->str conversion in lxml when printing the error message (and only then). Second, lxml/libxml2 in conda(?) not knowing the utf8 encoding (as opposed to e.g. utf-8).
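The second problem can be illustrated from the Python side: Python's own codec registry treats "utf8" as a registered alias that normalizes to the canonical "utf-8" codec, so pure-Python code never notices the spelling. libxml2 resolves encoding names through its own table (xmlFindCharEncodingHandler), which need not accept the same aliases, which would explain why one build of the library fails where another succeeds.

```python
import codecs

# Python accepts several spellings; all normalize to the same codec.
for spelling in ("utf8", "UTF8", "utf-8", "U8"):
    print(spelling, "->", codecs.lookup(spelling).name)  # all print utf-8

# libxml2 has no obligation to honor these Python-side aliases, so the
# string passed across the lxml boundary should be the canonical name.
```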

@wRAR wRAR transferred this issue from scrapy/scrapy Dec 2, 2024
@wRAR
Member

wRAR commented Dec 2, 2024

The default encoding in parsel.selector.create_root_node() is "utf8" (since that code was added in 2012). lxml (in lxml.etree._BaseParser.__init__()) passes that directly to libxml2's xmlFindCharEncodingHandler(), and the result apparently depends on various things we can't control.

While we could change this to some safer default ("utf-8", probably), I'm not sure what the correct, robust way to do this is. Note that we also use the same encoding argument in str.encode().
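A hypothetical sketch (not the actual parsel code) of one robust option: normalize whatever encoding name is received through Python's codec registry before handing it to lxml, so aliases like "utf8" or "UTF8" become the canonical name that libxml2 reliably understands, with a fallback for names Python itself does not know.

```python
import codecs

def normalize_encoding(name: str, fallback: str = "utf-8") -> str:
    """Resolve an encoding name to its canonical Python codec name,
    falling back to a safe default for unknown names (illustrative
    helper, not part of parsel's API)."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return fallback

print(normalize_encoding("utf8"))     # utf-8
print(normalize_encoding("latin1"))   # iso8859-1
print(normalize_encoding("no-such"))  # utf-8 (fallback)
```

The same canonical name would then be safe for both the lxml parser constructor and the str.encode() call.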

@Gallaecio
Member

While we could change this to some safer default ("utf-8" probably)

👍

@longlong211

change the env
