
LookupError: unknown encoding: 'b'utf8'' #307

Closed
jh-rheinhardt opened this issue Nov 20, 2024 · 12 comments · Fixed by #308

@jh-rheinhardt

Installed Scrapy using conda on a Windows 11 machine. While working through the tutorial,

response.css("title")

gives this error:

File "src\lxml\parser.pxi", line 1806, in lxml.etree.HTMLParser.init
File "src\lxml\parser.pxi", line 858, in lxml.etree._BaseParser.init
LookupError: unknown encoding: 'b'utf8''

scrapy 2.12.0 py311h1ea47a8_1 conda-forge
twisted 23.10.0 py311haa95532_0
lxml 5.2.1 py311h395c83e_1

@Gallaecio
Member

Could you provide the entire spider code that produces the error, and the full traceback? (You shared only its last 3 lines.)

@jh-rheinhardt
Author

jh-rheinhardt commented Nov 20, 2024

from pathlib import Path

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
        "https://quotes.toscrape.com/page/2/",
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        Path(filename).write_bytes(response.body)

> scrapy shell "https://quotes.toscrape.com/page/1/"
>>> response.css("title")


Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "C:\...\anaconda3\envs\WebScraping\Lib\site-packages\scrapy\http\response\text.py", line 168, in css
    return cast(SelectorList, self.selector.css(query))
                              ^^^^^^^^^^^^^
  File "C:\...\anaconda3\envs\WebScraping\Lib\site-packages\scrapy\http\response\text.py", line 147, in selector
    self._cached_selector = Selector(self)
                            ^^^^^^^^^^^^^^
  File "C:\...\anaconda3\envs\WebScraping\Lib\site-packages\scrapy\selector\unified.py", line 102, in __init__
    super().__init__(text=text, type=st, **kwargs)
  File "C:\...\anaconda3\envs\WebScraping\Lib\site-packages\parsel\selector.py", line 496, in __init__
    root, type = _get_root_and_type_from_text(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\...\anaconda3\envs\WebScraping\Lib\site-packages\parsel\selector.py", line 377, in _get_root_and_type_from_text
    root = _get_root_from_text(text, type=type, **lxml_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\...\anaconda3\envs\WebScraping\Lib\site-packages\parsel\selector.py", line 329, in _get_root_from_text
    return create_root_node(text, _ctgroup[type]["_parser"], **lxml_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\...\anaconda3\envs\WebScraping\Lib\site-packages\parsel\selector.py", line 110, in create_root_node
    parser = parser_cls(recover=True, encoding=encoding, huge_tree=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\...\anaconda3\envs\WebScraping\Lib\site-packages\lxml\html\__init__.py", line 1887, in __init__
    super().__init__(**kwargs)
  File "src\lxml\parser.pxi", line 1806, in lxml.etree.HTMLParser.__init__
  File "src\lxml\parser.pxi", line 858, in lxml.etree._BaseParser.__init__
LookupError: unknown encoding: 'b'utf8''

@Gallaecio
Member

Gallaecio commented Nov 21, 2024

I cannot reproduce the issue with Scrapy 2.12:

$ scrapy shell "https://quotes.toscrape.com/page/1/"
2024-11-21 12:36:57 [scrapy.utils.log] INFO: Scrapy 2.12.0 started (bot: scrapybot)
2024-11-21 12:36:57 [scrapy.utils.log] INFO: Versions: lxml 5.3.0.0, libxml2 2.12.9, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.10.0, Python 3.12.7 (main, Oct  1 2024, 11:15:50) [GCC 14.2.1 20240910], pyOpenSSL 24.2.1 (OpenSSL 3.3.2 3 Sep 2024), cryptography 43.0.3, Platform Linux-6.11.8-zen1-2-zen-x86_64-with-glibc2.40
2024-11-21 12:36:57 [scrapy.addons] INFO: Enabled addons:
[]
2024-11-21 12:36:57 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2024-11-21 12:36:57 [scrapy.extensions.telnet] INFO: Telnet Password: 7f709a2a4ad7bf38
2024-11-21 12:36:57 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2024-11-21 12:36:57 [scrapy.crawler] INFO: Overridden settings:
{'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
 'LOGSTATS_INTERVAL': 0}
2024-11-21 12:36:57 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-11-21 12:36:57 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-11-21 12:36:57 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-11-21 12:36:57 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-11-21 12:36:57 [scrapy.core.engine] INFO: Spider opened
2024-11-21 12:37:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fc7dbb220c0>
[s]   item       {}
[s]   request    <GET https://quotes.toscrape.com/page/1/>
[s]   response   <200 https://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fc7dbad0290>
[s]   spider     <DefaultSpider 'default' at 0x7fc7db60dbb0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects 
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>> response.css("title")
[<Selector query='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]

When using scrapy shell, your spider code should be irrelevant. I’m using Linux, but the traceback does not look OS-specific. It looks like something like str(b'utf8') is being passed as the encoding somewhere (that is how you would end up with 'b'utf8''). But I find it hard to believe that this happens with the latest versions of Scrapy and parsel without any encoding-related user input.
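A minimal sketch of how that garbled encoding name can arise: calling str() on a bytes object does not decode it, it produces its repr, complete with the b'...' wrapper. If that string is then used as an encoding name, the lookup fails exactly like in the traceback above.

```python
import codecs

# str() on bytes is a repr conversion, not a decode.
bad_name = str(b"utf8")
print(bad_name)  # b'utf8'

# Feeding that string to an encoding lookup reproduces the error shape.
try:
    codecs.lookup(bad_name)
except LookupError as e:
    print(e)  # unknown encoding: b'utf8'
```

The fix for that half of the problem is `b"utf8".decode()` (or not converting at all), not `str()`.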

In any case, I cannot reproduce the issue with the provided code.

Please upgrade parsel and Scrapy and try again.

If the problem persists, see if you can reproduce the issue by installing Scrapy in a Python virtual environment instead of using conda.

(Several comments by @wRAR and @kourosh-amiri-sani were marked as off-topic and hidden.)

@ChrMosau

ChrMosau commented Dec 2, 2024

I have the same situation as @jh-rheinhardt: I installed scrapy via conda-forge on a Windows 11 system and got the same error.

I tested a second environment (inspired by @kourosh-amiri-sani) where I did not install scrapy when creating the environment (i.e. not from conda-forge), but via pip after activating the environment, and it worked.

@wRAR
Member

wRAR commented Dec 2, 2024

There are two problems here. First, the incorrect bytes->str conversion in lxml when printing the error message (and only then). Second, lxml/libxml2 in conda(?) not knowing the utf8 encoding (as opposed to e.g. utf-8).
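The second problem can be illustrated from the Python side: Python's own codec registry treats "utf8" as a registered alias that normalizes to the canonical "utf-8" codec, so pure-Python code never notices the spelling. libxml2 resolves encoding names through its own table (xmlFindCharEncodingHandler), which need not accept the same aliases, which would explain why one build of the library fails where another succeeds.

```python
import codecs

# Python accepts several spellings; all normalize to the same codec.
for spelling in ("utf8", "UTF8", "utf-8", "U8"):
    print(spelling, "->", codecs.lookup(spelling).name)  # all print utf-8

# libxml2 has no obligation to honor these Python-side aliases, so the
# string passed across the lxml boundary should be the canonical name.
```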

@wRAR wRAR transferred this issue from scrapy/scrapy Dec 2, 2024
@wRAR
Member

wRAR commented Dec 2, 2024

The default encoding in parsel.selector.create_root_node() is "utf8" (since that code was added in 2012). lxml (in lxml.etree._BaseParser.__init__()) passes that directly to libxml2's xmlFindCharEncodingHandler(), and the result apparently depends on various things we can't control.

While we could change this to some safer default ("utf-8", probably), I'm not sure what the correct, robust way to do this is. Note that we also use the same encoding argument in str.encode().
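A hypothetical sketch (not the actual parsel code) of one robust option: normalize whatever encoding name is received through Python's codec registry before handing it to lxml, so aliases like "utf8" or "UTF8" become the canonical name that libxml2 reliably understands, with a fallback for names Python itself does not know.

```python
import codecs

def normalize_encoding(name: str, fallback: str = "utf-8") -> str:
    """Resolve an encoding name to its canonical Python codec name,
    falling back to a safe default for unknown names (illustrative
    helper, not part of parsel's API)."""
    try:
        return codecs.lookup(name).name
    except LookupError:
        return fallback

print(normalize_encoding("utf8"))     # utf-8
print(normalize_encoding("latin1"))   # iso8859-1
print(normalize_encoding("no-such"))  # utf-8 (fallback)
```

The same canonical name would then be safe for both the lxml parser constructor and the str.encode() call.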

@Gallaecio
Member

While we could change this to some safer default ("utf-8" probably)

👍

@longlong211

change the env
