LookupError: unknown encoding: 'b'utf8'' #307
Could you provide the entire spider code that produces the error, and the entire traceback (of which you shared only the last 3 lines)?
```python
from pathlib import Path

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
        "https://quotes.toscrape.com/page/2/",
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = f"quotes-{page}.html"
        Path(filename).write_bytes(response.body)
```

```
> scrapy shell "https://quotes.toscrape.com/page/1/"
>>> response.css("title")
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "C:\...\anaconda3\envs\WebScraping\Lib\site-packages\scrapy\http\response\text.py", line 168, in css
    return cast(SelectorList, self.selector.css(query))
                              ^^^^^^^^^^^^^
  File "C:\...\anaconda3\envs\WebScraping\Lib\site-packages\scrapy\http\response\text.py", line 147, in selector
    self._cached_selector = Selector(self)
                            ^^^^^^^^^^^^^^
  File "C:\...\anaconda3\envs\WebScraping\Lib\site-packages\scrapy\selector\unified.py", line 102, in __init__
    super().__init__(text=text, type=st, **kwargs)
  File "C:\...\anaconda3\envs\WebScraping\Lib\site-packages\parsel\selector.py", line 496, in __init__
    root, type = _get_root_and_type_from_text(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\...\anaconda3\envs\WebScraping\Lib\site-packages\parsel\selector.py", line 377, in _get_root_and_type_from_text
    root = _get_root_from_text(text, type=type, **lxml_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\...\anaconda3\envs\WebScraping\Lib\site-packages\parsel\selector.py", line 329, in _get_root_from_text
    return create_root_node(text, _ctgroup[type]["_parser"], **lxml_kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\...\anaconda3\envs\WebScraping\Lib\site-packages\parsel\selector.py", line 110, in create_root_node
    parser = parser_cls(recover=True, encoding=encoding, huge_tree=True)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\...\anaconda3\envs\WebScraping\Lib\site-packages\lxml\html\__init__.py", line 1887, in __init__
    super().__init__(**kwargs)
  File "src\lxml\parser.pxi", line 1806, in lxml.etree.HTMLParser.__init__
  File "src\lxml\parser.pxi", line 858, in lxml.etree._BaseParser.__init__
LookupError: unknown encoding: 'b'utf8''
```
I cannot reproduce the issue with Scrapy 2.12:

```
$ scrapy shell "https://quotes.toscrape.com/page/1/"
2024-11-21 12:36:57 [scrapy.utils.log] INFO: Scrapy 2.12.0 started (bot: scrapybot)
2024-11-21 12:36:57 [scrapy.utils.log] INFO: Versions: lxml 5.3.0.0, libxml2 2.12.9, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.10.0, Python 3.12.7 (main, Oct  1 2024, 11:15:50) [GCC 14.2.1 20240910], pyOpenSSL 24.2.1 (OpenSSL 3.3.2 3 Sep 2024), cryptography 43.0.3, Platform Linux-6.11.8-zen1-2-zen-x86_64-with-glibc2.40
2024-11-21 12:36:57 [scrapy.addons] INFO: Enabled addons:
[]
2024-11-21 12:36:57 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2024-11-21 12:36:57 [scrapy.extensions.telnet] INFO: Telnet Password: 7f709a2a4ad7bf38
2024-11-21 12:36:57 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage']
2024-11-21 12:36:57 [scrapy.crawler] INFO: Overridden settings:
{'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
 'LOGSTATS_INTERVAL': 0}
2024-11-21 12:36:57 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-11-21 12:36:57 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-11-21 12:36:57 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2024-11-21 12:36:57 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-11-21 12:36:57 [scrapy.core.engine] INFO: Spider opened
2024-11-21 12:37:02 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://quotes.toscrape.com/page/1/> (referer: None)
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x7fc7dbb220c0>
[s]   item       {}
[s]   request    <GET https://quotes.toscrape.com/page/1/>
[s]   response   <200 https://quotes.toscrape.com/page/1/>
[s]   settings   <scrapy.settings.Settings object at 0x7fc7dbad0290>
[s]   spider     <DefaultSpider 'default' at 0x7fc7db60dbb0>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
>>> response.css("title")
[<Selector query='descendant-or-self::title' data='<title>Quotes to Scrape</title>'>]
```

In any case, I cannot reproduce the issue with the provided code. Please try upgrading parsel and Scrapy and try again. If the problem persists, see if you can reproduce the issue after installing Scrapy in a Python virtual environment instead of using conda.
I have the same situation as @jh-rheinhardt. I installed scrapy via conda. I tested a second environment (inspired by @kourosh-amiri-sani) where I did not install scrapy directly when creating the environment (i.e. without conda-forge), but via pip after activating the environment, and it worked.
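The pip-based workaround can be sketched as shell commands (a sketch; the environment name `scrapy-env` is an arbitrary choice, and on Windows the executables live under `scrapy-env\Scripts\` instead of `scrapy-env/bin/`):

```shell
# Create a clean virtual environment with the standard library's venv
# module, then install Scrapy from PyPI rather than from conda-forge.
python3 -m venv scrapy-env
./scrapy-env/bin/pip install --upgrade scrapy
./scrapy-env/bin/scrapy version
```

The point of the sketch is that pip pulls lxml wheels from PyPI, sidestepping the conda-forge builds that several commenters in this thread associate with the error.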
There are two problems here. First, the incorrect bytes->str conversion in lxml when printing the error message (and only then). Second, …
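The first problem — the bytes-to-str conversion — can be seen with plain Python, whose codec registry behaves analogously to libxml2's encoding lookup here (a sketch using only the standard library, not lxml's actual code path):

```python
import codecs

raw = b"utf8"

# Naively stringifying a bytes encoding name yields its repr, which is
# exactly the broken name quoted in the traceback: "b'utf8'".
print(str(raw))

# The properly decoded name is a valid codec alias...
print(codecs.lookup(raw.decode("ascii")).name)

# ...while the repr-style name is not, and fails the lookup.
try:
    codecs.lookup(str(raw))
except LookupError as exc:
    print(exc)
```

So the encoding value itself was salvageable; it is the conversion done while building the error message that produces the confusing `'b'utf8''` text.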
The default encoding in … While we could change this to some safer default (…),
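A defensive coercion along the lines discussed could look like this (a hypothetical sketch, not parsel's or lxml's actual code; `normalize_encoding` and the `fallback` parameter are invented for illustration):

```python
import codecs


def normalize_encoding(encoding, fallback="utf-8"):
    """Coerce a bytes encoding name to str and fall back if unknown.

    Hypothetical helper illustrating the 'safer default' idea from the
    discussion above; not actual library behaviour.
    """
    if isinstance(encoding, bytes):
        # Decode rather than str()-ify, avoiding the "b'utf8'" repr.
        encoding = encoding.decode("ascii", errors="replace")
    try:
        # Validate against the codec registry; return the canonical name.
        return codecs.lookup(encoding).name
    except LookupError:
        return fallback


print(normalize_encoding(b"utf8"))   # -> utf-8
print(normalize_encoding("bogus"))   # -> utf-8 (fallback)
```

With a guard like this, a bytes name coming from upstream would resolve normally, and a genuinely unknown name would degrade to a sane default instead of raising deep inside the parser constructor.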
change the env |
Installed using conda on a Windows 11 machine. Working through the tutorial, running

```
response.css("title")
```

gives this error:

```
File "src\lxml\parser.pxi", line 1806, in lxml.etree.HTMLParser.__init__
File "src\lxml\parser.pxi", line 858, in lxml.etree._BaseParser.__init__
LookupError: unknown encoding: 'b'utf8''
```

Installed versions:

```
scrapy   2.12.0   py311h1ea47a8_1   conda-forge
twisted  23.10.0  py311haa95532_0
lxml     5.2.1    py311h395c83e_1
```