
Why are more than 80% of the scraped items books? #10

Closed
Jiawen-Yan opened this issue Feb 28, 2018 · 15 comments

Comments

@Jiawen-Yan

After running the spider continuously for a day, I got several hundred thousand deduplicated products, but more than 80% of them are books.

The log also contains many "not book item" parse errors.

Could you explain why? Thanks.

@ramsayleung
Owner

ramsayleung commented Feb 28, 2018

The many "not book item" parse errors in the log appear because different JD subdomains serve pages with different layouts, so the parsing rules differ: different book pages may need different parsing, and books are parsed differently again from global-purchase items. At the time I could only write a separate parsing strategy for each page type I had actually encountered, so any new layout that shows up later is simply logged. See https://github.com/samrayleung/jd_spider/blob/bcb8ba6eaae10e09cda105fe2d9de153c004e77d/jd/jd/spiders/jd.py#L65. As for why 80% of the items are books, that did not seem to be a problem when I originally wrote the project; it looks like the code needs to be rewritten and optimized, see https://github.com/samrayleung/jd_spider/blob/master/README.org#todo :)
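The idea, roughly sketched (the method names and URL checks below are made up for illustration and are not the actual code in jd.py):

# Sketch only: dispatch to a per-layout parser, log layouts we don't recognize.
# parse_book / parse_global_purchase are hypothetical method names.
def parse_item(self, response):
    if "book.jd.com" in response.url:
        return self.parse_book(response)
    elif "jd.hk" in response.url:
        return self.parse_global_purchase(response)
    else:
        # A layout no parser has been written for yet: just record it.
        self.logger.error("not book item, unknown layout: %s", response.url)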

@Jiawen-Yan
Author

Thanks a lot. However, I can still see the @class="parameter2 p-parameter-list" tag in the JD page source, yet the spiders don't seem to be able to parse it.

Kind reminders:
(1) Replacing https with http improves the success rate.
(2) Mobile jd.m pages cannot be parsed; those URLs can also be rewritten to http://item.jd.com/ProductID

I rewrote the pipeline to use sqlite3. For deduplication, I simply added a UNIQUE constraint in sqlite3; once it passes testing I will contribute the code.
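Roughly along these lines (the table and field names here are just for illustration, not the code I will contribute):

# Illustrative Scrapy pipeline: sqlite3 with a UNIQUE constraint for deduplication.
import sqlite3
from scrapy.exceptions import DropItem

class SQLitePipeline(object):
    def open_spider(self, spider):
        self.conn = sqlite3.connect("jd_items.db")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items ("
            "product_id TEXT UNIQUE, name TEXT, price TEXT)")

    def close_spider(self, spider):
        self.conn.commit()
        self.conn.close()

    def process_item(self, item, spider):
        try:
            self.conn.execute(
                "INSERT INTO items (product_id, name, price) VALUES (?, ?, ?)",
                (item.get("product_id"), item.get("name"), item.get("price")))
            self.conn.commit()
        except sqlite3.IntegrityError:
            # The UNIQUE constraint fired, so this item is a duplicate.
            raise DropItem("Duplicate item: %s" % item.get("product_id"))
        return item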

Thanks a lot

@ramsayleung
Owner

ramsayleung commented Feb 28, 2018

If you look at more JD pages, you will find that the HTML for product attributes can differ between products: delisted books and newly listed books have different attributes (publisher, author, and so on), and the attribute HTML for books differs again from that of phones and other goods, so parsing is quite painful. :(
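One rough way to cope with this (the selectors below are examples only, not the project's actual rules) is to try several candidate XPath expressions and take the first one that matches:

# Example only: different JD layouts put the parameter list under
# different class names, so try a few selectors in order.
def extract_parameters(response):
    candidates = (
        '//ul[@class="parameter2 p-parameter-list"]/li/text()',
        '//ul[contains(@class, "parameter2")]/li/text()',
        '//div[@class="p-parameter"]//li/text()',
    )
    for xpath in candidates:
        values = response.xpath(xpath).extract()
        if values:
            return [v.strip() for v in values]
    return []  # unknown layout; the caller can log it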

@ramsayleung
Owner

Regarding the parser errors: JD actually has product pages that cannot be accessed, for example https://cartv.jd.com/item/200105015622.html. Pages like that should simply be skipped, so I added a check to improve the page-parsing logic.
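Roughly like this (the exact condition in the project may differ; the selector here is only an example):

# Example guard: skip product pages that do not render as a normal item page.
def parse(self, response):
    if response.status != 200 or not response.xpath('//div[@class="sku-name"]'):
        self.logger.info("skipping inaccessible item page: %s", response.url)
        return
    # ... continue with normal parsing ...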

@websec123

Does this project still work now?

@ramsayleung
Owner

Did you actually try running the project before opening an issue?

@sunfeng90

When I run the project, I get the error: No module named statscol.graphite.

Specifically:

scrapy crawl jindong
2018-04-15 23:24:45 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: jd)
2018-04-15 23:24:45 [scrapy.utils.log] INFO: Versions: lxml 3.8.0.0, libxml2 2.9.4, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 11.1.0, Python 2.7.13 |Anaconda, Inc.| (default, Sep 21 2017, 17:38:20) - [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)], pyOpenSSL 17.2.0 (OpenSSL 1.0.2l 25 May 2017), cryptography 2.0.3, Platform Darwin-17.5.0-x86_64-i386-64bit
2018-04-15 23:24:45 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'jd.spiders', 'SPIDER_MODULES': ['jd.spiders'], 'CONCURRENT_REQUESTS_PER_DOMAIN': 16, 'DUPEFILTER_CLASS': 'scrapy_redis.dupefilter.RFPDupeFilter', 'CONCURRENT_REQUESTS': 32, 'BOT_NAME': 'jd', 'SCHEDULER': 'scrapy_redis.scheduler.Scheduler', 'STATS_CLASS': 'jd.statscol.graphite.RedisGraphiteStatsCollector'}
Traceback (most recent call last):
File "/Users/sunfeng/anaconda2/bin/scrapy", line 11, in
sys.exit(execute())
File "/Users/sunfeng/anaconda2/lib/python2.7/site-packages/scrapy/cmdline.py", line 150, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "/Users/sunfeng/anaconda2/lib/python2.7/site-packages/scrapy/cmdline.py", line 90, in _run_print_help
func(*a, **kw)
File "/Users/sunfeng/anaconda2/lib/python2.7/site-packages/scrapy/cmdline.py", line 157, in _run_command
cmd.run(args, opts)
File "/Users/sunfeng/anaconda2/lib/python2.7/site-packages/scrapy/commands/crawl.py", line 57, in run
self.crawler_process.crawl(spname, **opts.spargs)
File "/Users/sunfeng/anaconda2/lib/python2.7/site-packages/scrapy/crawler.py", line 170, in crawl
crawler = self.create_crawler(crawler_or_spidercls)
File "/Users/sunfeng/anaconda2/lib/python2.7/site-packages/scrapy/crawler.py", line 198, in create_crawler
return self._create_crawler(crawler_or_spidercls)
File "/Users/sunfeng/anaconda2/lib/python2.7/site-packages/scrapy/crawler.py", line 203, in _create_crawler
return Crawler(spidercls, self.settings)
File "/Users/sunfeng/anaconda2/lib/python2.7/site-packages/scrapy/crawler.py", line 41, in init
self.stats = load_object(self.settings['STATS_CLASS'])(self)
File "/Users/sunfeng/anaconda2/lib/python2.7/site-packages/scrapy/utils/misc.py", line 44, in load_object
mod = import_module(module)
File "/Users/sunfeng/anaconda2/lib/python2.7/importlib/init.py", line 37, in import_module
__import__(name)
ImportError: No module named statscol.graphite

@ramsayleung
Owner

You are using Python 2, but this project targets Python 3. Also, before running it, please make sure you have installed the required dependencies.

@Jiawen-Yan
Author

@sunfeng90 If you don't actually need statscol.graphite, comment out the relevant code in the settings and make a few small syntax changes; the project can then run on both Python 2 and Python 3.
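Concretely, that means commenting out the stats-collector line in jd/settings.py (the setting value below is taken from the log above); Scrapy then falls back to its default in-memory stats collector:

# jd/settings.py - disable the Graphite stats collector if you don't need it
# STATS_CLASS = 'jd.statscol.graphite.RedisGraphiteStatsCollector'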

@Jiawen-Yan
Author

By design the code defaults to something like a depth-first strategy, so a short crawl tends to get stuck in a single product category. If you crawl for more than 1000 hours, this is basically not a problem.

If you want to collect a variety of product types in a short time, you can manually restrict the range of product category codes.
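As a rough illustration only (the URL pattern and category ids below are invented, not taken from the project):

# Illustration only: follow a category listing page only if its top-level
# category id is in an allow-list, so the crawl covers several product types.
import re

ALLOWED_TOP_CATEGORIES = {652, 670, 737, 1315}  # example ids, not real config

def should_follow_category(url):
    match = re.search(r"cat=(\d+)", url)  # e.g. list.jd.com/list.html?cat=652,...
    return bool(match) and int(match.group(1)) in ALLOWED_TOP_CATEGORIES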

@sunfeng90

@CHARLESYAN1 How should I modify it?

@xudong-jason

Hi, I'd like to ask about data storage in the spider. The pipelines file doesn't specify a table name, so how does the data get stored in MongoDB? Do I need to create the database and collection in MongoDB first?

@ramsayleung
Owner

Hi, I'd like to ask about data storage in the spider. The pipelines file doesn't specify a table name, so how does the data get stored in MongoDB? Do I need to create the database and collection in MongoDB first?

You didn't read the code carefully enough ~
https://github.com/samrayleung/jd_spider/blob/9f8ada632cb5fd5a31dccdbb0fd49b33f03762ca/jd/jd/pipelines.py#L26

https://github.com/samrayleung/jd_spider/blob/9f8ada632cb5fd5a31dccdbb0fd49b33f03762ca/jd/jd/settings.py#L80
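For context, a typical Scrapy MongoDB pipeline is roughly shaped like the sketch below: MongoDB creates the database and collection lazily on the first insert, so nothing has to be created beforehand. The setting names and defaults here are illustrative; see the linked pipelines.py and settings.py for the actual code.

# Illustrative MongoDB pipeline: the database and collection are created
# lazily by MongoDB on the first insert, so no manual setup is required.
import pymongo

class MongoDBPipeline(object):
    def __init__(self, mongo_uri, mongo_db, collection):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db
        self.collection = collection

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get("MONGODB_URI", "mongodb://localhost:27017"),
            mongo_db=crawler.settings.get("MONGODB_DB", "jd"),
            collection=crawler.settings.get("MONGODB_COLLECTION", "products"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.db[self.collection].insert_one(dict(item))
        return item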

@xudong-jason

OK, that problem is solved now, thanks a lot! Now I'm running into an issue with the comment spider; it throws an error, please take a look. Thanks!
['jd_comment.pipelines.MongoDBPipeline', 'scrapy_redis.pipelines.RedisPipeline']
2018-04-26 11:56:49 [scrapy.core.engine] INFO: Spider opened
2018-04-26 11:56:50 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-04-26 11:56:50 [jd_comment] INFO: Spider opened: jd_comment
2018-04-26 11:56:50 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2018-04-26 11:56:50 [scrapy.core.engine] INFO: Closing spider (finished)
2018-04-26 11:56:50 [scrapy.core.engine] ERROR: Stats close failure
Traceback (most recent call last):
File "C:\Users\lxd\Anaconda3\lib\site-packages\twisted\internet\defer.py", line 653, in runCallbacks
current.result = callback(current.result, *args, **kw)
File "C:\Users\lxd\Anaconda3\lib\site-packages\scrapy\core\engine.py", line 321, in
dfd.addBoth(lambda : self.crawler.stats.close_spider(spider, reason=reason))
File "C:\Users\lxd\PycharmProjects\jd_spider-1\jd_comment\jd_comment\statscol\graphite.py", line 147, in close_spider
spider=spider)
File "C:\Users\lxd\Anaconda3\lib\logging_init
.py", line 1885, in debug
root.debug(msg, *args, **kwargs)
File "C:\Users\lxd\Anaconda3\lib\logging_init
.py", line 1294, in debug
self._log(DEBUG, msg, args, **kwargs)
TypeError: _log() got an unexpected keyword argument 'spider'
2018-04-26 11:56:50 [scrapy.core.engine] INFO: Spider closed (finished)

@ramsayleung
Owner

I suggest opening a new issue instead of "hitching a ride" on this one ~
