scrapy-examples

Multifarious Scrapy examples (alexa, amazon, douban, douyu, github, linkedin, etc.) with integrated proxies and user agents, making it comfortable to write a spider.

Don't use it to do anything illegal!


Real spider example: doubanbook

Tutorial

git clone https://github.com/geekan/scrapy-examples
cd scrapy-examples/doubanbook
scrapy crawl doubanbook

Depth

The spider works at several depths and extracts the real data at depth 2; a rough sketch follows the list below.

  • Depth 0: the entrance is http://book.douban.com/tag/
  • Depth 1: URLs like http://book.douban.com/tag/外国文学, collected from depth 0
  • Depth 2: URLs like http://book.douban.com/subject/1770782/, collected from depth 1
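
For orientation, the sketch below shows how those three depths could map onto plain CrawlSpider rules. It is not the repository's actual doubanbook spider; the class name, the allow regexes, and the parse_book callback are assumptions for illustration.

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class DoubanBookSketchSpider(CrawlSpider):
    name = "doubanbook_sketch"
    allowed_domains = ["book.douban.com"]
    # Depth 0: the tag index page.
    start_urls = ["http://book.douban.com/tag/"]
    rules = [
        # Depth 1: follow tag listing pages such as /tag/外国文学.
        Rule(LinkExtractor(allow=[r"/tag/[^/]+$"]), follow=True),
        # Depth 2: parse individual book pages such as /subject/1770782/.
        Rule(LinkExtractor(allow=[r"/subject/\d+/$"]), callback="parse_book"),
    ]

    def parse_book(self, response):
        # Depth 2 is where the real data is extracted.
        yield {
            "url": response.url,
            "title": response.css("h1 span::text").extract_first(),
        }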

Example image: douban book


Available Spiders

  • tutorial
    • dmoz_item
    • douban_book
    • page_recorder
    • douban_tag_book
  • doubanbook
  • linkedin
  • hrtencent
  • sis
  • zhihu
  • alexa
    • alexa
    • alexa.cn

Advanced

  • Use parse_with_rules to write a spider quickly.
    See the dmoz spider below for more details.

  • Proxies

    • If you don't want to use a proxy, just comment out the proxy middleware in settings.py.
    • If you want to customize it, hack misc/proxy.py yourself (see the sketch after this list).
  • Notice

    • Don't use parse as your callback method name; it's an internal method of CrawlSpider.
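
If you do customize the proxying, the standard Scrapy mechanism is a downloader middleware that sets request.meta['proxy']. The sketch below is only a minimal illustration, not what misc/proxy.py actually contains; the class name and the PROXIES list are hypothetical.

import random

# Hypothetical proxy list; replace with your own proxies.
PROXIES = [
    "http://127.0.0.1:8080",
    "http://127.0.0.1:8081",
]

class RandomProxyMiddleware(object):
    def process_request(self, request, spider):
        # Attach a randomly chosen proxy to every outgoing request.
        request.meta["proxy"] = random.choice(PROXIES)

Enable it (and disable the bundled one) through DOWNLOADER_MIDDLEWARES in settings.py.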

Advanced Usage

  • Run ./startproject.sh <PROJECT> to start a new project.
    It automatically generates most of the boilerplate; the only files left for you to write are:
    • PROJECT/PROJECT/items.py
    • PROJECT/PROJECT/spider/spider.py

Example to hack items.py and spider.py

Hacked items.py with additional fields url and description:

from scrapy.item import Item, Field

class exampleItem(Item):
    url = Field()
    name = Field()
    description = Field()

Hacked spider.py with start rules and css rules (only the class exampleSpider is shown here):

class exampleSpider(CommonSpider):
    # Imports are omitted here: CommonSpider and parse_with_rules live in
    # misc/spider.py, info is the project's logging helper, and sle is its
    # link-extractor shorthand.
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/",
    ]
    # The crawler starts from start_urls and follows the links allowed by the rules below.
    rules = [
        # The callback is named parse_page rather than parse, per the notice above.
        Rule(sle(allow=["/Arts/", "/Games/"]), callback='parse_page', follow=True),
    ]

    css_rules = {
        '.directory-url li': {
            '__use': 'dump',   # dump the extracted data directly
            '__list': True,    # the selector matches a list of nodes
            'url': 'li > a::attr(href)',
            'name': 'a::text',
            'description': 'li::text',
        }
    }

    def parse_page(self, response):
        info('Parse ' + response.url)
        # parse_with_rules is implemented here:
        #   https://github.com/geekan/scrapy-examples/blob/master/misc/spider.py
        return self.parse_with_rules(response, self.css_rules, exampleItem)
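
Once the project is generated, the spider runs like any other Scrapy project; for example (the output filename is just an illustration):

scrapy crawl dmoz -o items.json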
