# Possible ways to implement a simple Spider
### Expose "scraper" object in the handler functions
```python
@select(css="a")
def get_link(element, scraper):  # <-- pass the scraper object
    url = element.get_attribute("href")
    scraper.follow(url)  # <-- add to the URLs that will be scraped by the scraper
    return {"url": url}
```
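For context, a minimal sketch of the queueing that `scraper.follow()` implies is shown below. The `urls` list and `_seen` set are assumptions about the internals, not dude's actual implementation.

```python
class Scraper:
    def __init__(self, urls):
        self.urls = list(urls)   # URLs still to be scraped
        self._seen = set(urls)   # guard against crawling the same URL twice

    def follow(self, url):
        # Queue a URL for scraping unless it was already seen.
        if url and url not in self._seen:
            self._seen.add(url)
            self.urls.append(url)
```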
### Include the URLs in the return value
```python
@select(css="a")
def get_link(element):
    url = element.get_attribute("href")
    return {"url": url}, [url, ...]  # <-- return a tuple of the dict result and a list of URLs to follow
```
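On the framework side, the tuple form could be unpacked roughly as follows. This dispatch code is hypothetical and reuses the `follow()` sketch above.

```python
result = get_link(element)
if isinstance(result, tuple):
    data, new_urls = result      # dict result plus the URLs to enqueue
    for url in new_urls:
        scraper.follow(url)      # queue follow-up URLs for crawling
else:
    data = result                # plain dict result, nothing to follow
```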
### Final implementation
Just use `--follow-urls` or pass `follow_urls=True` to `run()` (#90). This is simpler than managing the URLs to crawl yourself inside your code; a usage sketch follows the warning below.

WARNING: Do not use this until #27 is implemented, as the option will crawl indefinitely and will not save the data.
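A minimal sketch of the final API, assuming `run()` accepts a `urls=` list as in dude's documented entry point; the start URL and selector are illustrative, and `follow_urls=True` is the flag proposed above.

```python
from dude import run, select

@select(css="a")
def get_link(element):
    # Extract each link; with follow_urls enabled, discovered URLs
    # are crawled automatically instead of being queued by hand.
    return {"url": element.get_attribute("href")}

if __name__ == "__main__":
    run(urls=["https://example.com"], follow_urls=True)
```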