
Link Discovery given a known source #72

Open
marianelamin opened this issue Apr 24, 2021 · 4 comments
Labels: documentation, help wanted

Comments

marianelamin (Collaborator) commented Apr 24, 2021

Problem

At some point it would be desirable to scrape data of interest given only the base domain/URL of the source.

Proposal

This issue is meant to address the design of the component that would find the full URL within a website of interest and return it, so it can be used to scrape more data.
In short: given the base URL, return a full-path URL that takes us directly to the page that contains interesting data. For example:

elpitazo.net
    |
    v
| full path retrieval | -> https://elpitazo.net/los-llanos/el-gas-domestico-en-acarigua-araure-cuesta-entre-10-y-20-dolares/
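
As a rough illustration, here is a minimal sketch of what that full-path retrieval step could look like, assuming Python with the `requests` and `beautifulsoup4` packages. The function name and the "path with more than one segment" heuristic for interesting links are hypothetical placeholders, not project code:

```python
# Sketch of the full-path retrieval step (hypothetical helper, not project code).
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def discover_full_paths(base_url: str) -> list[str]:
    """Fetch the landing page of base_url and return same-domain links
    that look like full article paths."""
    response = requests.get(base_url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    base_domain = urlparse(base_url).netloc
    found = []
    for anchor in soup.find_all("a", href=True):
        url = urljoin(base_url, anchor["href"])
        parsed = urlparse(url)
        # Keep only same-domain links with a non-trivial path,
        # e.g. /los-llanos/el-gas-domestico-en-acarigua-araure-...
        if parsed.netloc == base_domain and parsed.path.count("/") >= 2:
            found.append(url)
    return list(dict.fromkeys(found))  # dedupe, preserving order


# Example: discover_full_paths("https://elpitazo.net")
# -> ["https://elpitazo.net/los-llanos/el-gas-domestico-...", ...]
```

The real component would likely need a smarter notion of "interesting" (e.g. the classifier mentioned below), but the shape of the interface would be the same: base URL in, full-path URLs out.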

Sub-Objectives

TBD

marianelamin added the documentation and help wanted labels Apr 24, 2021
asciidiego (Collaborator) commented Apr 24, 2021

Just to make sure: is this a scraping or a crawling micro-project?

- If it's only getting "interesting" links from a given domain, it's a crawler.
- If it's getting data from a (single) URL, it's a scraper.

If it's a combination of the two, I would separate it into two issues to keep each one tightly scoped.

marianelamin (Collaborator, Author) commented Apr 25, 2021

> Just to make sure: is this a scraping or a crawling micro-project?
>
> - If it's only getting "interesting" links from a given domain, it's a crawler.
> - If it's getting data from a (single) URL, it's a scraper.
>
> If it's a combination of the two, I would separate it into two issues to keep each one tightly scoped.

At this stage of the project, our first goal is to be able to scrape data only from those websites whose URLs exist in our current list (in our CSV file). So something like this (which is, by the way, already under construction):

http://some-domain.net/full-path-that-takes-us-to-the-page-that-contains-interesting-data
    | 
    v
| scrape | -> { result: content }
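
For illustration, a minimal sketch of that scrape step under the same assumptions as above (Python with `requests` and `beautifulsoup4`); the paragraph-based extraction and the function name are hypothetical simplifications, and the `{ result: content }` shape just mirrors the diagram:

```python
# Sketch of the scrape step (hypothetical helper, not project code).
import requests
from bs4 import BeautifulSoup


def scrape(full_url: str) -> dict:
    """Fetch a known article URL and return its text content."""
    response = requests.get(full_url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Naive content extraction: join the text of all paragraphs.
    content = "\n".join(p.get_text(strip=True) for p in soup.find_all("p"))
    return {"result": content}
```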

We are relying on the CSV file, and that is good enough for now. However, in the not-too-distant future (once our classifier is in good shape and trained), it would add value to be able to find recent posts from sources we already know, extract the data, and classify it.

This issue is meant to address the design of the component that would find the full URL within a website of interest and return it, so it can be used to scrape more data.
In short: given the base URL, return a full-path URL that takes us directly to the page that contains interesting data. For example:

elpitazo.net
    |
    v
| full path retrieval | -> https://elpitazo.net/los-llanos/el-gas-domestico-en-acarigua-araure-cuesta-entre-10-y-20-dolares/

After writing it out, though, this does sound like a crawler.
wdyt?

asciidiego (Collaborator) commented
Crawler indeed!

It is important that the crawling project is separated from the scraping project. Two different issues :)

marianelamin (Collaborator, Author) commented
> Crawler indeed!
>
> It is important that the crawling project is separated from the scraping project. Two different issues :)

Totally agree! Will keep the issue in the backlog, but move it out of the scraper project.
