
Link Discovery given a known source #72

Open
marianelamin opened this issue Apr 24, 2021 · 4 comments
Labels: documentation, help wanted

Comments

marianelamin (Collaborator) commented Apr 24, 2021

Problem

At some point it would be desirable to scrape data of interest given only the base domain/URL of the source.

Proposal

This issue is meant to address the design of the component that would find the full URL within a website of interest and return it, so it can be used to scrape more data.
In short: given the base URL, return a full-path URL that takes us directly to the page that contains interesting data. For example:

elpitazo.net
    |
    v
| full path retrieval | -> https://elpitazo.net/los-llanos/el-gas-domestico-en-acarigua-araure-cuesta-entre-10-y-20-dolares/
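
As a rough illustration, here is a minimal sketch of what that full-path retrieval step could look like, assuming Python with the `requests` and `beautifulsoup4` packages. The function name and the "path with more than one segment" heuristic for interesting links are hypothetical placeholders, not project code:

```python
# Sketch of the full-path retrieval step (hypothetical helper, not project code).
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def discover_full_paths(base_url: str) -> list[str]:
    """Fetch the landing page of base_url and return same-domain links
    that look like full article paths."""
    response = requests.get(base_url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    base_domain = urlparse(base_url).netloc
    found = []
    for anchor in soup.find_all("a", href=True):
        url = urljoin(base_url, anchor["href"])
        parsed = urlparse(url)
        # Keep only same-domain links with a non-trivial path,
        # e.g. /los-llanos/el-gas-domestico-en-acarigua-araure-...
        if parsed.netloc == base_domain and parsed.path.count("/") >= 2:
            found.append(url)
    return list(dict.fromkeys(found))  # dedupe, preserving order


# Example: discover_full_paths("https://elpitazo.net")
# -> ["https://elpitazo.net/los-llanos/el-gas-domestico-...", ...]
```

The real component would likely need a smarter notion of "interesting" (e.g. the classifier mentioned below), but the shape of the interface would be the same: base URL in, full-path URLs out.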

Sub-Objectives

TBD

marianelamin added the documentation and help wanted labels Apr 24, 2021
asciidiego (Collaborator) commented Apr 24, 2021

Just to make sure: is this a scraping or a crawling micro-project?

- If it's only getting "interesting" links from a given domain, it's a crawler.
- If it's getting data from a (single) URL, it's a scraper.

If it's a combination of the two, I would separate it into two issues to keep each one tightly scoped.

marianelamin (Collaborator, Author) commented Apr 25, 2021

> Just to make sure: is this a scraping or a crawling micro-project?
>
> - If it's only getting "interesting" links from a given domain, it's a crawler.
> - If it's getting data from a (single) URL, it's a scraper.
>
> If it's a combination of the two, I would separate it into two issues to keep each one tightly scoped.

At this stage of the project, our first goal is to be able to scrape data only from those websites whose URLs exist in our current list (in our CSV file). So something like this (which is, by the way, already under construction):

http://some-domain.net/full-path-that-takes-us-to-the-page-that-contains-interesting-data
    | 
    v
| scrape | -> { result: content }
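
For illustration, a minimal sketch of that scrape step under the same assumptions as above (Python with `requests` and `beautifulsoup4`); the paragraph-based extraction and the function name are hypothetical simplifications, and the `{ result: content }` shape just mirrors the diagram:

```python
# Sketch of the scrape step (hypothetical helper, not project code).
import requests
from bs4 import BeautifulSoup


def scrape(full_url: str) -> dict:
    """Fetch a known article URL and return its text content."""
    response = requests.get(full_url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Naive content extraction: join the text of all paragraphs.
    content = "\n".join(p.get_text(strip=True) for p in soup.find_all("p"))
    return {"result": content}
```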

We are relying on the CSV file, and that is good enough for now. However, in the not-too-distant future (once our classifier is in good shape and trained), it would add value to be able to find recent posts from sources we already know, extract the data, and classify it.

This issue is meant to address the design of the component that would find the full URL within a website of interest and return it, so it can be used to scrape more data.
In short: given the base URL, return a full-path URL that takes us directly to the page that contains interesting data. For example:

elpitazo.net
    |
    v
| full path retrieval | -> https://elpitazo.net/los-llanos/el-gas-domestico-en-acarigua-araure-cuesta-entre-10-y-20-dolares/

After writing it out, though, this does sound like a crawler.
wdyt?

asciidiego (Collaborator) commented
Crawler indeed!

It is important that the crawling project is separated from the scraping project. Two different issues :)

marianelamin (Collaborator, Author) commented
> Crawler indeed!
>
> It is important that the crawling project is separated from the scraping project. Two different issues :)

Totally agree! Will keep the issue in the backlog, but move it out of the scraper project.
