Link Discovery given a known source #72
Comments
To make sure: is this a scraping or a crawling micro-project? If it only gets "interesting" links from a given domain, it would be a crawler. If it's a combination of the two, I would separate it into two issues, to micro-scope it.
At this stage of the project, our first goal is to be able to scrape data only from those websites whose URLs exist in our current list (in our csv file). So something like this: (which is, by the way, already under construction)
We are relying on the csv file, and this is good for now. However, in the not-too-distant future (once our classifier is in good shape and trained), it would add value to be able to find current posts (from sources we already know), extract the data, and classify it. This issue is meant to address the design of the component that would find the full URL within a website of interest and return it so it can be used to scrape more data.
After writing it, this might sound like a crawler though.
Crawler indeed! It is important that the crawling project is separated from the scraping project. Two different issues :)
Totally agree! Will keep the issue in the backlog, but move it out of the scraper project.
Problem
At some point it would be desirable to scrape data of interest given only the base domain/URL of the source.
Proposal
This issue is meant to address the design of the component that would find the full URL within a website of interest and return it so it can be used to scrape more data.
In short: given the base URL, return a full-path URL that would take us directly to the page that contains interesting data. For example:
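As a rough illustration of what such a component might look like, here is a minimal sketch that uses only the Python standard library. Everything in it is an assumption made for illustration: the function name `discover_links`, the idea of matching "interesting" pages by keywords in the URL path, and the restriction to links on the same host are hypothetical and not part of the project's actual design. Fetching the page is left out; the sketch takes the HTML as a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _AnchorCollector(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def discover_links(base_url, html, keywords):
    """Return absolute, same-host URLs from `html` whose path contains a keyword.

    Hypothetical helper: keyword matching on the path is only a stand-in
    for whatever notion of "interesting" the project settles on.
    """
    collector = _AnchorCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc

    found = []
    for href in collector.hrefs:
        full_url = urljoin(base_url, href)  # resolve relative links
        parsed = urlparse(full_url)
        if parsed.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, etc.
        if parsed.netloc != base_host:
            continue  # stay within the known source
        path = parsed.path.lower()
        if any(k.lower() in path for k in keywords) and full_url not in found:
            found.append(full_url)
    return found
```

Calling `discover_links("https://example.org/", page_html, ["news"])` would return the full URLs of same-host links whose path contains "news", which could then be handed to the scraper. The placeholder domain `example.org` and the keyword are illustrative only.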
Sub-Objectives
TBD