Automated org research tooling #100

rivernews · 2021-09-29T06:03:52Z

Develop a microservice that can research the following about an organization:

- Logo icon
- Size
- Year founded
- Industry
- Description
- HQ location
- Engineering team size and demographics: name, title
- Numeric ratings, including per-category breakdowns
- (Don't deal with qualitative review text at this point)

Then we can have a cron job POST the data to appl-tracky and display it in the UI.
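
A minimal sketch of what the payload for that POST could look like, using only the Go standard library. The struct, its field names, the JSON keys, and the endpoint URL are all assumptions for illustration; appl-tracky's actual API shape is not described in this issue.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Rating holds one numeric score, e.g. an overall or per-category rating.
type Rating struct {
	Category string  `json:"category"`
	Score    float64 `json:"score"`
}

// Engineer is one entry of the engineering-team demographic.
type Engineer struct {
	Name  string `json:"name"`
	Title string `json:"title"`
}

// OrgProfile mirrors the fields listed above. All names are hypothetical.
// omitempty lets a partially researched org be sent without empty fields.
type OrgProfile struct {
	Name        string     `json:"name"`
	LogoURL     string     `json:"logo_url,omitempty"`
	Size        string     `json:"size,omitempty"`
	FoundedYear int        `json:"founded_year,omitempty"`
	Industry    string     `json:"industry,omitempty"`
	Description string     `json:"description,omitempty"`
	HQLocation  string     `json:"hq_location,omitempty"`
	Engineers   []Engineer `json:"engineers,omitempty"`
	Ratings     []Rating   `json:"ratings,omitempty"`
}

// buildRequest encodes the profile as JSON and prepares a POST request.
func buildRequest(endpoint string, p OrgProfile) (*http.Request, error) {
	body, err := json.Marshal(p)
	if err != nil {
		return nil, fmt.Errorf("encode profile: %w", err)
	}
	req, err := http.NewRequest(http.MethodPost, endpoint, bytes.NewReader(body))
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	p := OrgProfile{Name: "Example Corp", Industry: "Software", FoundedYear: 2010}
	// Hypothetical endpoint; replace with the real appl-tracky URL.
	req, err := buildRequest("https://appl-tracky.example/api/orgs/", p)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.Path)
	// The cron job would then send it with http.DefaultClient.Do(req).
}
```

Keeping request construction separate from sending makes the payload easy to check without touching the network.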

Scraping in Go

We need a Go scraper that can handle JavaScript-rendered pages; this blog sums up some great scrapers.

Using chromedp.
- How to test/run locally: we can't mount the source code as a volume, but we can copy the compiled executable into the image and run it.
- On AWS: at least we can use a Docker layer, like the Selenium image.
- Ideally no Docker would be needed.
- But it seems every browser-based solution needs a container, perhaps because the browser binary is platform-specific.

rivernews · 2021-09-30T09:38:27Z

Debugging scraper

Tried locating the nav bar but failed. Specifically, it looks like the HTML is corrupted: there is a \n inside class="..."
- We need to look at the browser directly, ideally open its DevTools and check the DOM tree.
  - Succeeded locally; not sure why SendKeys() failed. But the site does randomly run a bot verification check! Not sure if setting the user agent helps.
- Of course, you can query the node by XPath and double-check whether it is corrupted --> no, it's not: OuterHTML and chromedp.Nodes can find it! Only WaitVisible and SendKeys did not work; they just hang.
  - Should the input be focused before sending keys? Why is the input not visible? Getting a "cannot compute box model" error.
  - Everything done through document.querySelector... seems to work, but when chromedp starts interacting with the element, things error out, mostly with "cannot compute box model".
Tried navigating to the company search result page directly, but it seems to crash with error code 134.
- Again, not sure what's going on; need to see the browser.

How to look at the browser?

Ideally, for minimal setup, we 1) download the Chrome executable and 2) install the latest Go, so that we can:
- Compile the Go code
- Run the browser in GUI mode instead of headless mode.

rivernews · 2021-10-01T05:43:51Z

Reconsider what to scrape

LinkedIn does not allow scraping, at least within its private pages. If we try to collect employee info, we may be blocked by the bot verification check. Caching and a best-effort mindset would help, but that takes more work for less outcome, which affects our answer to the question: is it worth going down this route? After all, you can always just visit the site.

That said, review data (numeric and qualitative) is still relatively easy to retrieve.

Maybe a research hub could be feasible and useful: it contains various sections, allows (and expects) some sections to be left empty (due to network issues, page structure changes, bot checks, etc.), and applies caching to minimize scraping. We imagine such a research hub should have:

- Scraping that expects failure; perhaps slowed down to avoid triggering bot checks
- A display view that tolerates missing data
- Caching that minimizes scraping
- An audit trail to observe change over time

rivernews pinned this issue Sep 29, 2021

rivernews added the fun and user reqeusted labels Sep 29, 2021
