Automated org research tooling #100

Open
1 task done
rivernews opened this issue Sep 29, 2021 · 2 comments
Labels
fun Fun or rewarding to work on user reqeusted

Comments

@rivernews
Owner

rivernews commented Sep 29, 2021

Develop a microservice that can research the following about an organization:

  • Logo icon
  • Size
  • Year founded
  • Industry
  • Description
  • HQ location
  • Engineering team size and demographics (name, title)
  • Numeric ratings, including breakdown ratings
  • (Qualitative review text is out of scope for now)

Then a cron job can POST the data to appl-tracky, which displays it in the UI.

Scraping in Go

We need a Go scraper that can handle JavaScript-rendered pages; this blog sums up some great scrapers.

  • Using chromedp.
    • How to test and run locally: we can't mount the source code in as a volume, but we can copy the compiled executable into the image and run it.
  • On AWS: at the very least we can use a Docker layer, like the Selenium image.
    • Ideally no Docker would be needed.
    • But every browser-based solution seems to need a container, perhaps because the browser binary is platform-specific.
@rivernews rivernews pinned this issue Sep 29, 2021
@rivernews rivernews added fun Fun or rewarding to work on user reqeusted labels Sep 29, 2021
@rivernews
Owner Author

rivernews commented Sep 30, 2021

Debugging scraper

  • Tried locating the nav bar but failed. Specifically, the HTML looks corrupted: there is a \n inside class="...".
    • We need to look at the browser directly, ideally opening its DevTools and checking the DOM tree.
      • Succeeded locally; not sure why SendKeys() failed. But the site does randomly run a machine verification check! Not sure whether setting the user agent helps.
    • Of course, you can query the XPath directly to double-check whether that node is corrupted. Result: it is not; OuterHTML and chromedp.Nodes can both find it. Only WaitVisible and SendKeys did not work; they just hang.
      • Should the input be focused before sending keys? Why is the input not visible? Getting "cannot compute box model".
      • Everything done through document.querySelector... seems to work, but once chromedp starts interacting with the element, things error out, mostly with "cannot compute box model".
  • Tried navigating directly to the company search results page, but the browser seems to have crashed with error code 134.
    • Again, not sure what's going on; need to see the browser.
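Since document.querySelector seems to work while WaitVisible and SendKeys hang, one possible workaround is to fill the input from inside the page with JavaScript. The sketch below only builds that JavaScript expression; the helper names `quoteJS` and `SetValueJS` are made up here. The assumption is that the resulting string would be handed to something like chromedp.Evaluate. This is untested against the real site and does not address the bot check.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// quoteJS turns a Go string into a JavaScript string literal.
// JSON string encoding is used so quotes and newlines are escaped safely.
func quoteJS(s string) (string, error) {
	b, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// SetValueJS returns an expression that sets the value of the first element
// matching selector and fires an "input" event so page scripts notice the
// change. The expression evaluates to false when no element matches.
func SetValueJS(selector, value string) (string, error) {
	sel, err := quoteJS(selector)
	if err != nil {
		return "", fmt.Errorf("quote selector: %w", err)
	}
	val, err := quoteJS(value)
	if err != nil {
		return "", fmt.Errorf("quote value: %w", err)
	}
	return fmt.Sprintf(`(() => {
  const el = document.querySelector(%s);
  if (!el) return false;
  el.value = %s;
  el.dispatchEvent(new Event("input", { bubbles: true }));
  return true;
})()`, sel, val), nil
}
```

Setting `value` this way sends no real key events, so a page that listens for keydown or keyup may still ignore it; checking the boolean result at least shows whether the selector matched.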

How to look at the browser?

  • Ideally, for a minimal setup, we can 1) download the Chrome executable and 2) install the latest Go, so that we can:
    • Compile the Go code
    • Run the browser in GUI mode rather than headless mode.

@rivernews
Owner Author

rivernews commented Oct 1, 2021

Reconsider what to scrape

LinkedIn does not allow scraping, at least of its private pages. If we try to collect employee info, we may be blocked by the bot verification check. Caching and a best-effort mindset would help, but that means more work for less outcome, which affects our answer to the question: is it worth going down this route? After all, you can always just visit the site.

Review data (numeric and qualitative), on the other hand, is still relatively easy to retrieve.

Maybe a research hub would be feasible and useful: it contains various sections, allows (and expects) some sections to be left empty (due to network issues, page structure changes, bot checks, etc.), and applies caching to minimize scraping. We imagine such a research hub should have:

  • Scraping that expects failure, perhaps slowed down to avoid triggering bot checks
  • A display view that tolerates missing data
  • Caching that minimizes scraping
  • An audit trail to observe changes over time
