Creeper is a next-generation crawler that fetches web pages according to a Creeper script. As a cross-platform, embeddable crawler, it can be used in news apps, subscription programs, and similar projects.
Warning: This project is still in early-stage development. Please do not use it in a production environment.
$ go get github.com/wspl/creeper
Create hacker_news.crs
page(@page=1) = "https://news.ycombinator.com/news?p={@page}"
news[]: page -> $("tr.athing")
    title: $(".title a.storylink").text
    site: $(".title span.sitestr").text
    link: $(".title a.storylink").href
Then, create main.go
package main

import "github.com/wspl/creeper"

func main() {
	c := creeper.Open("./hacker_news.crs")
	c.Array("news").Each(func(c *creeper.Creeper) {
		println("title: ", c.String("title"))
		println("site: ", c.String("site"))
		println("link: ", c.String("link"))
		println("===")
	})
}
Build and run it. The console will print something like:
title: Samsung chief Lee arrested as S.Korean corruption probe deepens
site: reuters.com
link: http://www.reuters.com/article/us-southkorea-politics-samsung-group-idUSKBN15V2RD
===
title: ReactOS 0.4.4 Released
site: reactos.org
link: https://reactos.org/project-news/reactos-044-released
===
title: FeFETs: How this new memory stacks up against existing non-volatile memory
site: semiengineering.com
link: http://semiengineering.com/what-are-fefets/
A town is a lambda-like expression that stores an (im)mutable string. Most of the time, it is used to store a URL.
page(@page=1, ext) = "https://news.ycombinator.com/news?p={@page}&ext={ext}"
When you need a town, use it as if you were calling a function:
news[]: page(ext="Hello World!") -> $("tr.athing")
You might have noticed that the `@page` parameter is not passed in this call. That is because it is a special parameter.
In a town definition, an expression such as `name="something"` means that the parameter `name` has the default value `"something"`.
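The substitution described above can be pictured with a minimal Go sketch. It is not Creeper's implementation: the `town` type, its fields, and the `expand` method are hypothetical names introduced only for illustration, and no URL escaping is applied because the text above does not say whether Creeper escapes values.

```go
package main

import (
	"fmt"
	"strings"
)

// town is a hypothetical stand-in for a town definition: a string template,
// the declared parameter names, and any default values.
type town struct {
	template string
	params   []string
	defaults map[string]string
}

// expand replaces every {name} placeholder. A value passed by the caller
// wins; otherwise the default from the definition is used (or "" if the
// parameter has no default).
func (t town) expand(args map[string]string) string {
	out := t.template
	for _, name := range t.params {
		val, ok := args[name]
		if !ok {
			val = t.defaults[name]
		}
		out = strings.ReplaceAll(out, "{"+name+"}", val)
	}
	return out
}

func main() {
	page := town{
		template: "https://news.ycombinator.com/news?p={@page}&ext={ext}",
		params:   []string{"@page", "ext"},
		defaults: map[string]string{"@page": "1"},
	}
	// Only ext is supplied; @page falls back to its default.
	fmt.Println(page.expand(map[string]string{"ext": "hello"}))
	// prints: https://news.ycombinator.com/news?p=1&ext=hello
}
```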
Incidentally, `@page` is a parameter that is incremented automatically when the current page has no more content.
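As a rough illustration of this auto-increment behaviour, the Go sketch below pulls items one at a time and requests the next page whenever the current one is exhausted. Everything in it is an assumption made for the example: `fetchItems` serves canned data instead of making HTTP requests, `pager` and `collect` are hypothetical names, and stopping at the first empty page is a choice of this sketch rather than documented Creeper behaviour.

```go
package main

import "fmt"

// fetchItems is a hypothetical stand-in for fetching and parsing one page.
// It serves canned data: pages 1 and 2 hold three items each, later pages
// are empty.
func fetchItems(page int) []string {
	if page > 2 {
		return nil
	}
	items := make([]string, 3)
	for i := range items {
		items[i] = fmt.Sprintf("page %d item %d", page, i+1)
	}
	return items
}

// pager hands out items one by one and advances the page counter when the
// current page has no more content.
type pager struct {
	page  int      // value that would be substituted for @page
	queue []string // items of the current page not yet consumed
}

func (p *pager) next() (string, bool) {
	if len(p.queue) == 0 {
		p.queue = fetchItems(p.page)
		p.page++ // the next refill asks for the following page
		if len(p.queue) == 0 {
			return "", false // no content left at all
		}
	}
	item := p.queue[0]
	p.queue = p.queue[1:]
	return item, true
}

// collect gathers up to limit items, starting at page start.
func collect(start, limit int) []string {
	p := &pager{page: start}
	var out []string
	for len(out) < limit {
		item, ok := p.next()
		if !ok {
			break
		}
		out = append(out, item)
	}
	return out
}

func main() {
	for _, item := range collect(1, 4) {
		fmt.Println(item)
	}
}
```

With the canned data, asking for four items consumes all of page 1 and then the first item of page 2, without the caller ever touching the page number.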
Nodes form a tree that represents the data structure you are going to crawl.
news[]: page -> $("tr.athing")
    title: $(".title a.storylink").text
    site: $(".title span.sitestr").text
    link: $(".title a.storylink").href
Like `yaml`, nodes express hierarchy through indentation.
Every node has a name. `title` is a field name and represents a plain string value. `news[]` is an array name and represents a parent structure that holds multiple child entries.
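To make the indentation rule concrete, here is a small Go sketch that turns indented node lines into a tree and marks names ending in `[]` as arrays. It is an illustration only, not Creeper's parser: the `node` type and `parseNodes` function are hypothetical, only node lines are handled (town definitions are left out), and indentation is measured as the number of leading whitespace characters.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a hypothetical in-memory form of one node line.
type node struct {
	name     string
	isArray  bool
	children []*node
}

// parseNodes builds a tree from "name: ..." lines. Each line becomes a child
// of the nearest preceding line that is indented less than it.
func parseNodes(src string) []*node {
	type frame struct {
		indent int
		n      *node
	}
	var roots []*node
	var stack []frame
	for _, line := range strings.Split(src, "\n") {
		trimmed := strings.TrimLeft(line, " \t")
		name, _, found := strings.Cut(trimmed, ":")
		if trimmed == "" || !found {
			continue // skip blank lines and anything that is not a node
		}
		indent := len(line) - len(trimmed)
		n := &node{
			name:    strings.TrimSuffix(name, "[]"),
			isArray: strings.HasSuffix(name, "[]"),
		}
		// Drop stack entries that cannot be ancestors of this line.
		for len(stack) > 0 && stack[len(stack)-1].indent >= indent {
			stack = stack[:len(stack)-1]
		}
		if len(stack) == 0 {
			roots = append(roots, n)
		} else {
			parent := stack[len(stack)-1].n
			parent.children = append(parent.children, n)
		}
		stack = append(stack, frame{indent, n})
	}
	return roots
}

func main() {
	src := `news[]: page -> $("tr.athing")
    title: $(".title a.storylink").text
    site: $(".title span.sitestr").text
    link: $(".title a.storylink").href`
	for _, root := range parseNodes(src) {
		fmt.Printf("%s (array: %v)\n", root.name, root.isArray)
		for _, child := range root.children {
			fmt.Printf("  %s\n", child.name)
		}
	}
}
```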
A page indicates where the field data is fetched from. It can be a town expression or a field reference.
Field references are an advanced use of nodes; you can find the details in ./eh.crs.
If a node has both a page and a fun, the page goes on the left of `->` and the fun goes on the right, that is, `page -> fun`.
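The layout of a node line can be sketched in Go as a split on the first `:` and then on `->`. The `splitNode` helper below is hypothetical and deliberately naive (it does not account for `->` appearing inside a quoted selector); it only illustrates which part of the line is the name, the page, and the fun.

```go
package main

import (
	"fmt"
	"strings"
)

// splitNode breaks a line of the form "name: page -> fun" into its parts.
// page is empty when the line contains no "->".
func splitNode(line string) (name, page, fun string) {
	head, rest, _ := strings.Cut(strings.TrimSpace(line), ":")
	rest = strings.TrimSpace(rest)
	if left, right, ok := strings.Cut(rest, "->"); ok {
		return head, strings.TrimSpace(left), strings.TrimSpace(right)
	}
	return head, "", rest
}

func main() {
	name, page, fun := splitNode(`news[]: page -> $("tr.athing")`)
	fmt.Printf("name=%q page=%q fun=%q\n", name, page, fun)

	name, page, fun = splitNode(`title: $(".title a.storylink").text`)
	fmt.Printf("name=%q page=%q fun=%q\n", name, page, fun)
}
```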
A fun describes how the fetched data is processed.
The following funs are supported:
Name | Parameters | Description |
---|---|---|
$ | (selector: string) | Relative CSS selector (select from parent node) |
$root | (selector: string) | Absolute CSS selector (select from body) |
html | | inner HTML |
text | | inner text |
outerHTML | | outer HTML |
attr | (attr: string) | attribute value |
style | | style attribute value |
href | | href attribute value |
src | | src attribute value |
class | | class attribute value |
id | | id attribute value |
calc | (prec: int) | calculate arithmetic expression |
match | (regexp: string) | match first sub-string via regular expression |
expand | (regexp: string, target: string) | expand matched strings to target string |
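For readers unfamiliar with the two regular-expression funs, the Go sketch below shows one plausible reading of `match` and `expand` using only the standard `regexp` package. The helper names `matchFirst` and `expandFirst` are hypothetical, and the assumption that `expand` follows Go's `$1`-style template expansion is not confirmed by the table above.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchFirst returns the first substring of s that matches pattern,
// or "" when nothing matches. It panics on an invalid pattern.
func matchFirst(pattern, s string) string {
	return regexp.MustCompile(pattern).FindString(s)
}

// expandFirst finds the first match of pattern in s and rewrites it into
// target, where target may refer to capture groups as $1, $2, and so on.
// It returns "" when nothing matches.
func expandFirst(pattern, target, s string) string {
	re := regexp.MustCompile(pattern)
	m := re.FindStringSubmatchIndex(s)
	if m == nil {
		return ""
	}
	return string(re.ExpandString(nil, target, s, m))
}

func main() {
	text := "128 points by foo"
	fmt.Println(matchFirst(`\d+`, text))                            // prints: 128
	fmt.Println(expandFirst(`(\d+) points by (\w+)`, "$2:$1", text)) // prints: foo:128
}
```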
Plutonist