Only run tut for 2.12 (tpolecat/tut#246)
Fristi authored and ruippeixotog committed Sep 18, 2019
1 parent 7625e1c commit abc531b
Showing 2 changed files with 79 additions and 39 deletions.
5 changes: 3 additions & 2 deletions .travis.yml
@@ -12,10 +12,11 @@ addons:
- tor

script:
-- sbt ++$TRAVIS_SCALA_VERSION clean coverage test tut
+- sbt ++$TRAVIS_SCALA_VERSION clean coverage test

# check if there are no changes after `tut` runs
-- if [[ $TRAVIS_SCALA_VERSION =~ '^2\.13.*' ]]; then
+- if [[ $TRAVIS_SCALA_VERSION =~ '^2\.12.*' ]]; then
+    sbt tut
    git diff --exit-code;
  fi
after_success:
113 changes: 76 additions & 37 deletions README.md
@@ -57,21 +57,26 @@ import net.ruippeixotog.scalascraper.model._
// import net.ruippeixotog.scalascraper.model._

// Extract the text inside the element with id "header"

doc >> text("#header")
-// res2: String = Test page h1
+// res0: String = Test page h1

// Extract the <span> elements inside #menu

val items = doc >> elementList("#menu span")
// items: List[net.ruippeixotog.scalascraper.model.Element] = List(JsoupElement(<span><a href="#home">Home</a></span>), JsoupElement(<span><a href="#section1">Section 1</a></span>), JsoupElement(<span class="active">Section 2</span>), JsoupElement(<span><a href="#section3">Section 3</a></span>))

// From each item, extract all the text inside their <a> elements

items.map(_ >> allText("a"))
-// res5: List[String] = List(Home, Section 1, "", Section 3)
+// res1: List[String] = List(Home, Section 1, "", Section 3)

// From the meta element with "viewport" as its attribute name, extract the

// text in the content attribute

doc >> attr("content")("meta[name=viewport]")
-// res8: String = width=device-width, initial-scale=1
+// res2: String = width=device-width, initial-scale=1
```

If the element may or may not be in the page, the `>?>` operator tries to extract the content and returns it wrapped in an `Option`:
@@ -80,7 +85,7 @@ If the element may or may not be in the page, the `>?>` tries to extract the con
// Extract the element with id "footer" if it exists, return `None` if it
// doesn't:
doc >?> element("#footer")
-// res11: Option[net.ruippeixotog.scalascraper.model.Element] =
+// res3: Option[net.ruippeixotog.scalascraper.model.Element] =
// Some(JsoupElement(<div id="footer">
// <span>No copyright 2014</span>
// </div>))
@@ -144,15 +149,17 @@ Some usage examples:
```scala
// Extract the date from the "#date" element
doc >> extractor("#date", text, asLocalDate("yyyy-MM-dd"))
-// res17: org.joda.time.LocalDate = 2014-10-26
+// res5: org.joda.time.LocalDate = 2014-10-26

// Extract the text of all "#mytable td" elements and parse each of them as a number

doc >> extractor("#mytable td", texts, seq(asDouble))
-// res19: TraversableOnce[Double] = non-empty iterator
+// res6: TraversableOnce[Double] = List(3.0, 15.0, 15.0, 1.0)

// Extract an element "h1" and do no parsing (the default parsing behavior)

doc >> extractor("h1", element, asIs[Element])
-// res21: net.ruippeixotog.scalascraper.model.Element = JsoupElement(<h1>Test page h1</h1>)
+// res7: net.ruippeixotog.scalascraper.model.Element = JsoupElement(<h1>Test page h1</h1>)
```

With the help of the implicit conversions provided by the DSL, we can write the most common extraction cases more succinctly:
@@ -166,8 +173,8 @@ Because of that, one can write the expressions in the Quick Start section, as we
```scala
// Extract all the "h3" elements (as a lazy iterable)
doc >> "h3"
-// res23: net.ruippeixotog.scalascraper.model.ElementQuery[net.ruippeixotog.scalascraper.model.Element] =
-// LazyElementQuery(WrappedArray(h3), JsoupElement(<html lang="en">
+// res8: net.ruippeixotog.scalascraper.model.ElementQuery[net.ruippeixotog.scalascraper.model.Element] =
+// LazyElementQuery(ArraySeq(h3), JsoupElement(<html lang="en">
// <head>
// <meta charset="utf-8">
// <meta name="viewport" content="width=device-width, initial-scale=1">
@@ -188,19 +195,23 @@ doc >> "h3"
// <h2>Test page h2</h2>
// <span id="date">2014-10-26</span>
// <span id="datefull">2014-10-26T12:30:05Z</span>
// <span id="rating">4....
// <span id="rating">4.5</span>
// <span id="pages">2...

// Extract all text inside this document

doc >> allText
-// res25: String = Test page Test page h1 Home Section 1 Section 2 Section 3 Test page h2 2014-10-26 2014-10-26T12:30:05Z 4.5 2 Section 1 h3 Some text for testing More text for testing Section 2 h3 My Form Add field Section 3 h3 3 15 15 1 No copyright 2014
+// res9: String = Test page Test page h1 Home Section 1 Section 2 Section 3 Test page h2 2014-10-26 2014-10-26T12:30:05Z 4.5 2 Section 1 h3 Some text for testing More text for testing Section 2 h3 My Form Add field Section 3 h3 3 15 15 1 No copyright 2014

// Extract the elements with class ".active"

doc >> elementList(".active")
-// res27: List[net.ruippeixotog.scalascraper.model.Element] = List(JsoupElement(<span class="active">Section 2</span>))
+// res10: List[net.ruippeixotog.scalascraper.model.Element] = List(JsoupElement(<span class="active">Section 2</span>))

// Extract the text inside each "p" element

doc >> texts("p")
-// res29: Iterable[String] = List(Some text for testing, More text for testing)
+// res11: Iterable[String] = List(Some text for testing, More text for testing)
```

## Content Validation
@@ -234,7 +245,7 @@ Some validation examples:
```scala
// Check if the title of the page is "Test page"
doc >/~ validator(text("title"))(_ == "Test page")
-// res31: Either[Unit,browser.DocumentType] =
+// res12: Either[Unit,browser.DocumentType] =
// Right(JsoupDocument(<!doctype html>
// <html lang="en">
// <head>
@@ -264,12 +275,14 @@ doc >/~ validator(text("title"))(_ == "Test page")
// <p>Some text ...

// Check if there are at least 3 ".active" elements

doc >/~ validator(".active")(_.size >= 3)
-// res33: Either[Unit,browser.DocumentType] = Left(())
+// res13: Either[Unit,browser.DocumentType] = Left(())

// Check if the text in ".desc" contains the word "blue"

doc >/~ validator(allText("#mytable"))(_.contains("blue"))
-// res35: Either[Unit,browser.DocumentType] = Left(())
+// res14: Either[Unit,browser.DocumentType] = Left(())
```

When a document fails a validation, it may be useful to identify the problem by pattern-matching it against common scraping pitfalls, such as a login page that appears unexpectedly because of an expired cookie, dynamic content that disappeared, or server-side errors. If we define validators for both the success case and error cases:
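The definitions of `succ` and `errors` are collapsed in this diff; a minimal sketch of what they can look like follows (the selectors and the first error message are invented; only the "Too few items" result is taken from the output below):

```scala
// Success case: the page must have at least 3 ".item" elements
// (".item" is a hypothetical selector)
val succ = validator(elementList(".item"))(_.size >= 3)

// Error cases: the extra `result` value is returned inside a `Left` when the
// matcher fires (".msg" and "Session expired!" are hypothetical)
val errors = Seq(
  validator(allText(".msg"), "Session expired!")(_.contains("log in")),
  validator(elementList(".item"), "Too few items")(_.size < 3))
```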
@@ -287,7 +300,7 @@ They can be used in combination to create more informative validations:

```scala
doc >/~ (succ, errors)
-// res37: Either[String,browser.DocumentType] = Left(Too few items)
+// res15: Either[String,browser.DocumentType] = Left(Too few items)
```

Validators matching errors were constructed above using an additional `result` parameter after the extractor. That value is returned wrapped in a `Left` if that particular error occurs during a validation.
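For instance, a standalone error validator with a `result` value might look like this (a sketch; the selector and message are invented):

```scala
// If the matcher fires during validation, the result is
// Left("Site under maintenance")
val maintenance =
  validator(text("title"), "Site under maintenance")(_.contains("Maintenance"))
```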
@@ -299,7 +312,7 @@ As shown before in the Quick Start section, one can try if an extractor works in
```scala
// Try to extract an element with id "optional", return `None` if none exist
doc >?> element("#optional")
-// res39: Option[net.ruippeixotog.scalascraper.model.Element] = None
+// res16: Option[net.ruippeixotog.scalascraper.model.Element] = None
```

Note that when using `>?>` with content extractors that return sequences, such as `texts` and `elements`, `None` will never be returned (`Some(Seq())` will be returned instead).
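For example (a sketch; the selector below is assumed to match nothing in the test page):

```scala
// `texts` extracts a sequence, so a selector with no matches yields an
// empty collection inside a `Some`, never `None`
doc >?> texts("#no-such-id")
// expected: Some(List())
```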
@@ -309,40 +322,45 @@ If you want to use multiple extractors in a single document or element, you can
```scala
// Extract the text of the title element and all inputs of #myform
doc >> (text("title"), elementList("#myform input"))
-// res41: (String, List[net.ruippeixotog.scalascraper.model.Element]) = (Test page,List(JsoupElement(<input type="text" name="name" value="John">), JsoupElement(<input type="text" name="address">), JsoupElement(<input type="submit" value="Submit">)))
+// res17: (String, List[net.ruippeixotog.scalascraper.model.Element]) = (Test page,List(JsoupElement(<input type="text" name="name" value="John">), JsoupElement(<input type="text" name="address">), JsoupElement(<input type="submit" value="Submit">)))
```

The extraction operators work on `List`, `Option`, `Either` and other instances for which a [Scalaz](https://github.com/scalaz/scalaz) `Functor` instance exists. The extraction occurs by mapping over the functors:

```scala
// Extract the titles of all documents in the list
List(doc, doc) >> text("title")
-// res43: List[String] = List(Test page, Test page)
+// res18: List[String] = List(Test page, Test page)

// Extract the title if the document is a `Some`

Option(doc) >> text("title")
-// res45: Option[String] = Some(Test page)
+// res19: Option[String] = Some(Test page)
```
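A sketch of the `Either` case mentioned above, assuming the functor instances in scope cover Scala's `Either` as the paragraph states (mapping happens over the right side):

```scala
// Extract the title if the value is a `Right`; a `Left` passes through
val docOrError: Either[String, Document] = Right(doc)
docOrError >> text("title")
// expected: Right(Test page)
```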

You can apply other extractors and validators to the result of an extraction, which is particularly powerful combined with the feature shown above:

```scala
// From the "#menu" element, extract the text in the ".active" element inside
doc >> element("#menu") >> text(".active")
-// res47: String = Section 2
+// res20: String = Section 2

// Same as above, but in a scenario where "#menu" can be absent

doc >?> element("#menu") >> text(".active")
-// res49: Option[String] = Some(Section 2)
+// res21: Option[String] = Some(Section 2)

// Same as above, but check if the "#menu" has any "span" element before

// extracting the text

doc >?> element("#menu") >/~ validator("span")(_.nonEmpty) >> text(".active")
-// res52: Option[scala.util.Either[Unit,String]] = Some(Right(Section 2))
+// res22: Option[scala.util.Either[Unit,String]] = Some(Right(Section 2))

// Extract the links inside all the "#menu > span" elements

doc >> elementList("#menu > span") >?> attr("href")("a")
-// res54: List[Option[String]] = List(Some(#home), Some(#section1), None, Some(#section3))
+// res23: List[Option[String]] = List(Some(#home), Some(#section1), None, Some(#section3))
```

This library also provides a `Functor` for `HtmlExtractor`, making it possible to map over extractors and create chained extractors that can be passed around and stored like objects. For example, new extractors can be defined like this:
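The definitions themselves are collapsed in this diff; a sketch consistent with the usage and outputs below could look like this (the names match the usage, the bodies are assumptions):

```scala
import net.ruippeixotog.scalascraper.scraper.HtmlExtractor

// The "href" (if present) of the link inside each "span" element
val spanLinks: HtmlExtractor[Element, List[Option[String]]] =
  elementList("span") >?> attr("href")("a")

// Map over an extractor to derive a new one: count only spans with a link
val spanLinksCount: HtmlExtractor[Element, Int] =
  spanLinks.map(_.flatten.length)

// Chain extractors: apply `spanLinks` only inside the "#menu" element
val menuLinks: HtmlExtractor[Element, List[Option[String]]] =
  element("#menu") >> spanLinks
```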
Expand Down Expand Up @@ -371,40 +389,44 @@ And they can be used just as extractors created using other means provided by th

```scala
doc >> spanLinks
-// res60: List[Option[String]] = List(Some(#home), Some(#section1), None, Some(#section3), None, None, None, None, None, Some(#), None)
+// res24: List[Option[String]] = List(Some(#home), Some(#section1), None, Some(#section3), None, None, None, None, None, Some(#), None)

doc >> spanLinksCount
-// res61: Int = 4
+// res25: Int = 4

doc >> menuLinks
-// res62: List[Option[String]] = List(Some(#home), Some(#section1), None, Some(#section3))
+// res26: List[Option[String]] = List(Some(#home), Some(#section1), None, Some(#section3))
```

Just remember that you can only apply extraction operators `>>` and `>?>` to documents, elements or functors "containing" them, which means that the following is a compile-time error:

```scala
// The `texts` extractor extracts a list of strings and extractors cannot be

// applied to strings

doc >> texts("#menu > span") >> "a"
-// <console>:30: error: value >> is not a member of Iterable[String]
// doc >> texts("#menu > span") >> "a"
// ^
+// On line 2: error: value >> is not a member of Iterable[String]
```
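The same intent can be expressed by extracting the elements first and then mapping the inner extractor over the resulting list, as in the earlier examples:

```scala
// Extract the elements, then the text inside each one's "a" elements
doc >> elementList("#menu > span") >> allText("a")
// expected: List(Home, Section 1, "", Section 3)
```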

Finally, if you prefer not to use operators for the sake of code legibility, you can use alternative methods:

```scala
// `extract` is the same as `>>`
doc extract text("title")
-// res67: String = Test page
+// res28: String = Test page

// `tryExtract` is the same as `>?>`

doc tryExtract element("#optional")
-// res69: Option[net.ruippeixotog.scalascraper.model.Element] = None
+// res29: Option[net.ruippeixotog.scalascraper.model.Element] = None

// `validateWith` is the same as `>/~`

doc validateWith (succ, errors)
-// res71: Either[String,browser.DocumentType] = Left(Too few items)
+// res30: Either[String,browser.DocumentType] = Left(Too few items)
```

## Using Browser-Specific Features
@@ -441,39 +463,56 @@ Note that extracting using CSS queries also keeps the concrete types of the elem
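The definitions of `typedDoc` and `aElem` are collapsed in this diff; a plausible setup looks like the following sketch (the file name `example.html` is inferred from the `location` output further down):

```scala
import net.ruippeixotog.scalascraper.browser.HtmlUnitBrowser
import net.ruippeixotog.scalascraper.browser.HtmlUnitBrowser._

// `typed()` keeps the concrete browser type, so its documents and elements
// also keep their concrete, browser-specific types
val typedBrowser: HtmlUnitBrowser = HtmlUnitBrowser.typed()
val typedDoc: HtmlUnitDocument = typedBrowser.parseFile("example.html")

// `pElement` preserves the concrete element type through the extraction
val aElem: HtmlUnitElement = typedDoc >> "#menu" >> "span:nth-child(2)" >> "a" >> pElement
```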
```scala
// same thing as above
typedDoc >> "#menu" >> "span:nth-child(2)" >> "a" >> pElement
-// res78: net.ruippeixotog.scalascraper.dsl.DSL.Extract.pElement.Out[net.ruippeixotog.scalascraper.browser.HtmlUnitBrowser.HtmlUnitElement] = HtmlUnitElement(HtmlAnchor[<a href="#section1">])
+// res31: net.ruippeixotog.scalascraper.dsl.DSL.Extract.pElement.Out[net.ruippeixotog.scalascraper.browser.HtmlUnitBrowser.HtmlUnitElement] = HtmlUnitElement(HtmlAnchor[<a href="#section1">])
```

Concrete element types, like `HtmlUnitElement`, expose a public `underlying` field with the underlying element object used by the browser backend. In the case of HtmlUnit, that would be a [`DomElement`](http://htmlunit.sourceforge.net/apidocs/com/gargoylesoftware/htmlunit/html/DomElement.html), which exposes a whole new range of operations:

```scala
// extract the current "href" this "a" element points to
aElem >> attr("href")
-// res80: String = #section1
+// res32: String = #section1

// use `underlying` to update the "href" attribute

aElem.underlying.setAttribute("href", "#section1_2")

// verify that "href" was updated

aElem >> attr("href")
-// res84: String = #section1_2
+// res34: String = #section1_2

// get the location of the document (without the host and the full path parts)

typedDoc.location.split("/").last
-// res86: String = example.html
+// res35: String = example.html

def click(elem: HtmlUnitElement) {
+// def click(elem: HtmlUnitElement) {
+// ^
+// On line 2: warning: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `click`'s return type
  // the type param may be needed, as the original API uses Java wildcards
+// def click(elem: HtmlUnitElement) {
+// ^
+// On line 2: warning: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `click`'s return type
  aElem.underlying.click[com.gargoylesoftware.htmlunit.Page]()
+// def click(elem: HtmlUnitElement) {
+// ^
+// On line 2: warning: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `click`'s return type
}
+// def click(elem: HtmlUnitElement) {
+// ^
+// On line 2: warning: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `click`'s return type
// click: (elem: net.ruippeixotog.scalascraper.browser.HtmlUnitBrowser.HtmlUnitElement)Unit

// simulate a click on our recently modified element

click(aElem)

// check the new location

typedDoc.location.split("/").last
-// res90: String = example.html#section1_2
+// res37: String = example.html#section1_2
```

Using the typed element API provides much more flexibility when more than querying elements is required. However, one should avoid using it unless strictly necessary, as:
