Update to Scala 2.13 #67

Closed · wants to merge 12 commits
7 changes: 4 additions & 3 deletions .travis.yml
@@ -2,20 +2,21 @@ sudo: false

language: scala
scala:
- 2.11.12
- 2.12.8
- 2.13.0
jdk:
- oraclejdk8
- openjdk8
addons:
apt:
packages:
- tor

script:
- - sbt ++$TRAVIS_SCALA_VERSION clean coverage test tut
+ - sbt ++$TRAVIS_SCALA_VERSION clean coverage test

# check if there are no changes after `tut` runs
- if [[ $TRAVIS_SCALA_VERSION =~ ^2\.12.* ]]; then
sbt tut
git diff --exit-code;
fi
after_success:
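The `tut` re-generation check above can be read as the following standalone sketch (assuming Travis exports `TRAVIS_SCALA_VERSION`). One caveat worth noting: in Bash, a quoted right-hand side of `=~` is compared as a literal string, so the pattern must be left unquoted to act as a regex.

```shell
#!/usr/bin/env bash
# Run the doc check only on the Scala 2.12 build: `tut` regenerates the
# README from the typechecked sources, and `git diff --exit-code` fails
# the build if the committed README is out of date.
# The pattern is unquoted on purpose: a quoted pattern after =~ would be
# matched as a literal string and never fire.
if [[ "$TRAVIS_SCALA_VERSION" =~ ^2\.12.* ]]; then
  sbt tut
  git diff --exit-code
fi
```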
113 changes: 76 additions & 37 deletions README.md
@@ -57,21 +57,26 @@ import net.ruippeixotog.scalascraper.model._
// import net.ruippeixotog.scalascraper.model._

// Extract the text inside the element with id "header"

doc >> text("#header")
- // res2: String = Test page h1
+ // res0: String = Test page h1

// Extract the <span> elements inside #menu

val items = doc >> elementList("#menu span")
// items: List[net.ruippeixotog.scalascraper.model.Element] = List(JsoupElement(<span><a href="#home">Home</a></span>), JsoupElement(<span><a href="#section1">Section 1</a></span>), JsoupElement(<span class="active">Section 2</span>), JsoupElement(<span><a href="#section3">Section 3</a></span>))

// From each item, extract all the text inside their <a> elements

items.map(_ >> allText("a"))
- // res5: List[String] = List(Home, Section 1, "", Section 3)
+ // res1: List[String] = List(Home, Section 1, "", Section 3)

// From the meta element with "viewport" as its attribute name, extract the
// text in the content attribute
doc >> attr("content")("meta[name=viewport]")
- // res8: String = width=device-width, initial-scale=1
+ // res2: String = width=device-width, initial-scale=1
```

If the element may or may not be in the page, the `>?>` operator tries to extract the content and returns it wrapped in an `Option`:
@@ -80,7 +85,7 @@ If the element may or may not be in the con
// Extract the element with id "footer" if it exists, return `None` if it
// doesn't:
doc >?> element("#footer")
- // res11: Option[net.ruippeixotog.scalascraper.model.Element] =
+ // res3: Option[net.ruippeixotog.scalascraper.model.Element] =
// Some(JsoupElement(<div id="footer">
// <span>No copyright 2014</span>
// </div>))
@@ -144,15 +149,17 @@ Some usage examples:
```scala
// Extract the date from the "#date" element
doc >> extractor("#date", text, asLocalDate("yyyy-MM-dd"))
- // res17: org.joda.time.LocalDate = 2014-10-26
+ // res5: org.joda.time.LocalDate = 2014-10-26

// Extract the text of all "#mytable td" elements and parse each of them as a number

doc >> extractor("#mytable td", texts, seq(asDouble))
- // res19: TraversableOnce[Double] = non-empty iterator
+ // res6: TraversableOnce[Double] = List(3.0, 15.0, 15.0, 1.0)

// Extract an element "h1" and do no parsing (the default parsing behavior)

doc >> extractor("h1", element, asIs[Element])
- // res21: net.ruippeixotog.scalascraper.model.Element = JsoupElement(<h1>Test page h1</h1>)
+ // res7: net.ruippeixotog.scalascraper.model.Element = JsoupElement(<h1>Test page h1</h1>)
```

With the help of the implicit conversions provided by the DSL, we can write the most common extraction cases more succinctly:
@@ -166,8 +173,8 @@ Because of that, one can write the expressions in the Quick Start section, as we
```scala
// Extract all the "h3" elements (as a lazy iterable)
doc >> "h3"
- // res23: net.ruippeixotog.scalascraper.model.ElementQuery[net.ruippeixotog.scalascraper.model.Element] =
- // LazyElementQuery(WrappedArray(h3), JsoupElement(<html lang="en">
+ // res8: net.ruippeixotog.scalascraper.model.ElementQuery[net.ruippeixotog.scalascraper.model.Element] =
+ // LazyElementQuery(ArraySeq(h3), JsoupElement(<html lang="en">
// <head>
// <meta charset="utf-8">
// <meta name="viewport" content="width=device-width, initial-scale=1">
@@ -188,19 +195,23 @@ doc >> "h3"
// <h2>Test page h2</h2>
// <span id="date">2014-10-26</span>
// <span id="datefull">2014-10-26T12:30:05Z</span>
- // <span id="rating">4....
+ // <span id="rating">4.5</span>
+ // <span id="pages">2...

// Extract all text inside this document

doc >> allText
- // res25: String = Test page Test page h1 Home Section 1 Section 2 Section 3 Test page h2 2014-10-26 2014-10-26T12:30:05Z 4.5 2 Section 1 h3 Some text for testing More text for testing Section 2 h3 My Form Add field Section 3 h3 3 15 15 1 No copyright 2014
+ // res9: String = Test page Test page h1 Home Section 1 Section 2 Section 3 Test page h2 2014-10-26 2014-10-26T12:30:05Z 4.5 2 Section 1 h3 Some text for testing More text for testing Section 2 h3 My Form Add field Section 3 h3 3 15 15 1 No copyright 2014

// Extract the elements with class ".active"

doc >> elementList(".active")
- // res27: List[net.ruippeixotog.scalascraper.model.Element] = List(JsoupElement(<span class="active">Section 2</span>))
+ // res10: List[net.ruippeixotog.scalascraper.model.Element] = List(JsoupElement(<span class="active">Section 2</span>))

// Extract the text inside each "p" element

doc >> texts("p")
- // res29: Iterable[String] = List(Some text for testing, More text for testing)
+ // res11: Iterable[String] = List(Some text for testing, More text for testing)
```

## Content Validation
@@ -234,7 +245,7 @@ Some validation examples:
```scala
// Check if the title of the page is "Test page"
doc >/~ validator(text("title"))(_ == "Test page")
- // res31: Either[Unit,browser.DocumentType] =
+ // res12: Either[Unit,browser.DocumentType] =
// Right(JsoupDocument(<!doctype html>
// <html lang="en">
// <head>
@@ -264,12 +275,14 @@ doc >/~ validator(text("title"))(_ == "Test page")
// <p>Some text ...

// Check if there are at least 3 ".active" elements

doc >/~ validator(".active")(_.size >= 3)
- // res33: Either[Unit,browser.DocumentType] = Left(())
+ // res13: Either[Unit,browser.DocumentType] = Left(())

// Check if the text in ".desc" contains the word "blue"

doc >/~ validator(allText("#mytable"))(_.contains("blue"))
- // res35: Either[Unit,browser.DocumentType] = Left(())
+ // res14: Either[Unit,browser.DocumentType] = Left(())
```

When a document fails a validation, it may be useful to identify the problem by pattern-matching it against common scraping pitfalls, such as a login page that appears unexpectedly because of an expired cookie, dynamic content that disappeared, or server-side errors. If we define validators for both the success case and error cases:
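The validator definitions themselves are collapsed in this diff. As a hedged sketch (the selectors, messages, and thresholds below are illustrative rather than the actual README's), such validators could look like:

```scala
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

// success case: the page has enough items
val succ = validator(elementList(".item"))(_.size >= 3)

// error cases: each carries a result value, returned inside a Left on match
val errors = Seq(
  validator(allText("#msg"), "Logged out")(_.contains("sign in")),
  validator(elementList(".item"), "Too few items")(_.size < 3))
```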
@@ -287,7 +300,7 @@ They can be used in combination to create more informative validations:

```scala
doc >/~ (succ, errors)
- // res37: Either[String,browser.DocumentType] = Left(Too few items)
+ // res15: Either[String,browser.DocumentType] = Left(Too few items)
```

Validators matching errors were constructed above using an additional `result` parameter after the extractor. That value is returned wrapped in a `Left` if that particular error occurs during a validation.
@@ -299,7 +312,7 @@ As shown before in the Quick Start section, one can try if an extractor works in
```scala
// Try to extract an element with id "optional", return `None` if none exist
doc >?> element("#optional")
- // res39: Option[net.ruippeixotog.scalascraper.model.Element] = None
+ // res16: Option[net.ruippeixotog.scalascraper.model.Element] = None
```

Note that when using `>?>` with content extractors that return sequences, such as `texts` and `elements`, `None` will never be returned (`Some(Seq())` will be returned instead).
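A quick sketch of that behavior (assuming the usual `JsoupBrowser` and DSL imports; the selectors here are illustrative):

```scala
import net.ruippeixotog.scalascraper.browser.JsoupBrowser
import net.ruippeixotog.scalascraper.dsl.DSL._
import net.ruippeixotog.scalascraper.dsl.DSL.Extract._

val doc = JsoupBrowser().parseString("<html><body><p>hi</p></body></html>")

doc >?> element("#absent") // None: a single-element extractor can fail
doc >?> texts("#absent")   // Some(List()): sequence extractors always succeed
```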
@@ -309,40 +322,45 @@ If you want to use multiple extractors in a single document or element, you can
```scala
// Extract the text of the title element and all inputs of #myform
doc >> (text("title"), elementList("#myform input"))
- // res41: (String, List[net.ruippeixotog.scalascraper.model.Element]) = (Test page,List(JsoupElement(<input type="text" name="name" value="John">), JsoupElement(<input type="text" name="address">), JsoupElement(<input type="submit" value="Submit">)))
+ // res17: (String, List[net.ruippeixotog.scalascraper.model.Element]) = (Test page,List(JsoupElement(<input type="text" name="name" value="John">), JsoupElement(<input type="text" name="address">), JsoupElement(<input type="submit" value="Submit">)))
```

The extraction operators work on `List`, `Option`, `Either`, and other types for which a [Scalaz](https://github.com/scalaz/scalaz) `Functor` instance exists. The extraction occurs by mapping over the functors:

```scala
// Extract the titles of all documents in the list
List(doc, doc) >> text("title")
- // res43: List[String] = List(Test page, Test page)
+ // res18: List[String] = List(Test page, Test page)

// Extract the title if the document is a `Some`

Option(doc) >> text("title")
- // res45: Option[String] = Some(Test page)
+ // res19: Option[String] = Some(Test page)
```

You can apply other extractors and validators to the result of an extraction, which is particularly powerful combined with the feature shown above:

```scala
// From the "#menu" element, extract the text in the ".active" element inside
doc >> element("#menu") >> text(".active")
- // res47: String = Section 2
+ // res20: String = Section 2

// Same as above, but in a scenario where "#menu" can be absent

doc >?> element("#menu") >> text(".active")
- // res49: Option[String] = Some(Section 2)
+ // res21: Option[String] = Some(Section 2)

// Same as above, but check if the "#menu" has any "span" element before
// extracting the text
doc >?> element("#menu") >/~ validator("span")(_.nonEmpty) >> text(".active")
- // res52: Option[scala.util.Either[Unit,String]] = Some(Right(Section 2))
+ // res22: Option[scala.util.Either[Unit,String]] = Some(Right(Section 2))

// Extract the links inside all the "#menu > span" elements

doc >> elementList("#menu > span") >?> attr("href")("a")
- // res54: List[Option[String]] = List(Some(#home), Some(#section1), None, Some(#section3))
+ // res23: List[Option[String]] = List(Some(#home), Some(#section1), None, Some(#section3))
```

This library also provides a `Functor` for `HtmlExtractor`, making it possible to map over extractors and create chained extractors that can be passed around and stored like objects. For example, new extractors can be defined like this:
@@ -371,40 +389,44 @@ And they can be used just as extractors created using other means provided by th
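The chained extractor definitions are collapsed in this diff; a rough reconstruction consistent with the usage below (the names match, but the exact selectors and combinators are guesses):

```scala
// extract the optional <a href> inside every <span> of the document
val spanLinks = elementList("span") >?> attr("href")("a")

// map over the extractor (via its Functor) to count the links found
val spanLinksCount = spanLinks.map(_.flatten.size)

// the same link extraction, restricted to the "#menu" spans
val menuLinks = elementList("#menu > span") >?> attr("href")("a")
```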

```scala
doc >> spanLinks
- // res60: List[Option[String]] = List(Some(#home), Some(#section1), None, Some(#section3), None, None, None, None, None, Some(#), None)
+ // res24: List[Option[String]] = List(Some(#home), Some(#section1), None, Some(#section3), None, None, None, None, None, Some(#), None)

doc >> spanLinksCount
- // res61: Int = 4
+ // res25: Int = 4

doc >> menuLinks
- // res62: List[Option[String]] = List(Some(#home), Some(#section1), None, Some(#section3))
+ // res26: List[Option[String]] = List(Some(#home), Some(#section1), None, Some(#section3))
```

Just remember that you can only apply extraction operators `>>` and `>?>` to documents, elements or functors "containing" them, which means that the following is a compile-time error:

```scala
// The `texts` extractor extracts a list of strings and extractors cannot be
// applied to strings
doc >> texts("#menu > span") >> "a"
- // <console>:30: error: value >> is not a member of Iterable[String]
// doc >> texts("#menu > span") >> "a"
//                              ^
+ // On line 2: error: value >> is not a member of Iterable[String]
```

Finally, if you prefer to avoid symbolic operators for the sake of code legibility, you can use the equivalent named methods:

```scala
// `extract` is the same as `>>`
doc extract text("title")
- // res67: String = Test page
+ // res28: String = Test page

// `tryExtract` is the same as `>?>`

doc tryExtract element("#optional")
- // res69: Option[net.ruippeixotog.scalascraper.model.Element] = None
+ // res29: Option[net.ruippeixotog.scalascraper.model.Element] = None

// `validateWith` is the same as `>/~`

doc validateWith (succ, errors)
- // res71: Either[String,browser.DocumentType] = Left(Too few items)
+ // res30: Either[String,browser.DocumentType] = Left(Too few items)
```

## Using Browser-Specific Features
@@ -441,39 +463,56 @@ Note that extracting using CSS queries also keeps the concrete types of the elem
```scala
// same thing as above
typedDoc >> "#menu" >> "span:nth-child(2)" >> "a" >> pElement
- // res78: net.ruippeixotog.scalascraper.dsl.DSL.Extract.pElement.Out[net.ruippeixotog.scalascraper.browser.HtmlUnitBrowser.HtmlUnitElement] = HtmlUnitElement(HtmlAnchor[<a href="#section1">])
+ // res31: net.ruippeixotog.scalascraper.dsl.DSL.Extract.pElement.Out[net.ruippeixotog.scalascraper.browser.HtmlUnitBrowser.HtmlUnitElement] = HtmlUnitElement(HtmlAnchor[<a href="#section1">])
```

Concrete element types, like `HtmlUnitElement`, expose a public `underlying` field with the underlying element object used by the browser backend. In the case of HtmlUnit, that would be a [`DomElement`](http://htmlunit.sourceforge.net/apidocs/com/gargoylesoftware/htmlunit/html/DomElement.html), which exposes a whole new range of operations:

```scala
// extract the current "href" this "a" element points to
aElem >> attr("href")
- // res80: String = #section1
+ // res32: String = #section1

// use `underlying` to update the "href" attribute

aElem.underlying.setAttribute("href", "#section1_2")

// verify that "href" was updated

aElem >> attr("href")
- // res84: String = #section1_2
+ // res34: String = #section1_2

// get the location of the document (without the host and the full path parts)

typedDoc.location.split("/").last
- // res86: String = example.html
+ // res35: String = example.html

def click(elem: HtmlUnitElement) {
  // the type param may be needed, as the original API uses Java wildcards
  aElem.underlying.click[com.gargoylesoftware.htmlunit.Page]()
}
// warning: procedure syntax is deprecated: instead, add `: Unit =` to
// explicitly declare `click`'s return type
// click: (elem: net.ruippeixotog.scalascraper.browser.HtmlUnitBrowser.HtmlUnitElement)Unit

// simulate a click on our recently modified element

click(aElem)

// check the new location

typedDoc.location.split("/").last
- // res90: String = example.html#section1_2
+ // res37: String = example.html#section1_2
```

Using the typed element API provides much more flexibility when more than querying elements is required. However, one should avoid using it unless strictly necessary, as:
31 changes: 19 additions & 12 deletions build.sbt
@@ -3,8 +3,8 @@ import scalariform.formatter.preferences._

organization in ThisBuild := "net.ruippeixotog"

- scalaVersion in ThisBuild := "2.12.8"
- crossScalaVersions in ThisBuild := Seq("2.11.12", "2.12.8")
**Owner:** Is there any reason why you removed Scala 2.11? Did any library stop supporting it?

**Author:** Http4s does not seem to support it anymore.

**Owner:** Roughly 30% of scala-scraper downloads are for Scala 2.11, according to Sonatype. Dropping support for that Scala version just because a library that we only use in tests doesn't support it anymore doesn't seem like a valid reason, unfortunately 😞

This can be addressed by migrating scala-scraper tests to a different HTTP library. It's outside the scope of this PR, however. I'll try to do it next week in order to unblock this as soon as possible.

+ scalaVersion in ThisBuild := "2.13.0"
+ crossScalaVersions in ThisBuild := Seq("2.12.8", "2.13.0")

lazy val core = project.in(file("core"))
.enablePlugins(TutPlugin)
@@ -16,14 +16,22 @@ lazy val core = project.in(file("core"))
"com.github.nscala-time" %% "nscala-time" % "2.22.0",
"net.sourceforge.htmlunit" % "htmlunit" % "2.34.1",
"org.jsoup" % "jsoup" % "1.11.3",
- "org.scalaz" %% "scalaz-core" % "7.2.27",
- "org.http4s" %% "http4s-blaze-server" % "0.17.6" % "test",
- "org.http4s" %% "http4s-dsl" % "0.17.6" % "test",
+ "org.scalaz" %% "scalaz-core" % "7.2.28",
+ "org.http4s" %% "http4s-blaze-server" % "0.21.0-M4" % "test",
+ "org.http4s" %% "http4s-dsl" % "0.21.0-M4" % "test",
"org.slf4j" % "slf4j-nop" % "1.7.26" % "test",
"org.specs2" %% "specs2-core" % "4.5.1" % "test"),

tutTargetDirectory := file("."))

val baseScalacOptions = Seq(
"-deprecation",
"-unchecked",
"-feature",
"-language:implicitConversions",
"-language:higherKinds"
)

lazy val config = project.in(file("modules/config"))
.dependsOn(core)
.enablePlugins(TutPlugin)
@@ -45,13 +53,12 @@ lazy val commonSettings = Seq(
.setPreference(DoubleIndentConstructorArguments, true)
.setPreference(PlaceScaladocAsterisksBeneathSecondAsterisk, true),

- scalacOptions ++= Seq(
-   "-deprecation",
-   "-unchecked",
-   "-feature",
-   "-language:implicitConversions",
-   "-language:higherKinds",
-   "-Ypartial-unification"),
+ scalacOptions ++= baseScalacOptions ++ (CrossVersion
+   .partialVersion(scalaVersion.value) match {
+     case Some((2, scalaMajor)) if scalaMajor == 12 =>
+       Seq("-Ypartial-unification")
+     case _ => Seq.empty[String]
+   }),

fork in Test := true,
