[SPARK-29378][R] Upgrade SparkR to use Arrow 0.15 API #26555

dongjoon-hyun · 2019-11-16T06:58:01Z

What changes were proposed in this pull request?

[SPARK-29376] Upgrade Apache Arrow to version 0.15.1 upgrades to Arrow 0.15 at Scala/Java/Python. This PR aims to upgrade SparkR to use Arrow 0.15 API. Currently, it's broken.

Why are the changes needed?

First of all, it turns out that our Jenkins jobs (including PR builder) ignores Arrow test. Arrow 0.15 has a breaking R API changes at ARROW-5505 and we missed that. AppVeyor was the only one having SparkR Arrow tests but it's broken now.

Jenkins

Skipped ------------------------------------------------------------------------
1. createDataFrame/collect Arrow optimization (@test_sparkSQL_arrow.R#25)
- arrow not installed

Second, Arrow throws OOM on AppVeyor environment (Windows JDK8) like the following because it still has Arrow 0.14.

Warnings -----------------------------------------------------------------------
1. createDataFrame/collect Arrow optimization (@test_sparkSQL_arrow.R#39) - createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.sparkr.enabled' is set to true; however, failed, attempting non-optimization. Reason: Error in handleErrors(returnStatus, conn): java.lang.OutOfMemoryError: Java heap space
	at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
	at java.nio.ByteBuffer.allocate(ByteBuffer.java:335)
	at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:669)
	at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$3.readNextBatch(ArrowConverters.scala:243)

It is due to the version mismatch.

int messageLength = MessageSerializer.bytesToInt(buffer.array());
if (messageLength == IPC_CONTINUATION_TOKEN) {
  buffer.clear();
  // ARROW-6313, if the first 4 bytes are continuation message, read the next 4 for the length
  if (in.readFully(buffer) == 4) {
    messageLength = MessageSerializer.bytesToInt(buffer.array());
  }
}

// Length of 0 indicates end of stream
if (messageLength != 0) {
  // Read the message into the buffer.
  ByteBuffer messageBuffer = ByteBuffer.allocate(messageLength);

After upgrading this to 0.15, we are hitting ARROW-5505. This PR upgrades Arrow version in AppVeyor and fix the issue.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Pass the AppVeyor.

This PR passed here.

https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/28909044

SparkSQL Arrow optimization: Spark package found in SPARK_HOME: C:\projects\spark\bin\..
................

dongjoon-hyun · 2019-11-16T08:06:31Z

AppVeyor is broken as we see in this PR.

https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/28902071

srowen · 2019-11-16T13:01:14Z

It seems like a good idea to match these, yeah. Different error now?

6. Error: gapply() Arrow optimization - type specification (date and timestamp) (@test_sparkSQL_arrow.R#285) 
invalid 'n' argument
1: tryCatch({
       ret <- gapply(df, "a", function(key, grouped) {
           grouped
       }, schema(df))
       expect_equal(collect(ret), rdf)
   }, finally = {
       callJMethod(conf, "set", "spark.sql.execution.arrow.sparkr.enabled", arrowEnabled)
   }) at C:\projects\spark\bin\../R/pkg/tests/fulltests/test_sparkSQL_arrow.R:285
2: tryCatchList(expr, classes, parentenv, handlers)
3: expect_equal(collect(ret), rdf) at C:\projects\spark\bin\../R/pkg/tests/fulltests/test_sparkSQL_arrow.R:289
4: compare(object, expected, ...)
5: collect(ret)
6: collect(ret)
7: .local(x, ...)
8: tryCatch({
       doServerAuth(conn, authSecret)
       arrowTable <- read_arrow(readRaw(conn))
       if (useAsTibble) {
           as_tibble <- get("as_tibble", envir = asNamespace("arrow"))
           as.data.frame(as_tibble(arrowTable), stringsAsFactors = stringsAsFactors)
       }
       else {
           as.data.frame(arrowTable, stringsAsFactors = stringsAsFactors)
       }
   }, finally = {
       close(conn)
   })
9: tryCatchList(expr, classes, parentenv, handlers)
10: read_arrow(readRaw(conn))
11: as.data.frame(read_table(stream))
12: read_table(stream)
13: readRaw(conn)
14: readBin(con, raw(), as.integer(dataLen), endian = "big")

Still seems like some protocol / version mismatch. I don't know much about this but looks like readRaw gets back info from conn that causes it to try to read an invalid amount, like -1. I think that kind of "-1" return has something to do with being out of memory, so maybe back where we started.

Does the machine need more memory or is that irrelevant?

dongjoon-hyun · 2019-11-16T16:44:09Z

Correct. The previous one was due to version mismatch. messageLength seems to be read in a wrong way.

      int messageLength = MessageSerializer.bytesToInt(buffer.array());
      if (messageLength == IPC_CONTINUATION_TOKEN) {
        buffer.clear();
        // ARROW-6313, if the first 4 bytes are continuation message, read the next 4 for the length
        if (in.readFully(buffer) == 4) {
          messageLength = MessageSerializer.bytesToInt(buffer.array());
        }
      }

      // Length of 0 indicates end of stream
      if (messageLength != 0) {

        // Read the message into the buffer.
        ByteBuffer messageBuffer = ByteBuffer.allocate(messageLength);

And, yes. Now, with this PR, it seems to show the previous blocker of SPARK-29378?

1. createDataFrame/collect Arrow optimization (@test_sparkSQL_arrow.R#39) - createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.sparkr.enabled' is set to true; however, failed, attempting non-optimization.
Reason: Error in arrow::FileOutputStream(fileName): attempt to apply non-function

dongjoon-hyun · 2019-11-16T17:10:00Z

R/pkg/R/SQLContext.R

          schema <- batch$schema
-          stream_writer <- arrow::RecordBatchStreamWriter(stream, schema)
+          stream_writer <- arrow::RecordBatchStreamWriter$create(stream, schema)


ARROW-5505 changes their R API. cc @srowen , @HyukjinKwon , @felixcheung

@shaneknapp . It seems that our Jenkins doesn't have R Arrow installation and skip these test.

This is not Windows/JDK8 issue. Arrow 0.15 has a breaking change in R interface.

@shaneknapp . It seems that our Jenkins doesn't have R Arrow installation and skip these test.

this is correct. should i infer that we need to add R Arrow to our worker nodes?

Actually, I configured AppVeyor to test Arrow related ones in R for now. So, at least it's being tested in AppVeyor.

Yes, it would be nice if we have Arrow R installed in Jenkins workers to test it out; however, I remember it's a bit tricky to install Arrow R library for now (you cannot just install via CRAN but needs some manual preparation - see https://github.com/apache/arrow/tree/master/r#installation).

I am worried if it messes up the current Jenkins worker's environment. If you see any concern, @shaneknapp, you can just don't install it for now. It's being tested via AppVeyor at least so it's neither a must or urgent.

actually, R arrow installed easily and w/o issue (surprising, i know!). :)

sknapp@ubuntu-testing:~$ Rscript -e "packageVersion('arrow')" [1] ‘0.15.1.1’

i will install this on the remaining workers later today/tomorrow.

Thank you, @shaneknapp !

### What changes were proposed in this pull request? This PR aims to ignore `GitHub Action` and `AppVeyor` file changes. When we touch these files, Jenkins job should not trigger a full testing. ### Why are the changes needed? Currently, these files are categorized to `root` and trigger the full testing and ends up wasting the Jenkins resources. - #26555 ``` [info] Using build tool sbt with Hadoop profile hadoop2.7 under environment amplab_jenkins From https://github.com/apache/spark * [new branch] master -> master [info] Found the following changed modules: sparkr, root [info] Setup the following environment variables for tests: ``` ### Does this PR introduce any user-facing change? No. (Jenkins testing only). ### How was this patch tested? Manually. ``` $ dev/run-tests.py -h -v ... Trying: [x.name for x in determine_modules_for_files([".github/workflows/master.yml", "appveyor.xml"])] Expecting: [] ... ``` Closes #26556 from dongjoon-hyun/SPARK-IGNORE-APPVEYOR. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>

BryanCutler

Thanks for fixing this @dongjoon-hyun !

BryanCutler · 2019-11-16T18:18:46Z

appveyor.yml

@@ -43,9 +43,8 @@ install:
  - ps: .\dev\appveyor-install-dependencies.ps1
  # Required package for R unit tests
  - cmd: R -e "install.packages(c('knitr', 'rmarkdown', 'e1071', 'survival'), repos='https://cloud.r-project.org/')"


Would it be better to restore this line from #26041 to install arrow?

cmd: R -e "install.packages(c('knitr', 'rmarkdown', 'e1071', 'survival', 'arrow'), repos='https://cloud.r-project.org/')"

Thanks. Yes. That will be better.

dongjoon-hyun · 2019-11-16T18:34:50Z

For the other errors, I'm working on.

SparkQA · 2019-11-16T20:22:33Z

Test build #113945 has finished for PR 26555 at commit 96f20f8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dongjoon-hyun · 2019-11-16T21:01:23Z

Now, this PR passed all R tests (Jenkins + AppVeyor including Arrow 0.15).
Could you review this PR again, @srowen , @BryanCutler , @HyukjinKwon , @felixcheung , @gatorsmile ?

viirya

I think this means the minimum version is 0.15? Then we should also update document like sparkr.md.

dongjoon-hyun · 2019-11-17T01:57:08Z

There is already a documentation JIRA for that, @viirya

https://issues.apache.org/jira/browse/SPARK-29924

viirya · 2019-11-17T02:07:01Z

@dongjoon-hyun Thanks! I see. Then let's update the R document in that.

viirya · 2019-11-17T02:22:39Z

From the Arrow PR apache/arrow#5279, this change looks good.

dongjoon-hyun

Thank you all for review. I'll merge this PR to recover the AppVeyor build and test.
Since it's only for Window-environment, we need to enable this arrow test in Jenkins to protect our branch.

HyukjinKwon

Havnt taken a close look but looks good. Thanks @dongjoon-hyun

dongjoon-hyun · 2019-11-17T06:43:38Z

Thanks, @HyukjinKwon !

HyukjinKwon · 2019-11-18T02:01:39Z

appveyor.yml

@@ -42,10 +42,8 @@ install:
  # Install maven and dependencies
  - ps: .\dev\appveyor-install-dependencies.ps1
  # Required package for R unit tests
-  - cmd: R -e "install.packages(c('knitr', 'rmarkdown', 'e1071', 'survival'), repos='https://cloud.r-project.org/')"
-  # Use Arrow R 0.14.1 for now. 0.15.0 seems not working for now. See SPARK-29378.
+  - cmd: R -e "install.packages(c('knitr', 'rmarkdown', 'e1071', 'survival', 'arrow'), repos='https://cloud.r-project.org/')"
  - cmd: R -e "install.packages(c('assertthat', 'bit64', 'fs', 'purrr', 'R6', 'tidyselect'), repos='https://cloud.r-project.org/')"


This line was to install Arrow dependencies manually (because devtools started to require testthat dependency as 2.0.0+; however, SparkR requires 1.0.2). So, I think we can remove this line too. Let me make a quick followup.

Thanks. Go ahead!

Made a followup - #26566. For clarification It actually makes no diff but only less codes.

…dencies in AppVeyor build ### What changes were proposed in this pull request? This PR remove manual installation of Arrow dependencies in AppVeyor build ### Why are the changes needed? It's unnecessary. See #26555 (comment) ### Does this PR introduce any user-facing change? No ### How was this patch tested? AppVeyor will test. Closes #26566 from HyukjinKwon/SPARK-29378. Authored-by: HyukjinKwon <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>

### What changes were proposed in this pull request? This PR aims to ignore `GitHub Action` and `AppVeyor` file changes. When we touch these files, Jenkins job should not trigger a full testing. ### Why are the changes needed? Currently, these files are categorized to `root` and trigger the full testing and ends up wasting the Jenkins resources. - #26555 ``` [info] Using build tool sbt with Hadoop profile hadoop2.7 under environment amplab_jenkins From https://github.com/apache/spark * [new branch] master -> master [info] Found the following changed modules: sparkr, root [info] Setup the following environment variables for tests: ``` ### Does this PR introduce any user-facing change? No. (Jenkins testing only). ### How was this patch tested? Manually. ``` $ dev/run-tests.py -h -v ... Trying: [x.name for x in determine_modules_for_files([".github/workflows/master.yml", "appveyor.xml"])] Expecting: [] ... ``` Closes #26556 from dongjoon-hyun/SPARK-IGNORE-APPVEYOR. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]> (cherry picked from commit d0470d6) Signed-off-by: Dongjoon Hyun <[email protected]>

[[SPARK-29376] Upgrade Apache Arrow to version 0.15.1](apache#26133) upgrades to Arrow 0.15 at Scala/Java/Python. This PR aims to upgrade `SparkR` to use Arrow 0.15 API. Currently, it's broken. First of all, it turns out that our Jenkins jobs (including PR builder) ignores Arrow test. Arrow 0.15 has a breaking R API changes at [ARROW-5505](https://issues.apache.org/jira/browse/ARROW-5505) and we missed that. AppVeyor was the only one having SparkR Arrow tests but it's broken now. **Jenkins** ``` Skipped ------------------------------------------------------------------------ 1. createDataFrame/collect Arrow optimization (test_sparkSQL_arrow.R#25) - arrow not installed ``` Second, Arrow throws OOM on AppVeyor environment (Windows JDK8) like the following because it still has Arrow 0.14. ``` Warnings ----------------------------------------------------------------------- 1. createDataFrame/collect Arrow optimization (test_sparkSQL_arrow.R#39) - createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.sparkr.enabled' is set to true; however, failed, attempting non-optimization. Reason: Error in handleErrors(returnStatus, conn): java.lang.OutOfMemoryError: Java heap space at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57) at java.nio.ByteBuffer.allocate(ByteBuffer.java:335) at org.apache.arrow.vector.ipc.message.MessageSerializer.readMessage(MessageSerializer.java:669) at org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$3.readNextBatch(ArrowConverters.scala:243) ``` It is due to the version mismatch. ```java int messageLength = MessageSerializer.bytesToInt(buffer.array()); if (messageLength == IPC_CONTINUATION_TOKEN) { buffer.clear(); // ARROW-6313, if the first 4 bytes are continuation message, read the next 4 for the length if (in.readFully(buffer) == 4) { messageLength = MessageSerializer.bytesToInt(buffer.array()); } } // Length of 0 indicates end of stream if (messageLength != 0) { // Read the message into the buffer. ByteBuffer messageBuffer = ByteBuffer.allocate(messageLength); ``` After upgrading this to 0.15, we are hitting ARROW-5505. This PR upgrades Arrow version in AppVeyor and fix the issue. No. Pass the AppVeyor. This PR passed here. - https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark/builds/28909044 ``` SparkSQL Arrow optimization: Spark package found in SPARK_HOME: C:\projects\spark\bin\.. ................ ``` Closes apache#26555 from dongjoon-hyun/SPARK-R-TEST. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>

AppVeyor test

3519b23

dongjoon-hyun mentioned this pull request Nov 16, 2019

[SPARK-29376][SQL][PYTHON] Upgrade Apache Arrow to version 0.15.1 #26133

Closed

dongjoon-hyun changed the title ~~AppVeyor test~~ AppVeyor Arrow test Nov 16, 2019

dongjoon-hyun changed the title ~~AppVeyor Arrow test~~ [WIP] AppVeyor Arrow test Nov 16, 2019

dongjoon-hyun changed the title ~~[WIP] AppVeyor Arrow test~~ [WIP] AppVeyor Arrow test (JDK8) Nov 16, 2019

This comment has been minimized.

Sign in to view

dongjoon-hyun mentioned this pull request Nov 16, 2019

[SPARK-29923][SQL][TESTS] Set io.netty.tryReflectionSetAccessible for Arrow on JDK9+ #26552

Closed

0.15.1 test

25c46a0

dongjoon-hyun mentioned this pull request Nov 16, 2019

[MINOR][TESTS] Ignore GitHub Action and AppVeyor file changes in testing #26556

Closed

This comment has been minimized.

Sign in to view

dongjoon-hyun mentioned this pull request Nov 16, 2019

[SPARK-29403][INFRA][R] Uses Arrow R 0.14.1 in AppVeyor for now #26041

Closed

Update API

b486093

dongjoon-hyun changed the title ~~[WIP] AppVeyor Arrow test (JDK8)~~ [SPARK-29378][R] Upgrade SparkR to use Arrow 0.15 API Nov 16, 2019

dongjoon-hyun commented Nov 16, 2019

View reviewed changes

Revert unneeded

a4dc812

dongjoon-hyun added the SPARKR label Nov 16, 2019

This comment has been minimized.

Sign in to view

BryanCutler reviewed Nov 16, 2019

View reviewed changes

Address comment

ea5304d

This comment has been minimized.

Sign in to view

Update RecordBatchStreamReader

96f20f8

dongjoon-hyun changed the title ~~[SPARK-29378][R] Upgrade SparkR to use Arrow 0.15 API~~ [WIP][SPARK-29378][R] Upgrade SparkR to use Arrow 0.15 API Nov 16, 2019

dongjoon-hyun changed the title ~~[WIP][SPARK-29378][R] Upgrade SparkR to use Arrow 0.15 API~~ [SPARK-29378][R] Upgrade SparkR to use Arrow 0.15 API Nov 16, 2019

viirya reviewed Nov 17, 2019

View reviewed changes

dongjoon-hyun commented Nov 17, 2019

View reviewed changes

dongjoon-hyun closed this in cc12cf6 Nov 17, 2019

dongjoon-hyun deleted the SPARK-R-TEST branch November 17, 2019 02:29

HyukjinKwon reviewed Nov 17, 2019

View reviewed changes

HyukjinKwon reviewed Nov 18, 2019

View reviewed changes

HyukjinKwon mentioned this pull request Nov 18, 2019

[SPARK-29378][R][FOLLOW-UP] Remove manual installation of Arrow dependencies in AppVeyor build #26566

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-29378][R] Upgrade SparkR to use Arrow 0.15 API #26555

[SPARK-29378][R] Upgrade SparkR to use Arrow 0.15 API #26555

dongjoon-hyun commented Nov 16, 2019 •

edited

Loading

This comment has been minimized.

dongjoon-hyun commented Nov 16, 2019 •

edited

Loading

This comment has been minimized.

srowen commented Nov 16, 2019

dongjoon-hyun commented Nov 16, 2019

dongjoon-hyun Nov 16, 2019 •

edited

Loading

dongjoon-hyun Nov 16, 2019

dongjoon-hyun Nov 16, 2019

shaneknapp Nov 19, 2019

HyukjinKwon Nov 20, 2019

shaneknapp Nov 20, 2019

dongjoon-hyun Nov 20, 2019

HyukjinKwon Nov 21, 2019

This comment has been minimized.

BryanCutler left a comment

BryanCutler Nov 16, 2019

dongjoon-hyun Nov 16, 2019

dongjoon-hyun commented Nov 16, 2019

This comment has been minimized.

SparkQA commented Nov 16, 2019

dongjoon-hyun commented Nov 16, 2019

viirya left a comment

dongjoon-hyun commented Nov 17, 2019

viirya commented Nov 17, 2019

viirya commented Nov 17, 2019

dongjoon-hyun left a comment

HyukjinKwon left a comment •

edited

Loading

dongjoon-hyun commented Nov 17, 2019

HyukjinKwon Nov 18, 2019

dongjoon-hyun Nov 18, 2019

HyukjinKwon Nov 18, 2019

[SPARK-29378][R] Upgrade SparkR to use Arrow 0.15 API #26555

[SPARK-29378][R] Upgrade SparkR to use Arrow 0.15 API #26555

Conversation

dongjoon-hyun commented Nov 16, 2019 • edited Loading

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

This comment has been minimized.

dongjoon-hyun commented Nov 16, 2019 • edited Loading

This comment has been minimized.

srowen commented Nov 16, 2019

dongjoon-hyun commented Nov 16, 2019

dongjoon-hyun Nov 16, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

This comment has been minimized.

BryanCutler left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dongjoon-hyun commented Nov 16, 2019

This comment has been minimized.

SparkQA commented Nov 16, 2019

dongjoon-hyun commented Nov 16, 2019

viirya left a comment

Choose a reason for hiding this comment

dongjoon-hyun commented Nov 17, 2019

viirya commented Nov 17, 2019

viirya commented Nov 17, 2019

dongjoon-hyun left a comment

Choose a reason for hiding this comment

HyukjinKwon left a comment • edited Loading

Choose a reason for hiding this comment

dongjoon-hyun commented Nov 17, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dongjoon-hyun commented Nov 16, 2019 •

edited

Loading

dongjoon-hyun commented Nov 16, 2019 •

edited

Loading

dongjoon-hyun Nov 16, 2019 •

edited

Loading

HyukjinKwon left a comment •

edited

Loading