
[SPARK-29339][R] Support Arrow 0.14 in vectorized dapply and gapply (test it in AppVeyor build) #25993

Closed
wants to merge 2 commits into apache:master from HyukjinKwon:arrow-r-appveyor

Conversation

HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented Oct 2, 2019

What changes were proposed in this pull request?

This PR proposes:

  1. Use `is.data.frame` to check whether the input is a data.frame.
  2. Install Arrow and test Arrow optimization in the AppVeyor build; we're currently not testing this in CI.

Why are the changes needed?

  1. To support SparkR with Arrow 0.14
  2. To check if there's any regression and if it works correctly.

Does this PR introduce any user-facing change?

```r
df <- createDataFrame(mtcars)
collect(dapply(df, function(rdf) { data.frame(rdf$gear + 1) }, structType("gear double")))
```

Before:

```
Error in readBin(con, raw(), as.integer(dataLen), endian = "big") :
  invalid 'n' argument
```

After:

```
   gear
1     5
2     5
3     5
4     4
5     4
6     4
7     4
8     5
9     5
...
```

How was this patch tested?

AppVeyor


```diff
@@ -50,7 +50,7 @@ compute <- function(mode, partition, serializer, deserializer, key,
 } else {
   # Check to see if inputData is a valid data.frame
   stopifnot(deserializer == "byte" || deserializer == "arrow")
-  stopifnot(class(inputData) == "data.frame")
+  stopifnot(is.data.frame(inputData))
```
Member Author


In Arrow 0.14, `as.data.frame(arrow_batch)` seems to return something other than a plain R data.frame (its class is `c("tbl_df", "tbl", "data.frame")`), so the direct class comparison no longer works here.
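A minimal sketch of the difference; the tibble-style class vector is constructed by hand here as a stand-in for what `as.data.frame(arrow_batch)` returns in Arrow 0.14:

```r
# A plain list given the tibble-style class vector, standing in for the
# "tbl_df tbl data.frame" object Arrow 0.14 produces.
x <- structure(list(gear = c(4, 4)),
               class = c("tbl_df", "tbl", "data.frame"),
               row.names = c(NA, -2L))

class(x) == "data.frame"  # a length-3 logical vector, not a single TRUE,
                          # so stopifnot(class(x) == "data.frame") errors
is.data.frame(x)          # TRUE: checks inheritance from "data.frame"
```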

@HyukjinKwon HyukjinKwon changed the title [DO-NOT-MERGE][R] Install Arrow and test Arrow optimization in AppVeyor build [SPARK-29339][R] Support Arrow 0.14 in vectorized dapply and gapply (test it in AppVeyor build) Oct 3, 2019
@HyukjinKwon
Member Author

cc @felixcheung and @viirya

Member

@felixcheung felixcheung left a comment


LG

```diff
 - cmd: R -e "install.packages('https://cloud.r-project.org/src/contrib/Archive/testthat/testthat_1.0.2.tar.gz', repos=NULL, type='source')"
-- cmd: R -e "packageVersion('knitr'); packageVersion('rmarkdown'); packageVersion('testthat'); packageVersion('e1071'); packageVersion('survival')"
+- cmd: R -e "packageVersion('knitr'); packageVersion('rmarkdown'); packageVersion('testthat'); packageVersion('e1071'); packageVersion('survival'); packageVersion('arrow')"
```
Member


Should we specify the arrow version to install? Just wondering about compatibility issues.

Member Author


Maybe. It would need to install devtools, which upgrades testthat, so it's a bit tricky to install a specific version for now.

I was thinking of testing the latest version in AppVeyor and an old version on Jenkins anyway, so it should be fine.

Member


SG


```
Rscript -e 'install.packages("arrow", repos="https://cloud.r-project.org/")'
```

If you need to install old versions, they should be installed directly from GitHub. You can use `remotes::install_github` as below.
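For illustration only, a sketch of such an install. The `subdir` and `ref` values are assumptions about the apache/arrow repository layout and tag naming, not taken from this PR:

```shell
# Sketch: install an older arrow R package directly from GitHub.
# "r" subdirectory and "apache-arrow-0.14.1" tag are assumed names.
Rscript -e 'remotes::install_github("apache/arrow", subdir = "r", ref = "apache-arrow-0.14.1")'
```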
Member


If installing old versions, are the `requireNamespace` changes above still valid?

Member Author

@HyukjinKwon HyukjinKwon Oct 3, 2019


I think it's fine without the `requireNamespace` changes. They were a workaround for arrow being missing from CRAN; now that it's on CRAN, the CRAN check can resolve it against SparkR properly.

For users, the CRAN check won't matter anyway, so it should be fine.

I checked locally that the CRAN check passes, and manually verified that `dapply` and `gapply` work.


R/check-cran.sh (outdated)

```diff
@@ -65,6 +65,10 @@ fi

 echo "Running CRAN check with $CRAN_CHECK_OPTIONS options"

+# Remove this environment variable to allow to check suggested packages once
+# Jenkins installs arrow. See SPARK-29339.
+_R_CHECK_FORCE_SUGGESTS_=FALSE
```
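Standalone, the effect of that variable on a CRAN check looks like the sketch below. The tarball name is illustrative, and the `R CMD check` invocation itself is left commented out:

```shell
# Relax R CMD check so that missing *suggested* packages (such as 'arrow')
# do not fail the run; check-cran.sh sets the same variable.
export _R_CHECK_FORCE_SUGGESTS_=FALSE

# R CMD check --as-cran SparkR_3.0.0.tar.gz   # illustrative tarball name
echo "_R_CHECK_FORCE_SUGGESTS_=${_R_CHECK_FORCE_SUGGESTS_}"
```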
Member Author

@HyukjinKwon HyukjinKwon Oct 3, 2019


I manually checked that the CRAN check passes locally without this environment variable.

On Jenkins, it seems to fail because the Arrow R library itself is somehow not found:

Package suggested but not available: 'arrow'

The suggested packages are required for a complete check.
Checking can be attempted without them by setting the environment
variable _R_CHECK_FORCE_SUGGESTS_ to a false value.

I plan to change the minimum version soon (as it's experimental, it might be fine). I will work with Shane to install the Arrow R library on Jenkins after that.

@BryanCutler, BTW, shall we upgrade Arrow in Spark from 0.12 to 0.14 in pom.xml?

Member


Arrow 0.15.0 will be released in a couple of days, so I think we should wait for that. It has an important forward-compatibility fix that will be good to have.


@SparkQA

SparkQA commented Oct 3, 2019

Test build #111735 has finished for PR 25993 at commit 3cdbf56.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@BryanCutler BryanCutler left a comment


LGTM. FYI, Arrow 0.15.0 will be released in a few days, and a 1.0.0 release is planned soon after. I think we should upgrade Java to 0.15.0 and make sure older versions of PyArrow are still OK (I've been doing some tests and I think it should be fine). If there are no issues, we might hold off on increasing the minimum version until 1.0.0 is released, but we can discuss it more later.

@HyukjinKwon
Member Author

The last change is unrelated to AppVeyor test failure (timeout).

Merged to master.

Thanks, @felixcheung, @viirya and @BryanCutler

@HyukjinKwon
Member Author

HyukjinKwon commented Oct 5, 2019

Guys, just FYI, I am preparing some benchmarks and notebooks at https://github.com/HyukjinKwon/spark-notebooks for the Spark Summit.

It has a live notebook with the Arrow R package and PyArrow installed against the latest Spark master. You can try it out right away in a Jupyter notebook here: Binder. It takes a couple of minutes to launch the Jupyter server.

@felixcheung
Member

felixcheung commented Oct 5, 2019 via email

@HyukjinKwon HyukjinKwon deleted the arrow-r-appveyor branch March 3, 2020 01:17
rshkv pushed a commit to palantir/spark that referenced this pull request Jun 24, 2020
…est it in AppVeyor build)

This PR proposes:

1. Use `is.data.frame` to check if it is a DataFrame.
2. to install Arrow and test Arrow optimization in AppVeyor build. We're currently not testing this in CI.

1. To support SparkR with Arrow 0.14
2. To check if there's any regression and if it works correctly.

```r
df <- createDataFrame(mtcars)
collect(dapply(df, function(rdf) { data.frame(rdf$gear + 1) }, structType("gear double")))
```

**Before:**

```
Error in readBin(con, raw(), as.integer(dataLen), endian = "big") :
  invalid 'n' argument
```

**After:**

```
   gear
1     5
2     5
3     5
4     4
5     4
6     4
7     4
8     5
9     5
...
```

AppVeyor

Closes apache#25993 from HyukjinKwon/arrow-r-appveyor.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
rshkv pushed a commit to palantir/spark that referenced this pull request Jun 29, 2020
(same commit message as above)
6 participants