-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-29339][R] Support Arrow 0.14 in vectoried dapply and gapply (test it in AppVeyor build) #25993
Conversation
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
4b7b427
to
1e2440c
Compare
@@ -50,7 +50,7 @@ compute <- function(mode, partition, serializer, deserializer, key, | |||
} else { | |||
# Check to see if inputData is a valid data.frame | |||
stopifnot(deserializer == "byte" || deserializer == "arrow") | |||
stopifnot(class(inputData) == "data.frame") | |||
stopifnot(is.data.frame(inputData)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Arrow 0.14, seems as.data.frame(arrow_batch)
returns non-naive R DataFrame ("tbl_df tbl data.frame"). So, direct class comparison doesn't look working here.
cc @felixcheung and @viirya |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LG
- cmd: R -e "install.packages('https://cloud.r-project.org/src/contrib/Archive/testthat/testthat_1.0.2.tar.gz', repos=NULL, type='source')" | ||
- cmd: R -e "packageVersion('knitr'); packageVersion('rmarkdown'); packageVersion('testthat'); packageVersion('e1071'); packageVersion('survival')" | ||
- cmd: R -e "packageVersion('knitr'); packageVersion('rmarkdown'); packageVersion('testthat'); packageVersion('e1071'); packageVersion('survival'); packageVersion('arrow')" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we specify the arrow version to install? just wonder about compatibility issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe .. it will needs to install devtools
which upgrades testthat
... so it's a bit tricky to install a specific version for now.
I was thinking about testing the latest in AppVeyor and old version at Jenkins anyway. should be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SG
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Rscript -e 'install.packages("arrow", repos="https://cloud.r-project.org/")' | ||
``` | ||
|
||
If you need to install old versions, it should be installed directly from Github. You can use `remotes::install_github` as below. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If installing old versions, above requireNamespace changes are still valid?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's fine without requireNamespace changes, it will work because the workaround was used due to missing arrow
CRAN. Since it's on CRAN, it will be able to check CRAN against SparkR properly now.
For users, CRAN check won't matter anyway. So should be fine.
I checked that CRAN passed in my local and dapply
and gapply
work manually.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
d8f6460
to
e4c755c
Compare
R/check-cran.sh
Outdated
@@ -65,6 +65,10 @@ fi | |||
|
|||
echo "Running CRAN check with $CRAN_CHECK_OPTIONS options" | |||
|
|||
# Remove this environment variable to allow to check suggested packages once | |||
# Jenkins installs arrow. See SPARK-29339. | |||
_R_CHECK_FORCE_SUGGESTS_=FALSE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I manually checked that CRAN check passes in my local without this environment variable
In Jenkins, it seems failed because Arrow R library itself is not found on Jenkins somehow
Package suggested but not available: ���arrow���
The suggested packages are required for a complete check.
Checking can be attempted without them by setting the environment
variable _R_CHECK_FORCE_SUGGESTS_ to a false value.
I plan to change the minimum version soon (as it's experimental, it might be fine). I will work with Shane to install Arrow R library in Jenkins after that.
@BryanCutler, BTW, shall we upgrade Arrow in Spark from 0.12 to 0.14 at pom.xml?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arrow 0.15.0 will be released in a couple days, so I think we should wait for that. It has an important forwards-compatibility fix in it that will be good to have.
e4c755c
to
a31d18d
Compare
a31d18d
to
ab05615
Compare
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Test build #111735 has finished for PR 25993 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. FYI, Arrow 0.15.0 will be release in a few days and a 1.0.0 release is planned to be soon after. I think we should upgrade Java to 0.15.0 and make sure older versions of pyarrow are still ok (I've been doing some tests and I think it should be fine). If there are no issues, we might hold off on increasing the minimum version until 1.0.0 is released, but we can discuss it more later.
The last change is unrelated to AppVeyor test failure (timeout). Merged to master. Thanks, @felixcheung, @viirya and @BryanCutler |
Guys, just FYI, I am preparing some benchmarks and notebooks at https://github.com/HyukjinKwon/spark-notebooks for Spark summit It has a live notebook that installed Arrow R and PyArrow with the latest master of Spark. You can try out right away in a Jupyter notebook here: |
Very cool
…________________________________
From: Hyukjin Kwon <[email protected]>
Sent: Saturday, October 5, 2019 6:23:57 AM
To: apache/spark <[email protected]>
Cc: Felix Cheung <[email protected]>; Mention <[email protected]>
Subject: Re: [apache/spark] [SPARK-29339][R] Support Arrow 0.14 in vectoried dapply and gapply (test it in AppVeyor build) (#25993)
Guys, just FYI, I am preparing some benchmarks and notebooks at https://github.com/HyukjinKwon/spark-notebooks.
It has a live notebook that installed Arrow R and PyArrow with the latest master of Spark. You can try out right away in a Jupyter notebook here: [Binder] <https://mybinder.org/v2/gh/HyukjinKwon/spark-notebooks/master?filepath=SAIS-2019.ipynb> . It takes a couple of minutes to launch a Jupyter server.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<#25993?email_source=notifications&email_token=ACENZ67RI54ZHW36PNUMFN3QNCIO3A5CNFSM4I4RKZYKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEANSGHI#issuecomment-538649373>, or mute the thread<https://github.com/notifications/unsubscribe-auth/ACENZ64DHQTLGKDZKSIBDE3QNCIO3ANCNFSM4I4RKZYA>.
|
…est it in AppVeyor build) This PR proposes: 1. Use `is.data.frame` to check if it is a DataFrame. 2. to install Arrow and test Arrow optimization in AppVeyor build. We're currently not testing this in CI. 1. To support SparkR with Arrow 0.14 2. To check if there's any regression and if it works correctly. ```r df <- createDataFrame(mtcars) collect(dapply(df, function(rdf) { data.frame(rdf$gear + 1) }, structType("gear double"))) ``` **Before:** ``` Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : invalid 'n' argument ``` **After:** ``` gear 1 5 2 5 3 5 4 4 5 4 6 4 7 4 8 5 9 5 ... ``` AppVeyor Closes apache#25993 from HyukjinKwon/arrow-r-appveyor. Authored-by: HyukjinKwon <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
…est it in AppVeyor build) This PR proposes: 1. Use `is.data.frame` to check if it is a DataFrame. 2. to install Arrow and test Arrow optimization in AppVeyor build. We're currently not testing this in CI. 1. To support SparkR with Arrow 0.14 2. To check if there's any regression and if it works correctly. ```r df <- createDataFrame(mtcars) collect(dapply(df, function(rdf) { data.frame(rdf$gear + 1) }, structType("gear double"))) ``` **Before:** ``` Error in readBin(con, raw(), as.integer(dataLen), endian = "big") : invalid 'n' argument ``` **After:** ``` gear 1 5 2 5 3 5 4 4 5 4 6 4 7 4 8 5 9 5 ... ``` AppVeyor Closes apache#25993 from HyukjinKwon/arrow-r-appveyor. Authored-by: HyukjinKwon <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
What changes were proposed in this pull request?
This PR proposes:
is.data.frame
to check if it is a DataFrame.Why are the changes needed?
Does this PR introduce any user-facing change?
Before:
After:
How was this patch tested?
AppVeyor