
[SPARK-29339][R] Support Arrow 0.14 in vectorized dapply and gapply (test it in AppVeyor build) #25993

Closed
wants to merge 2 commits into apache:master from HyukjinKwon:arrow-r-appveyor

Conversation

HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented Oct 2, 2019

What changes were proposed in this pull request?

This PR proposes:

  1. Use `is.data.frame` to check whether the input is a data.frame.
  2. Install Arrow and test Arrow optimization in the AppVeyor build; we're currently not testing this in CI.

Why are the changes needed?

  1. To support SparkR with Arrow 0.14
  2. To check if there's any regression and if it works correctly.

Does this PR introduce any user-facing change?

```r
df <- createDataFrame(mtcars)
collect(dapply(df, function(rdf) { data.frame(rdf$gear + 1) }, structType("gear double")))
```

Before:

```
Error in readBin(con, raw(), as.integer(dataLen), endian = "big") :
  invalid 'n' argument
```

After:

```
   gear
1     5
2     5
3     5
4     4
5     4
6     4
7     4
8     5
9     5
...
```

How was this patch tested?

AppVeyor


```diff
@@ -50,7 +50,7 @@ compute <- function(mode, partition, serializer, deserializer, key,
 } else {
   # Check to see if inputData is a valid data.frame
   stopifnot(deserializer == "byte" || deserializer == "arrow")
-  stopifnot(class(inputData) == "data.frame")
+  stopifnot(is.data.frame(inputData))
```
Member Author


In Arrow 0.14, `as.data.frame(arrow_batch)` seems to return something other than a plain R data.frame (its class is `c("tbl_df", "tbl", "data.frame")`), so the direct class comparison no longer works here.
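A minimal sketch of the difference; the tibble-style class vector is constructed by hand here as a stand-in for what `as.data.frame(arrow_batch)` returns in Arrow 0.14:

```r
# A plain list given the tibble-style class vector, standing in for the
# "tbl_df tbl data.frame" object Arrow 0.14 produces.
x <- structure(list(gear = c(4, 4)),
               class = c("tbl_df", "tbl", "data.frame"),
               row.names = c(NA, -2L))

class(x) == "data.frame"  # a length-3 logical vector, not a single TRUE,
                          # so stopifnot(class(x) == "data.frame") errors
is.data.frame(x)          # TRUE: checks inheritance from "data.frame"
```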

@HyukjinKwon HyukjinKwon changed the title [DO-NOT-MERGE][R] Install Arrow and test Arrow optimization in AppVeyor build [SPARK-29339][R] Support Arrow 0.14 in vectorized dapply and gapply (test it in AppVeyor build) Oct 3, 2019
@HyukjinKwon
Member Author

cc @felixcheung and @viirya

Member

@felixcheung felixcheung left a comment


LG

```diff
 - cmd: R -e "install.packages('https://cloud.r-project.org/src/contrib/Archive/testthat/testthat_1.0.2.tar.gz', repos=NULL, type='source')"
-- cmd: R -e "packageVersion('knitr'); packageVersion('rmarkdown'); packageVersion('testthat'); packageVersion('e1071'); packageVersion('survival')"
+- cmd: R -e "packageVersion('knitr'); packageVersion('rmarkdown'); packageVersion('testthat'); packageVersion('e1071'); packageVersion('survival'); packageVersion('arrow')"
```
Member


Should we specify the arrow version to install? Just wondering about compatibility issues.

Member Author


Maybe. It would need to install devtools, which upgrades testthat, so it's a bit tricky to install a specific version for now.

I was thinking of testing the latest version in AppVeyor and an old version on Jenkins anyway, so it should be fine.

Member


SG


```
Rscript -e 'install.packages("arrow", repos="https://cloud.r-project.org/")'
```

If you need to install old versions, they should be installed directly from GitHub. You can use `remotes::install_github` as below.
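For illustration only, a sketch of such an install. The `subdir` and `ref` values are assumptions about the apache/arrow repository layout and tag naming, not taken from this PR:

```shell
# Sketch: install an older arrow R package directly from GitHub.
# "r" subdirectory and "apache-arrow-0.14.1" tag are assumed names.
Rscript -e 'remotes::install_github("apache/arrow", subdir = "r", ref = "apache-arrow-0.14.1")'
```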
Member


If installing old versions, are the `requireNamespace` changes above still valid?

Member Author

@HyukjinKwon HyukjinKwon Oct 3, 2019


I think it's fine without the `requireNamespace` changes. They were a workaround for arrow being missing from CRAN; now that it's on CRAN, the CRAN check can resolve it against SparkR properly.

For users, the CRAN check won't matter anyway, so it should be fine.

I checked locally that the CRAN check passes, and manually verified that `dapply` and `gapply` work.


R/check-cran.sh (outdated)

```diff
@@ -65,6 +65,10 @@ fi

 echo "Running CRAN check with $CRAN_CHECK_OPTIONS options"

+# Remove this environment variable to allow to check suggested packages once
+# Jenkins installs arrow. See SPARK-29339.
+_R_CHECK_FORCE_SUGGESTS_=FALSE
```
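Standalone, the effect of that variable on a CRAN check looks like the sketch below. The tarball name is illustrative, and the `R CMD check` invocation itself is left commented out:

```shell
# Relax R CMD check so that missing *suggested* packages (such as 'arrow')
# do not fail the run; check-cran.sh sets the same variable.
export _R_CHECK_FORCE_SUGGESTS_=FALSE

# R CMD check --as-cran SparkR_3.0.0.tar.gz   # illustrative tarball name
echo "_R_CHECK_FORCE_SUGGESTS_=${_R_CHECK_FORCE_SUGGESTS_}"
```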
Member Author

@HyukjinKwon HyukjinKwon Oct 3, 2019


I manually checked that the CRAN check passes locally without this environment variable.

On Jenkins, it seems to fail because the Arrow R library itself is somehow not found:

Package suggested but not available: 'arrow'

The suggested packages are required for a complete check.
Checking can be attempted without them by setting the environment
variable _R_CHECK_FORCE_SUGGESTS_ to a false value.

I plan to change the minimum version soon (as it's experimental, it might be fine). I will work with Shane to install the Arrow R library on Jenkins after that.

@BryanCutler, BTW, shall we upgrade Arrow in Spark from 0.12 to 0.14 in pom.xml?

Member


Arrow 0.15.0 will be released in a couple of days, so I think we should wait for that. It has an important forward-compatibility fix that will be good to have.


@SparkQA

SparkQA commented Oct 3, 2019

Test build #111735 has finished for PR 25993 at commit 3cdbf56.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@BryanCutler BryanCutler left a comment


LGTM. FYI, Arrow 0.15.0 will be released in a few days, and a 1.0.0 release is planned soon after. I think we should upgrade Java to 0.15.0 and make sure older versions of PyArrow are still OK (I've been doing some tests and I think it should be fine). If there are no issues, we might hold off on increasing the minimum version until 1.0.0 is released, but we can discuss it more later.

@HyukjinKwon
Member Author

The last change is unrelated to AppVeyor test failure (timeout).

Merged to master.

Thanks, @felixcheung, @viirya and @BryanCutler

@HyukjinKwon
Member Author

HyukjinKwon commented Oct 5, 2019

Guys, just FYI, I am preparing some benchmarks and notebooks at https://github.com/HyukjinKwon/spark-notebooks for the Spark Summit.

It has a live notebook with the Arrow R package and PyArrow installed against the latest Spark master. You can try it out right away in a Jupyter notebook here: Binder. It takes a couple of minutes to launch the Jupyter server.

@felixcheung
Member

felixcheung commented Oct 5, 2019 via email

@HyukjinKwon HyukjinKwon deleted the arrow-r-appveyor branch March 3, 2020 01:17
rshkv pushed a commit to palantir/spark that referenced this pull request Jun 24, 2020
…est it in AppVeyor build)

This PR proposes:

1. Use `is.data.frame` to check if it is a DataFrame.
2. to install Arrow and test Arrow optimization in AppVeyor build. We're currently not testing this in CI.

1. To support SparkR with Arrow 0.14
2. To check if there's any regression and if it works correctly.

```r
df <- createDataFrame(mtcars)
collect(dapply(df, function(rdf) { data.frame(rdf$gear + 1) }, structType("gear double")))
```

**Before:**

```
Error in readBin(con, raw(), as.integer(dataLen), endian = "big") :
  invalid 'n' argument
```

**After:**

```
   gear
1     5
2     5
3     5
4     4
5     4
6     4
7     4
8     5
9     5
...
```

AppVeyor

Closes apache#25993 from HyukjinKwon/arrow-r-appveyor.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
rshkv pushed a commit to palantir/spark that referenced this pull request Jun 29, 2020
(same commit message as above)
6 participants