Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-5505: [R] Normalize file and class names, stop masking base R functions, add vignette, improve documentation #5279

Closed

Conversation

nealrichardson
Copy link
Member

@nealrichardson nealrichardson commented Sep 4, 2019

The main thrust of the changes are summarized in the new vignette:

C++ is an object-oriented language, so the core logic of the Arrow library is encapsulated in classes and methods. In the R package, these classes are implemented as R6 reference classes, most of which are exported from the namespace.

In order to match the C++ naming conventions, the R6 classes are in TitleCase, e.g. RecordBatch. This makes it easy to look up the relevant C++ implementations in the code or documentation. To simplify things in R, the C++ library namespaces are generally dropped or flattened; that is, where the C++ library has arrow::io::FileOutputStream, it is just FileOutputStream in the R package. One exception is for the file readers, where the namespace is necessary to disambiguate. So arrow::csv::TableReader becomes CsvTableReader, and arrow::json::TableReader becomes JsonTableReader.

Some of these classes are not meant to be instantiated directly; they may be base classes or other kinds of helpers. For those that you should be able to create, use the $create() method to instantiate an object. For example, rb <- RecordBatch$create(int = 1:10, dbl = as.numeric(1:10)) will create a RecordBatch. Many of these factory methods that an R user might most often encounter also have a snake_case alias, in order to be more familiar for contemporary R users. So record_batch(int = 1:10, dbl = as.numeric(1:10)) would do the same as RecordBatch$create() above.

The typical user of the arrow R package may never deal directly with the R6 objects. We provide more R-friendly wrapper functions as a higher-level interface to the C++ library. An R user can call read_parquet() without knowing or caring that they're instantiating a ParquetFileReader object and calling the $ReadFile() method on it. The classes are there and available to the advanced programmer who wants fine-grained control over how the C++ library is used.

There are a few other fixes and cleanups rolled in here, named in the individual commit messages below.

I stopped short of more documentation consolidation because (1) this patch is already huge and (2) R6 classes are really tedious to document because it's all manual. I did some searching around and found open issues from 2014 and 2015 about supporting R6 better in roxygen2.

@codecov-io
Copy link

codecov-io commented Sep 5, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@4f7ead8). Click here to learn what that means.
The diff coverage is 81.58%.

Impacted file tree graph

@@            Coverage Diff            @@
##             master    #5279   +/-   ##
=========================================
  Coverage          ?   76.06%           
=========================================
  Files             ?       56           
  Lines             ?     3572           
  Branches          ?        0           
=========================================
  Hits              ?     2717           
  Misses            ?      855           
  Partials          ?        0
Impacted Files Coverage Δ
r/R/enums.R 0% <ø> (ø)
r/R/arrow-package.R 63.63% <ø> (ø)
r/R/memory_pool.R 0% <0%> (ø)
r/src/recordbatch.cpp 86.5% <100%> (ø)
r/R/list.R 100% <100%> (ø)
r/src/array_from_vector.cpp 78.89% <100%> (ø)
r/R/compute.R 100% <100%> (ø)
r/R/compression.R 100% <100%> (ø)
r/R/json.R 100% <100%> (ø)
r/R/buffer.R 81.81% <100%> (ø)
... and 22 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4f7ead8...0b866c3. Read the comment docs.

@nealrichardson nealrichardson marked this pull request as ready for review September 6, 2019 23:18
@nealrichardson nealrichardson changed the title WIP ARROW-5505: [R] Stop masking base R functions (draft 3) ARROW-5505: [R] Stop masking base R functions (draft 3) Sep 6, 2019
@pitrou
Copy link
Member

pitrou commented Sep 9, 2019

I can't speak about this PR, but as a R neophyte, I find the quoted "vignette" quite useful.

@nealrichardson
Copy link
Member Author

In order to show the documentation changes, I published the pkgdown site from this branch at https://enpiar.com/arrow-r-site/. Decent examples include https://enpiar.com/arrow-r-site/reference/ParquetFileReader.html and https://enpiar.com/arrow-r-site/reference/FeatherTableReader.html

@nealrichardson
Copy link
Member Author

Also preview the vignette at https://enpiar.com/arrow-r-site/articles/arrow.html

@wesm wesm changed the title ARROW-5505: [R] Stop masking base R functions (draft 3) ARROW-5505: [R] Normalize file and class names, stop masking base R functions, add vignette, improve documentation Sep 9, 2019
@nealrichardson
Copy link
Member Author

I don't know what's up with that Ursabot builder but I'm wondering if it's related to d7ef11f ? While this patch is big, it's pretty superficial and shouldn't affect anything build/path related. Travis and Appveyor have been steadily passing on this branch, and those are the ones I pay attention to.

@wesm
Copy link
Member

wesm commented Sep 10, 2019

Could be related, but it's definitely weird because this is passing on master. Maybe rebase?

Note that you're mixing underscore- and hyphen-based file names for the R files, do you want to make them consistent?

Copy link
Contributor

@romainfrancois romainfrancois left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is huge indeed, thanks for doing this. Few comments to follow up on our conversation:

  • R does not have a way to nest namespaces, so we can't have e.g. arrow::csv::FileReader on the R side, the choice of squashing the namespace in the name, camel casing it along the way looks "fine for now" because there aren't many cases, and it's better looking than faux namespacing as in arrow::csv_FileReader. I'm just worried about the future proofing of this, it makes it somewhat harder to map C++ class names with their R stand ins. Given classes are not used "directly" most of the time, maybe going with predictable faux namespacing is more useful 🤷‍♂ ?

  • I guess DataType$dispatch() should be DataType$create() as well.

assert_that(inherits(target_type, "arrow::DataType"))
assert_that(inherits(options, "arrow::compute::CastOptions"))
shared_ptr(`arrow::ChunkedArray`, ChunkedArray__cast(self, target_type, options))
assert_that(inherits(target_type, "DataType"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could leave the fully class names ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean "arrow::DataType" as the R6class(classname=)

}
),
active = list(
null_count = function() ChunkedArray__null_count(self),
num_chunks = function() ChunkedArray__num_chunks(self),
chunks = function() map(ChunkedArray__chunks(self), ~ `arrow::Array`$dispatch(.x)),
type = function() `arrow::DataType`$dispatch(ChunkedArray__type(self))
chunks = function() map(ChunkedArray__chunks(self), ~ Array$create(.x)),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
chunks = function() map(ChunkedArray__chunks(self), ~ Array$create(.x)),
chunks = function() map(ChunkedArray__chunks(self), Array$create),

That probably works as well

@nealrichardson
Copy link
Member Author

I understand the buildbot failure now: because there is a vignette, R CMD build is installing the package to build the vignette, and that's failing because LD_LIBRARY_PATH isn't set in the build command, only the check command: https://github.com/ursa-labs/ursabot/blob/master/projects/arrow/arrow/builders.py#L262

Trying to confirm locally now.

@kou
Copy link
Member

kou commented Sep 11, 2019

It seems that "Ursabot / AMD64 Ubuntu 18.04 R" is failed since this merge: https://ci.ursalabs.org/#/builders/96/builds/1246

@nealrichardson
Copy link
Member Author

@kou I believe I fixed that in voltrondata-labs/ursabot#176, but maybe something has to happen for that change to be deployed?

@kou
Copy link
Member

kou commented Sep 11, 2019

Thanks!

pprudhvi pushed a commit to pprudhvi/arrow that referenced this pull request Sep 16, 2019
…unctions, add vignette, improve documentation

The main thrust of the changes are summarized in the new vignette:

> C++ is an object-oriented language, so the core logic of the Arrow library is encapsulated in classes and methods. In the R package, these classes are implemented as `R6` reference classes, most of which are exported from the namespace.
>
> In order to match the C++ naming conventions, the `R6` classes are in TitleCase, e.g. `RecordBatch`. This makes it easy to look up the relevant C++ implementations in the [code](https://github.com/apache/arrow/tree/master/cpp) or [documentation](https://arrow.apache.org/docs/cpp/). To simplify things in R, the C++ library namespaces are generally dropped or flattened; that is, where the C++ library has `arrow::io::FileOutputStream`, it is just `FileOutputStream` in the R package. One exception is for the file readers, where the namespace is necessary to disambiguate. So `arrow::csv::TableReader` becomes `CsvTableReader`, and `arrow::json::TableReader` becomes `JsonTableReader`.
>
> Some of these classes are not meant to be instantiated directly; they may be base classes or other kinds of helpers. For those that you should be able to create, use the `$create()` method to instantiate an object. For example, `rb <- RecordBatch$create(int = 1:10, dbl = as.numeric(1:10))` will create a `RecordBatch`. Many of these factory methods that an R user might most often encounter also have a `snake_case` alias, in order to be more familiar for contemporary R users. So `record_batch(int = 1:10, dbl = as.numeric(1:10))` would do the same as `RecordBatch$create()` above.
>
> The typical user of the `arrow` R package may never deal directly with the `R6` objects. We provide more R-friendly wrapper functions as a higher-level interface to the C++ library. An R user can call `read_parquet()` without knowing or caring that they're instantiating a `ParquetFileReader` object and calling the `$ReadFile()` method on it. The classes are there and available to the advanced programmer who wants fine-grained control over how the C++ library is used.

There are a few other fixes and cleanups rolled in here, named in the individual commit messages below.

I stopped short of more documentation consolidation because (1) this patch is already huge and (2) `R6` classes are really tedious to document because it's all manual. I did some searching around and found open issues from 2014 and 2015 about supporting R6 better in roxygen2.

Closes apache#5279 from nealrichardson/cleaner-class-names and squashes the following commits:

3c6f85b <Neal Richardson> 🐀
22c9d04 <Neal Richardson> More doc cleaning
01084ce <Neal Richardson> Factor out assert_is()
caf3265 <Neal Richardson> PR feedback from romain
adf1cf9 <Neal Richardson> File renaming (not case-sensitive)
35f00f5 <Neal Richardson> Rename Table.R to table.R
8bd52d7 <Neal Richardson> Rename Struct.R to struct.R
358290b <Neal Richardson> Rename Schema.R to schema.R
924edd1 <Neal Richardson> Rename List.R to list.R
0150d99 <Neal Richardson> Rename Field.R to field.R
8683f10 <Neal Richardson> Add content to vignette from blog post
e6b75f4 <Neal Richardson> Consolidate and document reader/writer classes; also fix ARROW-6449
495abf6 <Neal Richardson> Fill in documentation and standardize file naming
5fd49ef <Neal Richardson> Fix check failures
96873e1 <Neal Richardson> Factor out make_readable_file
3e4cfe7 <Neal Richardson> Clean up parquet classes and document the R6
85a8d36 <Neal Richardson> Start vignette draft explaining the class and naming conventions
71cac57 <Neal Richardson> Clean up Rd file names, experiment with documenting constructors, and start updating pkgdown
2d1b738 <Neal Richardson> Replace table() with Table()
b694511 <Neal Richardson> Remove defunct Column class
730313e <Neal Richardson> One more find/replace, esp. RecordBatch*
702a0b1 <Neal Richardson> Message
365fedc <Neal Richardson> feather
0e7877b <Neal Richardson> Drop ::ipc::
55607a6 <Neal Richardson> json
9bd708f <Neal Richardson> csv
fbebf27 <Neal Richardson> io
1711d3e <Neal Richardson> CastOptions
12031ad <Neal Richardson> Backfill some  methods
4075897 <Neal Richardson> compression
3b4b492 <Neal Richardson> ChunkedArray
bbf0799 <Neal Richardson> Buffer
3f1cd71 <Neal Richardson> Object
9fbecda <Neal Richardson> A few more backticks
8edf085 <Neal Richardson> Remove more backticks
1f6d154 <Neal Richardson> Replace array() with Array()
9f52490 <Neal Richardson> Progress commit renaming Array

Authored-by: Neal Richardson <[email protected]>
Signed-off-by: Neal Richardson <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants