-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-5505: [R] Normalize file and class names, stop masking base R functions, add vignette, improve documentation #5279
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5279 +/- ##
=========================================
Coverage ? 76.06%
=========================================
Files ? 56
Lines ? 3572
Branches ? 0
=========================================
Hits ? 2717
Misses ? 855
Partials ? 0
Continue to review full report at Codecov.
|
1b604b9
to
abb3d5d
Compare
I can't speak about this PR, but as a R neophyte, I find the quoted "vignette" quite useful. |
In order to show the documentation changes, I published the pkgdown site from this branch at https://enpiar.com/arrow-r-site/. Decent examples include https://enpiar.com/arrow-r-site/reference/ParquetFileReader.html and https://enpiar.com/arrow-r-site/reference/FeatherTableReader.html |
Also preview the vignette at https://enpiar.com/arrow-r-site/articles/arrow.html |
fc78832
to
e555008
Compare
e555008
to
0b866c3
Compare
I don't know what's up with that Ursabot builder but I'm wondering if it's related to d7ef11f ? While this patch is big, it's pretty superficial and shouldn't affect anything build/path related. Travis and Appveyor have been steadily passing on this branch, and those are the ones I pay attention to. |
Could be related, but it's definitely weird because this is passing on master. Maybe rebase? Note that you're mixing underscore- and hyphen-based file names for the R files, do you want to make them consistent? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is huge indeed, thanks for doing this. Few comments to follow up on our conversation:
-
R does not have a way to nest namespaces, so we can't have e.g.
arrow::csv::FileReader
on the R side, the choice of squashing the namespace in the name, camel casing it along the way looks "fine for now" because there aren't many cases, and it's better looking than faux namespacing as inarrow::csv_FileReader
. I'm just worried about the future proofing of this, it makes it somewhat harder to map C++ class names with their R stand ins. Given classes are not used "directly" most of the time, maybe going with predictable faux namespacing is more useful 🤷♂ ? -
I guess
DataType$dispatch()
should beDataType$create()
as well.
r/R/chunked-array.R
Outdated
assert_that(inherits(target_type, "arrow::DataType")) | ||
assert_that(inherits(options, "arrow::compute::CastOptions")) | ||
shared_ptr(`arrow::ChunkedArray`, ChunkedArray__cast(self, target_type, options)) | ||
assert_that(inherits(target_type, "DataType")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could leave the fully class names ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean "arrow::DataType"
as the R6class(classname=)
r/R/chunked-array.R
Outdated
} | ||
), | ||
active = list( | ||
null_count = function() ChunkedArray__null_count(self), | ||
num_chunks = function() ChunkedArray__num_chunks(self), | ||
chunks = function() map(ChunkedArray__chunks(self), ~ `arrow::Array`$dispatch(.x)), | ||
type = function() `arrow::DataType`$dispatch(ChunkedArray__type(self)) | ||
chunks = function() map(ChunkedArray__chunks(self), ~ Array$create(.x)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
chunks = function() map(ChunkedArray__chunks(self), ~ Array$create(.x)), | |
chunks = function() map(ChunkedArray__chunks(self), Array$create), |
That probably works as well
I understand the buildbot failure now: because there is a vignette, Trying to confirm locally now. |
… start updating pkgdown
0b866c3
to
adf1cf9
Compare
It seems that "Ursabot / AMD64 Ubuntu 18.04 R" is failed since this merge: https://ci.ursalabs.org/#/builders/96/builds/1246 |
@kou I believe I fixed that in voltrondata-labs/ursabot#176, but maybe something has to happen for that change to be deployed? |
Thanks! |
…unctions, add vignette, improve documentation The main thrust of the changes are summarized in the new vignette: > C++ is an object-oriented language, so the core logic of the Arrow library is encapsulated in classes and methods. In the R package, these classes are implemented as `R6` reference classes, most of which are exported from the namespace. > > In order to match the C++ naming conventions, the `R6` classes are in TitleCase, e.g. `RecordBatch`. This makes it easy to look up the relevant C++ implementations in the [code](https://github.com/apache/arrow/tree/master/cpp) or [documentation](https://arrow.apache.org/docs/cpp/). To simplify things in R, the C++ library namespaces are generally dropped or flattened; that is, where the C++ library has `arrow::io::FileOutputStream`, it is just `FileOutputStream` in the R package. One exception is for the file readers, where the namespace is necessary to disambiguate. So `arrow::csv::TableReader` becomes `CsvTableReader`, and `arrow::json::TableReader` becomes `JsonTableReader`. > > Some of these classes are not meant to be instantiated directly; they may be base classes or other kinds of helpers. For those that you should be able to create, use the `$create()` method to instantiate an object. For example, `rb <- RecordBatch$create(int = 1:10, dbl = as.numeric(1:10))` will create a `RecordBatch`. Many of these factory methods that an R user might most often encounter also have a `snake_case` alias, in order to be more familiar for contemporary R users. So `record_batch(int = 1:10, dbl = as.numeric(1:10))` would do the same as `RecordBatch$create()` above. > > The typical user of the `arrow` R package may never deal directly with the `R6` objects. We provide more R-friendly wrapper functions as a higher-level interface to the C++ library. An R user can call `read_parquet()` without knowing or caring that they're instantiating a `ParquetFileReader` object and calling the `$ReadFile()` method on it. The classes are there and available to the advanced programmer who wants fine-grained control over how the C++ library is used. There are a few other fixes and cleanups rolled in here, named in the individual commit messages below. I stopped short of more documentation consolidation because (1) this patch is already huge and (2) `R6` classes are really tedious to document because it's all manual. I did some searching around and found open issues from 2014 and 2015 about supporting R6 better in roxygen2. Closes apache#5279 from nealrichardson/cleaner-class-names and squashes the following commits: 3c6f85b <Neal Richardson> 🐀 22c9d04 <Neal Richardson> More doc cleaning 01084ce <Neal Richardson> Factor out assert_is() caf3265 <Neal Richardson> PR feedback from romain adf1cf9 <Neal Richardson> File renaming (not case-sensitive) 35f00f5 <Neal Richardson> Rename Table.R to table.R 8bd52d7 <Neal Richardson> Rename Struct.R to struct.R 358290b <Neal Richardson> Rename Schema.R to schema.R 924edd1 <Neal Richardson> Rename List.R to list.R 0150d99 <Neal Richardson> Rename Field.R to field.R 8683f10 <Neal Richardson> Add content to vignette from blog post e6b75f4 <Neal Richardson> Consolidate and document reader/writer classes; also fix ARROW-6449 495abf6 <Neal Richardson> Fill in documentation and standardize file naming 5fd49ef <Neal Richardson> Fix check failures 96873e1 <Neal Richardson> Factor out make_readable_file 3e4cfe7 <Neal Richardson> Clean up parquet classes and document the R6 85a8d36 <Neal Richardson> Start vignette draft explaining the class and naming conventions 71cac57 <Neal Richardson> Clean up Rd file names, experiment with documenting constructors, and start updating pkgdown 2d1b738 <Neal Richardson> Replace table() with Table() b694511 <Neal Richardson> Remove defunct Column class 730313e <Neal Richardson> One more find/replace, esp. RecordBatch* 702a0b1 <Neal Richardson> Message 365fedc <Neal Richardson> feather 0e7877b <Neal Richardson> Drop ::ipc:: 55607a6 <Neal Richardson> json 9bd708f <Neal Richardson> csv fbebf27 <Neal Richardson> io 1711d3e <Neal Richardson> CastOptions 12031ad <Neal Richardson> Backfill some methods 4075897 <Neal Richardson> compression 3b4b492 <Neal Richardson> ChunkedArray bbf0799 <Neal Richardson> Buffer 3f1cd71 <Neal Richardson> Object 9fbecda <Neal Richardson> A few more backticks 8edf085 <Neal Richardson> Remove more backticks 1f6d154 <Neal Richardson> Replace array() with Array() 9f52490 <Neal Richardson> Progress commit renaming Array Authored-by: Neal Richardson <[email protected]> Signed-off-by: Neal Richardson <[email protected]>
The main thrust of the changes are summarized in the new vignette:
There are a few other fixes and cleanups rolled in here, named in the individual commit messages below.
I stopped short of more documentation consolidation because (1) this patch is already huge and (2)
R6
classes are really tedious to document because it's all manual. I did some searching around and found open issues from 2014 and 2015 about supporting R6 better in roxygen2.