Use arrow format for datasets [ci skip] #382
Conversation
Ah, I learned a valuable lesson. I converted some floating-point values, like …
I'm reluctant to merge this until the relevant Tables and Arrow versions are released, just because depending on #master will break the CIs.

One other thing I've been thinking about that will make some things easier and other things harder is to have each dataset be a separate artifact. This would make adding additional datasets easier, but it means that e.g. `datasets()` will be harder to write.
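Something like this sketch is what I have in mind for the single-artifact case (the artifact name `data`, the file layout, and both function names are just for illustration, not code from this PR):

```julia
using Arrow
using Pkg.Artifacts  # provides the artifact"..." string macro

# With a single artifact holding all the Arrow files, the accessor is a path join.
dataset(name::AbstractString) = Arrow.Table(joinpath(artifact"data", "$name.arrow"))

# Listing the available datasets is one readdir of the artifact directory;
# with one artifact per dataset this would mean enumerating Artifacts.toml instead.
datasets() = [first(splitext(f)) for f in readdir(artifact"data")]
```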
That was my intention. Sorry if I did not make that clear. That is why I used the file name arrow.tar.gz instead of data.tar.gz in the osf.io archive. We can change back to an updated data.tar.gz later if it seems easier to keep a clean archive appearance. I think that Jacob plans to release Arrow 1.0.0 and the updated Tables this week or early next week. Would you be willing to wait for them before releasing v3.0.0? The reason I want to do this is to eliminate indirect dependencies on CategoricalArrays, which runs into compiler problems on both 1.5.2-test and 1.6.0-DEV when assertions are enabled. The failures don't show up in the 1.5.1 release because assertions are not enabled there, but turning off the checks that detect the errors doesn't really make the errors go away.
I think having a separate artifact for each dataset would be overkill. Some of these Arrow files are about 1 KB. I don't think it is a problem to have different versions of the data.tar.gz file accessed by different versions of the MixedModels package. Am I missing something here?
BTW, I still haven't been able to come up with a way of accessing a file in Python using

```python
import pyarrow as pa
df = pa.ipc.open_file(...)
```

because I don't know what …
Perfect. :)
Yes. I think that actually makes good sense in terms of timing of everything.
I saw that. Note that we still have some indirect dependence on …
Nope, not at all. It wasn't a concern about the overall size as much as making it easier to add in individual datasets. Some parts of moving to a new artifact still seem quite brittle, so the thought was "if the old parts don't change when adding a new dataset, then only the new parts can be affected when everything is separate". I was also tending toward the overkill perspective, so then that's easy. 😄
My plan is to work through the tests line-by-line reading from a local cache of arrow files until I can get the tests to run, then upload the revised arrow files and create a new version of arrow.tar.gz. I hope to only need to do the sha256 dance once more (but it probably won't work out that way).
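For reference, the "sha256 dance" for an Artifacts.toml entry is mechanical but easy to get wrong; this is a sketch of the standard recipe (the tarball path here is assumed):

```julia
using Tar, Inflate, SHA

tarball = "arrow.tar.gz"  # assumed local path to the uploaded archive

# An Artifacts.toml entry for a .tar.gz needs two hashes:
sha256_hash = bytes2hex(open(sha256, tarball))                # sha256 of the compressed tarball
tree_hash   = Tar.tree_hash(IOBuffer(inflate_gzip(tarball)))  # git-tree-sha1 of the contents
```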
The current version of Arrow.jl does have the arguments in place to compress the columns when writing an arrow file but, as far as I can see, the argument is just passed around and never really used. The way that I have been reducing the size of the arrow files is to use …
Regarding reading Arrow files in Python: I also can't figure out how to make it work. I'm particularly cranky on this point because the documentation for Arrow itself (not Arrow.jl) makes it clear that Arrow is a memory format, not a disk format. There are several options for disk formats, but Arrow.jl doesn't specify which one it uses. I've tried everything in the PyArrow and Pandas documentation that I can find, but at best I get an error and at worst I get a segfault (!):

```python
import pyarrow as pa

with open("df.arrow", "rb") as f:
    df = pa.ipc.open_file(f)
```

```
WARNING: Logging before InitGoogleLogging() is written to STDERR
F0921 16:33:40.940201 3501032 type.cc:466] Check failed: int_type.is_signed() dictionary index type should be signed integer
*** Check failure stack trace: ***
Aborted (core dumped)
```
@palday It is no big deal to read the Arrow files in Python/Pandas. I thought that if it was easy to do then it would be a good check. It is peculiar that it is not that easy, because Feather and, I think, Arrow were developed by Wes McKinney with Python/Pandas as the preferred implementation. I have succeeded in writing Arrow from R (with the peculiarly named arrow::write_feather) and reading it in Julia, but the subset of Arrow that can be written and read by R is limited. I am fine with just treating Arrow as an archival format for ColumnTable types for the time being.
The error message from the Python code seems to object to the file being written with DictEncoding (categorical or pooled arrays) using unsigned integers, which is the default in PooledArrays. I guess at some point we could try creating PooledArrays with the ref type being, say, Int16 instead of UInt16. It seems peculiar to restrict refs, which are index values, to signed integers.
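If we did want to try that, the ref type can be passed to the constructor, if I recall the PooledArrays API correctly; a minimal sketch with made-up column contents:

```julia
using PooledArrays

# Pass a signed ref type instead of the unsigned default so that the
# dictionary indices in the resulting Arrow file are signed integers.
subj = PooledArray(repeat(["S1", "S2", "S3"], 4), Int16)
eltype(subj.refs)  # Int16
```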
Python doesn't have a native unsigned type (I think -- I've certainly never used it). I'm mostly just grumpy about 'universal' formats being less than universal. I know it's a hard problem, especially if the goal is to make blitting into memory possible. |
The d2463d5 version passes tests locally. We will need to wait for Jacob to release Arrow v1.0.0. I encountered a "round-trip" problem with one of the datasets (apache/arrow-julia#24). We hope that won't be too much of a hang-up.
The idea of 'universal' to me would imply that systems with fewer choices of integer and floating-point representations would make a good-faith effort to convert representations they do not support to representations they do support. Of course, that may not be possible if, say, R gets data in a 64-bit integer representation. I guess we will need to use RCall or PyCall for a while yet when pushing Julia data tables to other systems. I do notice that when I write an arrow file from an R data.frame and then read it into Julia, I usually need to apply …
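The PyCall route could look something like this sketch (the `topandas` helper is hypothetical, not part of the package):

```julia
using PyCall, Tables

const pd = pyimport("pandas")

# Convert a Tables.jl column table to a pandas DataFrame column by column;
# PyCall does the elementwise conversion, sidestepping the on-disk format issues.
topandas(tbl) = pd.DataFrame(Dict(String(k) => collect(v)
                                  for (k, v) in pairs(Tables.columntable(tbl))))
```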
I have checked that this version passes tests when run by hand. @quinnj is making substantial enhancements in …

The last one is particularly impressive because there are over 1 million rows in … I think it will be worthwhile waiting until @quinnj is ready to release Arrow.jl v1.0.0.
I have a new set of …
Jacob has now released Arrow v0.3.0, so I will work on getting this PR ready to go today.
In the tests subdirectory of the Arrow package Jacob has examples of round-tripping Arrow to Python and back. |
Codecov Report
```
@@            Coverage Diff             @@
##           master     #382      +/-   ##
==========================================
+ Coverage   94.90%   95.33%   +0.42%
==========================================
  Files          23       23
  Lines        1591     1606      +15
==========================================
+ Hits         1510     1531      +21
+ Misses         81       75       -6
```
Continue to review full report at Codecov.
This version passes tests, except that it hangs on the v1.6.0-DEV tests. I have a new upload of the datasets to the osf.io site and a version 3 of …

@palday Can we squash and merge this version and worry about uploading new copies of the arrow files later? To me this is the big blocker for a release 3.
```diff
@@ -68,7 +68,8 @@ function LinearMixedModel(
     y, Xs = modelcols(form, tbl)

     y = reshape(float(y), (:, 1)) # y as a floating-point matrix
-    T = eltype(y)
+    T = promote_type(Float64, eltype(y)) # ensure that eltype of model matrices is at least Float64
```
Should we mention this in the docs anywhere? In some areas, there is a tendency to use half precision to speed things up (although that matters less for x86-64 and modern ARM, it can make a difference on GPU, which we of course don't use).
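For the docs, a quick REPL check shows what the promotion does: narrower float types are widened to Float64, while wider element types pass through unchanged.

```julia
julia> promote_type(Float64, Float32)   # single precision is widened
Float64

julia> promote_type(Float64, Float16)   # so is half precision
Float64

julia> promote_type(Float64, BigFloat)  # wider types are preserved
BigFloat
```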
(this is my last comment, then I'm happy to merge)
- The datasets are returned as an `Arrow.Table` instead of a `DataFrame`.
- Tests convert the `Arrow.Table` to a `DataFrame` when a test expects a `DataFrame`; `DataFrames` is listed as a dependency for `test` but not for `MixedModels` itself.
- `add Tables#master` and `add https://github.com/JuliaData/Arrow.jl#master` before using this package, if you want to use the datasets (see the Pkg sketch below).
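Spelled out with the Pkg API, the setup in the last item would be something like this sketch (equivalent to the Pkg-REPL `add` commands above):

```julia
using Pkg

# Track the #master branches until the Tables and Arrow releases land.
Pkg.add(PackageSpec(name="Tables", rev="master"))
Pkg.add(PackageSpec(url="https://github.com/JuliaData/Arrow.jl", rev="master"))
```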