Use arrow format for datasets [ci skip] #382

Merged: dmbates merged 4 commits into master from arrowconversion on Oct 5, 2020

Conversation

@dmbates (Collaborator) commented Sep 20, 2020

  • Closes #381 (Switch to Arrow storage format for TestData)
  • Convert the saved datasets to the Arrow format.
  • CI tests are skipped because some of them assume DataFrame representations of the datasets. Two possible solutions are:
    • change the tests to accept an Arrow.Table instead of a DataFrame
    • keep the tests as they are but convert from Arrow.Table to DataFrame when a test expects a DataFrame (see the sketch after this list)
    • note that DataFrames is listed as a test dependency but not a dependency of MixedModels itself
  • At present the user must manually add Tables#master and https://github.com/JuliaData/Arrow.jl#master before using this package, if they want to use the datasets.
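For the second option, a minimal sketch of the test-side conversion; the arrowfile helper and testdata_dir are hypothetical stand-ins for whatever accessor the package ends up using:

using Arrow, DataFrames, Test

# hypothetical helper: locate an .arrow file in the unpacked test-data artifact
arrowfile(name) = joinpath(testdata_dir, string(name, ".arrow"))

# hand the existing tests a DataFrame, as they currently expect
sleepstudy = DataFrame(Arrow.Table(arrowfile(:sleepstudy)))

@test sleepstudy isa DataFrame
@test nrow(sleepstudy) == 180      # 18 subjects × 10 days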

@dmbates dmbates requested a review from palday September 20, 2020 16:16
@dmbates (Collaborator Author) commented Sep 20, 2020

Ah, I learned a valuable lesson. I converted some floating-point values, like sleepstudy.reaction, from Float64 to Float32 to save space but, in the process, lost enough accuracy that tests started failing. I will need to create a new Artifacts.toml, but for the time being I will just upload the modified file to the Arrow folder on osf.io.
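For reference, the size of the rounding error involved (the value below is made up, not taken from the dataset):

x = 251.8051                # hypothetical Reaction-like value
y = Float64(Float32(x))     # round-trip through Float32
x == y                      # false
eps(Float32(x))             # ≈ 1.5e-5, so only about 7 significant digits survive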

@palday (Member) left a comment

I'm reluctant to merge this until the relevant Tables and Arrow versions are released, just because depending on #master will break the CIs.

One other thing I've been thinking about that will both make some things easier and other things harder is to have each dataset be a separate artifact. This will make adding additional datasets easier, but it means that e.g. datasets() will be harder to write.

@dmbates (Collaborator Author) commented Sep 21, 2020

> I'm reluctant to merge this until the relevant Tables and Arrow versions are released, just because depending on #master will break the CIs.

That was my intention. Sorry if I did not make that clear.

That is why I used the file name arrow.tar.gz instead of data.tar.gz in the osf.io archive. We can change back to an updated data.tar.gz later if it seems easier to keep a clean archive appearance.

I think that Jacob plans to release Arrow 1.0.0 and the updated Tables this week or early next week. Would you be willing to wait for them before releasing v3.0.0?

The reason why I want to do this is to eliminate indirect dependencies on CategoricalArrays, which is encountering problems with the compiler in both 1.5.2-test and 1.6.0-DEV, when assertions are enabled. It doesn't encounter failures in the release version 1.5.1 because assertions are not enabled in the release version but turning off checks to avoid errors doesn't really make the errors go away.

> One other thing I've been thinking about that will both make some things easier and other things harder is to have each dataset be a separate artifact. This will make adding additional datasets easier, but it means that e.g. datasets() will be harder to write.

I think having a separate artifact for each dataset would be overkill. Some of these Arrow files are about 1Kb. I don't think it is a problem to have different versions of the data.tar.gz file accessed by different versions of the MixedModels package. Am I missing something here?

@dmbates (Collaborator Author) commented Sep 21, 2020

BTW, I still haven't been able to come up with a way of accessing a file in Python using

import pyarrow as pa
df = pa.ipc.open_file(...)

because I don't know what open_file expects for its first argument when reading from a file.

@palday (Member) commented Sep 21, 2020

> That is why I used the file name arrow.tar.gz instead of data.tar.gz in the osf.io archive. We can change back to an updated data.tar.gz later if it seems easier to keep a clean archive appearance.

Perfect. :)

> I think that Jacob plans to release Arrow 1.0.0 and the updated Tables this week or early next week. Would you be willing to wait for them before releasing v3.0.0?

Yes. I think that actually makes good sense in terms of timing of everything.

> The reason why I want to do this is to eliminate indirect dependencies on CategoricalArrays, which is encountering problems with the compiler in both 1.5.2-test and 1.6.0-DEV, when assertions are enabled. It doesn't encounter failures in the release version 1.5.1 because assertions are not enabled in the release version but turning off checks to avoid errors doesn't really make the errors go away.

I saw that. Note that we still have some indirect dependence on CategoricalArrays via StatsModels and how things work for contrasts (although I think those code paths can be avoided by explicitly setting contrasts for grouping terms).

> One other thing I've been thinking about that will both make some things easier and other things harder is to have each dataset be a separate artifact. This will make adding additional datasets easier, but it means that e.g. datasets() will be harder to write.

> I think having a separate artifact for each dataset would be overkill. Some of these Arrow files are about 1Kb. I don't think it is a problem to have different versions of the data.tar.gz file accessed by different versions of the MixedModels package. Am I missing something here?

Nope, not at all. It wasn't a concern about the overall size as much as making it easier to add in individual datasets. Some parts of moving to a new artifact still seem quite brittle, so the thought was "if the old parts don't change when adding a new dataset, then only the new parts can be affected when everything is separate". I was also tending to the overkill perspective, so then that's easy. 😄

@dmbates (Collaborator Author) commented Sep 21, 2020

My plan is to work through the tests line by line, reading from a local cache of arrow files, until I can get the tests to run, then upload the revised arrow files and create a new version of arrow.tar.gz. I hope to only need to do the sha256 dance once more (but it probably won't work out that way).
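For context, the "sha256 dance" amounts to something like the following; the tarball name, URL, and artifact name are placeholders rather than the actual osf.io details:

using Pkg.Artifacts, SHA

tarball = "arrow.tar.gz"                      # hypothetical local tarball of the .arrow files
url     = "https://osf.io/XXXXX/download"     # placeholder download URL

# git-tree-sha1 of the unpacked contents
tree_hash = create_artifact() do dir
    run(`tar -xzf $tarball -C $dir`)
end

# sha256 of the tarball itself, needed for the download stanza
tarball_sha256 = bytes2hex(open(sha256, tarball))

bind_artifact!("Artifacts.toml", "TestData", tree_hash;
               download_info=[(url, tarball_sha256)], force=true)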

@dmbates (Collaborator Author) commented Sep 21, 2020

The current version of Arrow.jl does have an argument in place to compress the columns when writing an arrow file but, as far as I can see, that argument is just passed around and never really used. The way I have been reducing the size of the arrow files is to use PooledArray, which chooses the smallest size of unsigned integer that can be used for the refs, and by manually recoding integers to smaller sizes that preserve the information. As I mentioned above, switching from Float64 to Float32 risks losing accuracy and having tests fail, so I think I will avoid that.
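As an illustration of that approach (the column names and values here are made up, not taken from the actual datasets):

using PooledArrays, Arrow

subj = PooledArray(repeat(string.("S", 301:318), inner=10))  # 18 levels, so the refs need only a small unsigned type
days = Int8.(repeat(0:9, outer=18))                          # manually narrowed integer codes

Arrow.write("example.arrow", (subj = subj, days = days))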

@palday (Member) commented Sep 21, 2020

Regarding reading Arrow files in Python: I also can't figure out how to make it work. I'm particularly cranky on this point because the documentation for Arrow itself (not Arrow.jl) makes it clear that Arrow is a memory format, not a disk format. There are several options for disk formats, but Arrow.jl doesn't specify which one it uses. I've tried everything in the PyArrow and Pandas documentation that I can find, but at best I get an error and at worst I get a segfault (!):

import pyarrow as pa

with open("df.arrow","rb") as f:
    df = pa.ipc.open_file(f)

WARNING: Logging before InitGoogleLogging() is written to STDERR
F0921 16:33:40.940201 3501032 type.cc:466]  Check failed: int_type.is_signed() dictionary index type should be signed integer
*** Check failure stack trace: ***
Aborted (core dumped)

@dmbates (Collaborator Author) commented Sep 21, 2020

@palday Being able to read the Arrow files in Python/Pandas is not a big deal. I thought that if it were easy to do, it would be a good check. It is peculiar that it is not that easy, because Feather and, I think, Arrow were developed by Wes McKinney with Python/Pandas as the preferred implementation.

I have succeeded in writing Arrow from R (with the peculiarly named arrow::write_feather) and reading it in Julia, but the subset of Arrow that can be written and read by R is limited. I am fine with just treating Arrow as an archival format for ColumnTable types for the time being.

@dmbates (Collaborator Author) commented Sep 21, 2020

The error message from the Python code seems to object to the file being written with DictEncoding (categorical or pooled arrays) using unsigned integers, which is the default in PooledArrays. I guess at some time we could try creating PooledArrays with the ref type being, say, Int16 instead of UInt16. Seems peculiar to restrict refs, which are index values, to signed integers.
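A sketch of what that might look like, assuming the PooledArray constructor that takes an explicit ref type (not verified against the PooledArrays version in use here):

using PooledArrays

v  = repeat(["a", "b", "c"], 100)
pv = PooledArray(v, Int16)     # signed refs instead of the unsigned default
eltype(pv.refs)                # Int16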

@palday (Member) commented Sep 21, 2020

Python doesn't have a native unsigned type (I think -- I've certainly never used it).

I'm mostly just grumpy about 'universal' formats being less than universal. I know it's a hard problem, especially if the goal is to make blitting into memory possible.

@dmbates (Collaborator Author) commented Sep 22, 2020

The d2463d5 version passes tests locally. We will need to wait for Jacob to release Arrow v1.0.0. I encountered a "round-trip" problem with one of the datasets (apache/arrow-julia#24). We hope that won't be too much of a hang-up.

@dmbates (Collaborator Author) commented Sep 22, 2020

> Python doesn't have a native unsigned type (I think -- I've certainly never used it).

> I'm mostly just grumpy about 'universal' formats being less than universal. I know it's a hard problem, especially if the goal is to make blitting into memory possible.

The idea of 'universal' to me would imply that systems with fewer choices of integer and floating-point representations would make a good faith effort to convert representations they do not support to representations they do support. Of course, that may not be possible if, say, R gets data in a 64-bit integer representation. I guess we will need to use RCall or PyCall for a while still when pushing Julia data tables to other systems.

I do notice that when I write an arrow file from an R data.frame and then read it into Julia, I usually need to apply disallowmissing! to it.
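For example (the file name is hypothetical):

using Arrow, DataFrames

df = DataFrame(Arrow.Table("from_R.arrow"))
disallowmissing!(df)    # columns written by R tend to arrive as Union{T, Missing}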

@dmbates (Collaborator Author) commented Sep 29, 2020

I have checked that this version passes tests when run by hand. @quinnj is making substantial enhancements in Arrow.jl to allow for compression. The Zstd compression works well with some of our larger data sets:

-rw-rw-r-- 1 bates bates  585890 Sep 29 10:51 d3a.arrow
-rw-rw-r-- 1 bates bates  844314 Sep 19 18:19 d3.arrow
-rw-rw-r-- 1 bates bates  200570 Sep 29 10:55 instevala.arrow
-rw-rw-r-- 1 bates bates  700874 Sep 19 18:57 insteval.arrow
-rw-rw-r-- 1 bates bates 5080346 Sep 19 21:25 ml1m.arrow
-rw-rw-r-- 1 bates bates 1623002 Sep 27 12:06 ml1mb.arrow

The last one is particularly impressive: there are over 1 million rows in ml1m (a MovieLens sample of 1 million ratings), each row a total of 5 bytes plus overhead, and it gets compressed to 1.6 MB.

I think it will be worthwhile waiting until @quinnj is ready to release Arrow.jl v1.0.0.

@dmbates (Collaborator Author) commented Oct 2, 2020

I have a new set of .arrow files ready to go. These use the :zstd compression on the larger files. They require the master branch of https://github.com/JuliaData/Arrow.jl to perform the decompression. I plan to wait until Jacob releases Arrow v1.0.0 before uploading them and recreating Artifacts.toml.
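Writing the compressed files is then along these lines (the table and file names are placeholders, and the exact keyword may differ between Arrow.jl versions):

using Arrow

Arrow.write("insteval.arrow", insteval; compress = :zstd)   # larger tables get zstd compression
Arrow.write("dyestuff.arrow", dyestuff)                     # small tables are left uncompressed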

@dmbates (Collaborator Author) commented Oct 4, 2020

Jacob has now released Arrow v0.3.0 so I will work on getting this PR ready to go today.

@dmbates (Collaborator Author) commented Oct 4, 2020

> BTW, I still haven't been able to come up with a way of accessing a file in Python using
>
> import pyarrow as pa
> df = pa.ipc.open_file(...)
>
> because I don't know what open_file expects for its first argument when reading from a file.

In the tests subdirectory of the Arrow package, Jacob has examples of round-tripping Arrow to Python and back.

@codecov (bot) commented Oct 4, 2020

Codecov Report

Merging #382 into master will increase coverage by 0.42%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #382      +/-   ##
==========================================
+ Coverage   94.90%   95.33%   +0.42%     
==========================================
  Files          23       23              
  Lines        1591     1606      +15     
==========================================
+ Hits         1510     1531      +21     
+ Misses         81       75       -6     
Impacted Files Coverage Δ
src/MixedModels.jl 100.00% <ø> (ø)
src/linearmixedmodel.jl 98.87% <100.00%> (-0.04%) ⬇️
src/utilities.jl 98.95% <100.00%> (+0.20%) ⬆️
src/femat.jl 100.00% <0.00%> (ø)
src/bootstrap.jl 98.09% <0.00%> (ø)
src/randomeffectsterm.jl 94.11% <0.00%> (ø)
src/remat.jl 95.37% <0.00%> (+0.11%) ⬆️
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@dmbates (Collaborator Author) commented Oct 4, 2020

This version passes tests, except that it hangs on the v1.6.0-DEV tests. I have a new upload of the datasets to the osf.io site and a version 3 of arrow.tar.gz, but using that one fails. The d3.arrow file seems to cause problems with the levels of one of the factors, h; trying to display the table results in an out-of-memory error.

@palday Can we squash and merge this version and worry about uploading new copies of the arrow files later? To me this is the big blocker for release 3.

@@ -68,7 +68,8 @@ function LinearMixedModel(
     y, Xs = modelcols(form, tbl)

     y = reshape(float(y), (:, 1)) # y as a floating-point matrix
-    T = eltype(y)
+    T = promote_type(Float64, eltype(y)) # ensure that eltype of model matrices is at least Float64
@palday (Member) commented on the diff

Should we mention this in the docs anywhere? In some areas, there is a tendency to use half precision to speed things up (although that matters less for x86-64 and modern ARM, it can make a difference on GPU, which we of course don't use).
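For what it's worth, the promotion means the model matrices end up in at least double precision, whatever the stored eltype of the response:

promote_type(Float64, Float32)    # Float64 — single precision is promoted up
promote_type(Float64, BigFloat)   # BigFloat — higher precision is preserved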

@palday (Member) commented

(this is my last comment, then I'm happy to merge)

@dmbates dmbates merged commit b7402e2 into master Oct 5, 2020
@dmbates dmbates deleted the arrowconversion branch October 5, 2020 14:32