Use arrow format for datasets [ci skip] #382

Merged: dmbates merged 4 commits into master from arrowconversion on Oct 5, 2020

Conversation

@dmbates (Collaborator) commented Sep 20, 2020

  • Closes #381 (Switch to Arrow storage format for TestData)
  • Convert the saved datasets to the Arrow format.
  • CI tests are skipped because some of them assume DataFrame representations of the datasets. Two possible solutions are:
    • change the tests to accept an Arrow.Table instead of a DataFrame
    • keep the tests as they are but convert from Arrow.Table to DataFrame when a test expects a DataFrame (see the sketch after this list)
    • note that DataFrames is listed as a test dependency but not a dependency of MixedModels itself
  • At present the user must manually add Tables#master and https://github.com/JuliaData/Arrow.jl#master before using this package, if they want to use the datasets.
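For the second option, a minimal sketch of the test-side conversion; the arrowfile helper and testdata_dir are hypothetical stand-ins for whatever accessor the package ends up using:

using Arrow, DataFrames, Test

# hypothetical helper: locate an .arrow file in the unpacked test-data artifact
arrowfile(name) = joinpath(testdata_dir, string(name, ".arrow"))

# hand the existing tests a DataFrame, as they currently expect
sleepstudy = DataFrame(Arrow.Table(arrowfile(:sleepstudy)))

@test sleepstudy isa DataFrame
@test nrow(sleepstudy) == 180      # 18 subjects × 10 days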

@dmbates dmbates requested a review from palday September 20, 2020 16:16
@dmbates (Collaborator Author) commented Sep 20, 2020

Ah, I learned a valuable lesson. I converted some floating-point values, like sleepstudy.reaction, from Float64 to Float32 to save space but, in the process, lost enough accuracy that tests started failing. I will need to create a new Artifacts.toml, but for the time being I will just upload the modified file to the Arrow folder on osf.io.
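For reference, the size of the rounding error involved (the value below is made up, not taken from the dataset):

x = 251.8051                # hypothetical Reaction-like value
y = Float64(Float32(x))     # round-trip through Float32
x == y                      # false
eps(Float32(x))             # ≈ 1.5e-5, so only about 7 significant digits survive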

@palday (Member) left a comment

I'm reluctant to merge this until the relevant Tables and Arrow versions are released, just because depending on #master will break the CIs.

One other thing I've been thinking about that will both make some things easier and other things harder is to have each dataset be a separate artifact. This will make adding additional datasets easier, but it means that e.g. datasets() will be harder to write.

@dmbates (Collaborator Author) commented Sep 21, 2020

> I'm reluctant to merge this until the relevant Tables and Arrow versions are released, just because depending on #master will break the CIs.

That was my intention. Sorry if I did not make that clear.

That is why I used the file name arrow.tar.gz instead of data.tar.gz in the osf.io archive. We can change back to an updated data.tar.gz later if it seems easier to keep a clean archive appearance.

I think that Jacob plans to release Arrow 1.0.0 and the updated Tables this week or early next week. Would you be willing to wait for them before releasing v3.0.0?

The reason why I want to do this is to eliminate indirect dependencies on CategoricalArrays, which is encountering problems with the compiler in both 1.5.2-test and 1.6.0-DEV, when assertions are enabled. It doesn't encounter failures in the release version 1.5.1 because assertions are not enabled in the release version but turning off checks to avoid errors doesn't really make the errors go away.

> One other thing I've been thinking about that will both make some things easier and other things harder is to have each dataset be a separate artifact. This will make adding additional datasets easier, but it means that e.g. datasets() will be harder to write.

I think having a separate artifact for each dataset would be overkill. Some of these Arrow files are about 1Kb. I don't think it is a problem to have different versions of the data.tar.gz file accessed by different versions of the MixedModels package. Am I missing something here?

@dmbates (Collaborator Author) commented Sep 21, 2020

BTW, I still haven't been able to come up with a way of accessing a file in Python using

import pyarrow as pa
df = pa.ipc.open_file(...)

because I don't know what open_file expects for its first argument when reading from a file.

@palday (Member) commented Sep 21, 2020

> That is why I used the file name arrow.tar.gz instead of data.tar.gz in the osf.io archive. We can change back to an updated data.tar.gz later if it seems easier to keep a clean archive appearance.

Perfect. :)

> I think that Jacob plans to release Arrow 1.0.0 and the updated Tables this week or early next week. Would you be willing to wait for them before releasing v3.0.0?

Yes. I think that actually makes good sense in terms of timing of everything.

> The reason why I want to do this is to eliminate indirect dependencies on CategoricalArrays, which is encountering problems with the compiler in both 1.5.2-test and 1.6.0-DEV, when assertions are enabled. It doesn't encounter failures in the release version 1.5.1 because assertions are not enabled in the release version but turning off checks to avoid errors doesn't really make the errors go away.

I saw that. Note that we still have some indirect dependence on CategoricalArrays via StatsModels and how things work for contrasts (although I think those code paths can be avoided by explicitly setting contrasts for grouping terms).

> One other thing I've been thinking about that will both make some things easier and other things harder is to have each dataset be a separate artifact. This will make adding additional datasets easier, but it means that e.g. datasets() will be harder to write.

> I think having a separate artifact for each dataset would be overkill. Some of these Arrow files are about 1Kb. I don't think it is a problem to have different versions of the data.tar.gz file accessed by different versions of the MixedModels package. Am I missing something here?

Nope, not at all. It wasn't a concern about the overall size as much as making it easier to add in individual datasets. Some parts of moving to a new artifact still seem quite brittle, so the thought was "if the old parts don't change when adding a new dataset, then only the new parts can be affected when everything is separate". I was also tending to the overkill perspective, so then that's easy. 😄

@dmbates (Collaborator Author) commented Sep 21, 2020

My plan is to work through the tests line by line, reading from a local cache of arrow files, until I can get the tests to run, then upload the revised arrow files and create a new version of arrow.tar.gz. I hope to only need to do the sha256 dance once more (but it probably won't work out that way).
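For context, the "sha256 dance" amounts to something like the following; the tarball name, URL, and artifact name are placeholders rather than the actual osf.io details:

using Pkg.Artifacts, SHA

tarball = "arrow.tar.gz"                      # hypothetical local tarball of the .arrow files
url     = "https://osf.io/XXXXX/download"     # placeholder download URL

# git-tree-sha1 of the unpacked contents
tree_hash = create_artifact() do dir
    run(`tar -xzf $tarball -C $dir`)
end

# sha256 of the tarball itself, needed for the download stanza
tarball_sha256 = bytes2hex(open(sha256, tarball))

bind_artifact!("Artifacts.toml", "TestData", tree_hash;
               download_info=[(url, tarball_sha256)], force=true)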

@dmbates (Collaborator Author) commented Sep 21, 2020

The current version of Arrow.jl does have an argument in place to compress the columns when writing an arrow file but, as far as I can see, that argument is just passed around and never really used. The way I have been reducing the size of the arrow files is to use PooledArray, which chooses the smallest size of unsigned integer that can be used for the refs, and by manually recoding integers to smaller sizes that preserve the information. As I mentioned above, switching from Float64 to Float32 risks losing accuracy and having tests fail, so I think I will avoid that.
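As an illustration of that approach (the column names and values here are made up, not taken from the actual datasets):

using PooledArrays, Arrow

subj = PooledArray(repeat(string.("S", 301:318), inner=10))  # 18 levels, so the refs need only a small unsigned type
days = Int8.(repeat(0:9, outer=18))                          # manually narrowed integer codes

Arrow.write("example.arrow", (subj = subj, days = days))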

@palday (Member) commented Sep 21, 2020

Regarding reading Arrow files in Python: I also can't figure out how to make it work. I'm particularly cranky on this point because the documentation for Arrow itself (not Arrow.jl) makes it clear that Arrow is a memory format, not a disk format. There are several options for disk formats, but Arrow.jl doesn't specify which one it uses. I've tried everything in the PyArrow and Pandas documentation that I can find, but at best I get an error and at worst I get a segfault (!):

import pyarrow as pa

with open("df.arrow","rb") as f:
    df = pa.ipc.open_file(f)

WARNING: Logging before InitGoogleLogging() is written to STDERR
F0921 16:33:40.940201 3501032 type.cc:466]  Check failed: int_type.is_signed() dictionary index type should be signed integer
*** Check failure stack trace: ***
Aborted (core dumped)

@dmbates (Collaborator Author) commented Sep 21, 2020

@palday Being able to read the Arrow files in Python/Pandas is not a big deal. I thought that if it were easy to do, it would be a good check. It is peculiar that it is not that easy, because Feather and, I think, Arrow were developed by Wes McKinney with Python/Pandas as the preferred implementation.

I have succeeded in writing Arrow from R (with the peculiarly named arrow::write_feather) and reading it in Julia, but the subset of Arrow that can be written and read by R is limited. I am fine with just treating Arrow as an archival format for ColumnTable types for the time being.

@dmbates (Collaborator Author) commented Sep 21, 2020

The error message from the Python code seems to object to the file being written with DictEncoding (categorical or pooled arrays) using unsigned integers, which is the default in PooledArrays. I guess at some time we could try creating PooledArrays with the ref type being, say, Int16 instead of UInt16. Seems peculiar to restrict refs, which are index values, to signed integers.
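A sketch of what that might look like, assuming the PooledArray constructor that takes an explicit ref type (not verified against the PooledArrays version in use here):

using PooledArrays

v  = repeat(["a", "b", "c"], 100)
pv = PooledArray(v, Int16)     # signed refs instead of the unsigned default
eltype(pv.refs)                # Int16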

@palday (Member) commented Sep 21, 2020

Python doesn't have a native unsigned type (I think -- I've certainly never used it).

I'm mostly just grumpy about 'universal' formats being less than universal. I know it's a hard problem, especially if the goal is to make blitting into memory possible.

@dmbates (Collaborator Author) commented Sep 22, 2020

The d2463d5 version passes tests locally. We will need to wait for Jacob to release Arrow v1.0.0. I encountered a "round-trip" problem with one of the datasets (apache/arrow-julia#24). We hope that won't be too much of a hang-up.

@dmbates (Collaborator Author) commented Sep 22, 2020

> Python doesn't have a native unsigned type (I think -- I've certainly never used it).

> I'm mostly just grumpy about 'universal' formats being less than universal. I know it's a hard problem, especially if the goal is to make blitting into memory possible.

The idea of 'universal' to me would imply that systems with fewer choices of integer and floating-point representations would make a good faith effort to convert representations they do not support to representations they do support. Of course, that may not be possible if, say, R gets data in a 64-bit integer representation. I guess we will need to use RCall or PyCall for a while still when pushing Julia data tables to other systems.

I do notice that when I write an arrow file from an R data.frame and then read it into Julia, I usually need to apply disallowmissing! to it.
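For example (the file name is hypothetical):

using Arrow, DataFrames

df = DataFrame(Arrow.Table("from_R.arrow"))
disallowmissing!(df)    # columns written by R tend to arrive as Union{T, Missing}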

@dmbates (Collaborator Author) commented Sep 29, 2020

I have checked that this version passes tests when run by hand. @quinnj is making substantial enhancements in Arrow.jl to allow for compression. The Zstd compression works well with some of our larger data sets:

-rw-rw-r-- 1 bates bates  585890 Sep 29 10:51 d3a.arrow
-rw-rw-r-- 1 bates bates  844314 Sep 19 18:19 d3.arrow
-rw-rw-r-- 1 bates bates  200570 Sep 29 10:55 instevala.arrow
-rw-rw-r-- 1 bates bates  700874 Sep 19 18:57 insteval.arrow
-rw-rw-r-- 1 bates bates 5080346 Sep 19 21:25 ml1m.arrow
-rw-rw-r-- 1 bates bates 1623002 Sep 27 12:06 ml1mb.arrow

The last one is particularly impressive: there are over 1 million rows in ml1m (a MovieLens sample of 1 million ratings), each row a total of 5 bytes plus overhead, and it gets compressed to 1.6 MB.

I think it will be worthwhile waiting until @quinnj is ready to release Arrow.jl v1.0.0.

@dmbates (Collaborator Author) commented Oct 2, 2020

I have a new set of .arrow files ready to go. These use the :zstd compression on the larger files. They require the master branch of https://github.com/JuliaData/Arrow.jl to perform the decompression. I plan to wait until Jacob releases Arrow v1.0.0 before uploading them and recreating Artifacts.toml.
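Writing the compressed files is then along these lines (the table and file names are placeholders, and the exact keyword may differ between Arrow.jl versions):

using Arrow

Arrow.write("insteval.arrow", insteval; compress = :zstd)   # larger tables get zstd compression
Arrow.write("dyestuff.arrow", dyestuff)                     # small tables are left uncompressed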

@dmbates (Collaborator Author) commented Oct 4, 2020

Jacob has now released Arrow v0.3.0 so I will work on getting this PR ready to go today.

@dmbates (Collaborator Author) commented Oct 4, 2020

> BTW, I still haven't been able to come up with a way of accessing a file in Python using
>
> import pyarrow as pa
> df = pa.ipc.open_file(...)
>
> because I don't know what open_file expects for its first argument when reading from a file.

In the tests subdirectory of the Arrow package, Jacob has examples of round-tripping Arrow to Python and back.

@codecov (bot) commented Oct 4, 2020

Codecov Report

Merging #382 into master will increase coverage by 0.42%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #382      +/-   ##
==========================================
+ Coverage   94.90%   95.33%   +0.42%     
==========================================
  Files          23       23              
  Lines        1591     1606      +15     
==========================================
+ Hits         1510     1531      +21     
+ Misses         81       75       -6     
Impacted Files Coverage Δ
src/MixedModels.jl 100.00% <ø> (ø)
src/linearmixedmodel.jl 98.87% <100.00%> (-0.04%) ⬇️
src/utilities.jl 98.95% <100.00%> (+0.20%) ⬆️
src/femat.jl 100.00% <0.00%> (ø)
src/bootstrap.jl 98.09% <0.00%> (ø)
src/randomeffectsterm.jl 94.11% <0.00%> (ø)
src/remat.jl 95.37% <0.00%> (+0.11%) ⬆️
... and 1 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@dmbates (Collaborator Author) commented Oct 4, 2020

This version passes tests, except that it hangs on the v1.6.0-DEV tests. I have a new upload of the datasets to the osf.io site and a version 3 of arrow.tar.gz, but using that one fails. The d3.arrow file seems to cause problems with the levels of one of the factors, h; trying to display the table results in an out-of-memory error.

@palday Can we squash and merge this version and worry about uploading new copies of the arrow files later? To me this is the big blocker for release 3.

@@ -68,7 +68,8 @@ function LinearMixedModel(
     y, Xs = modelcols(form, tbl)

     y = reshape(float(y), (:, 1)) # y as a floating-point matrix
-    T = eltype(y)
+    T = promote_type(Float64, eltype(y)) # ensure that eltype of model matrices is at least Float64
@palday (Member) commented on the diff

Should we mention this in the docs anywhere? In some areas, there is a tendency to use half precision to speed things up (although that matters less for x86-64 and modern ARM, it can make a difference on GPU, which we of course don't use).
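For what it's worth, the promotion means the model matrices end up in at least double precision, whatever the stored eltype of the response:

promote_type(Float64, Float32)    # Float64 — single precision is promoted up
promote_type(Float64, BigFloat)   # BigFloat — higher precision is preserved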

@palday (Member) commented

(this is my last comment, then I'm happy to merge)

@dmbates dmbates merged commit b7402e2 into master Oct 5, 2020
@dmbates dmbates deleted the arrowconversion branch October 5, 2020 14:32