ARROW-10636: [Rust][Parquet] Switch to Rust Stable by removing specialization in parquet #8698

GregBowyer · 2020-11-18T03:18:18Z

https://issues.apache.org/jira/browse/ARROW-10636

This is a very initial attempt at removing the specialization features from the Rust Parquet implementation.

The specialisation is too complex to be covered by min_specialization and requires a bit of reworking in the crate.

Right now the code dispatches in sub-traits and methods on the Parquet type, and uses a combination of trait abuse, macros and transmutes to eliminate the feature.

I have broken this up into several commits ranging from the simplest removals (which could probably be taken fairly easily) to the most ugly and complex.

I am not stoked on the transmute abuse, and I think another take (or follow up) should be taken to remove as many as possible in the code.

The general trait for DataType::T has been made a private sealed trait to make it impossible to implement external to the Parquet crate. This is intentional, as I don't think many of the public interfaces are sensible for end users to be able to implement.
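For readers unfamiliar with the sealed-trait idiom mentioned above, here is a minimal sketch of how it can replace specialization on stable Rust. All names (`private::Sealed`, `ParquetValue`, `physical_name`, `describe`) are hypothetical and chosen for illustration; they are not the identifiers used in the Parquet crate.

```rust
// Hypothetical sketch of the sealed-trait idiom; not the crate's real code.
mod private {
    // `Sealed` is `pub`, but the module is private, so code outside this
    // crate cannot name the trait and therefore cannot implement it.
    pub trait Sealed {}
    impl Sealed for bool {}
    impl Sealed for i32 {}
    impl Sealed for i64 {}
}

/// Public trait with a sealed supertrait: downstream crates may use it as a
/// bound, but only the types listed in `private` can implement it.
pub trait ParquetValue: private::Sealed {
    /// Per-type behaviour that a specialized impl would otherwise select.
    fn physical_name() -> &'static str;
}

impl ParquetValue for bool {
    fn physical_name() -> &'static str {
        "BOOLEAN"
    }
}

impl ParquetValue for i32 {
    fn physical_name() -> &'static str {
        "INT32"
    }
}

impl ParquetValue for i64 {
    fn physical_name() -> &'static str {
        "INT64"
    }
}

/// Generic code dispatches through the trait method on the value type
/// instead of relying on `#![feature(specialization)]`.
pub fn describe<T: ParquetValue>(_value: &T) -> &'static str {
    T::physical_name()
}

fn main() {
    println!("{}", describe(&true));
    println!("{}", describe(&42i32));
    println!("{}", describe(&7i64));
}
```

Because the set of implementors is closed, every type-specific code path can live in a trait method, which is what makes the nightly-only feature unnecessary in this sketch.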

TODO:

Purge the added std::mem::transmutes if possible
Refine and rationalise the unimplemented! implementations
Performance test?
Rebase & Relabel commits with JIRA number

github-actions · 2020-11-18T03:31:00Z

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

GregBowyer · 2020-11-18T03:31:11Z

rust/parquet/src/data_type.rs

@@ -95,6 +96,12 @@ impl From<Vec<u32>> for Int96 {
    }
 }

+impl fmt::Display for Int96 {
+    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
+        write!(f, "{:?}", self.data())


Minor: Maybe as a followup this should print the number rather than its raw bytes?
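As a rough illustration of that follow-up idea, the sketch below prints a 96-bit value as a single number. `Int96Sketch` is a hypothetical stand-in for the crate's `Int96`; the three-`u32` layout, the least-significant-word-first order, and the unsigned interpretation are all assumptions made for the example.

```rust
use std::fmt;

/// Hypothetical stand-in for `Int96`: three 32-bit words.
struct Int96Sketch {
    words: [u32; 3],
}

impl Int96Sketch {
    /// Combine the words into one integer, treating `words[0]` as the least
    /// significant word (an assumption, not confirmed by the PR).
    fn to_u128(&self) -> u128 {
        (self.words[0] as u128)
            | ((self.words[1] as u128) << 32)
            | ((self.words[2] as u128) << 64)
    }
}

impl fmt::Display for Int96Sketch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Print the combined number rather than the raw word array.
        write!(f, "{}", self.to_u128())
    }
}

fn main() {
    let value = Int96Sketch { words: [1, 2, 0] };
    println!("{}", value);
}
```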

github-actions · 2020-11-18T03:56:25Z

https://issues.apache.org/jira/browse/ARROW-10636

GregBowyer · 2020-11-18T04:27:25Z

I think I have a solution with std::any::Any for most of the transmutes
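A minimal sketch of what replacing a transmute with `std::any::Any` can look like, assuming the goal is to recover a concrete type inside generic code. The helper names `as_i64` and `sum_if_i64` are hypothetical and do not come from the PR.

```rust
use std::any::Any;

/// Returns the value as an `i64` when `T` is exactly `i64`.
/// The check happens at run time, with no `unsafe` and no transmute.
fn as_i64<T: Any>(value: &T) -> Option<i64> {
    (value as &dyn Any).downcast_ref::<i64>().copied()
}

/// The same idea applied to a whole buffer: borrow it as `Vec<i64>` only if
/// that really is its type, otherwise return `None`.
fn sum_if_i64<T: Any>(values: &Vec<T>) -> Option<i64> {
    let typed = (values as &dyn Any).downcast_ref::<Vec<i64>>()?;
    Some(typed.iter().sum())
}

fn main() {
    println!("{:?}", as_i64(&5i64));
    println!("{:?}", as_i64(&5i32));
    println!("{:?}", sum_if_i64(&vec![1i64, 2, 3]));
}
```

The trade-off is an extra type-id comparison per downcast, in exchange for the compiler rejecting a mismatched type instead of silently reinterpreting memory.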

sunchao · 2020-11-18T06:04:22Z

Thanks @GregBowyer for the PR. It will be awesome if we can get this done. I think before merging this we should get some benchmark results showing the diff before & after. We used to have some benchmarks in the old repo but haven't ported them to arrow yet.

GregBowyer · 2020-11-18T07:02:20Z

I think I pulled all the transmutes out, so that's good.

Thanks @GregBowyer for the PR. It will be awesome if we can get this done. I think before merging this we should get some benchmark results showing the diff before & after. We used to have some benchmarks in the old repo but haven't ported them to arrow yet.

I will just pull them in tomorrow and get some baselines; I don't think it will be that hard.

I doubt there will be anything here for performance. There are some more branches now, but I suspect they predict effortlessly, and on the flip side we are not chasing things around the I-Cache, so I suspect it's a wash in the end. However, better checked than assumed.

alamb

I looked carefully at the first half of this PR and it looks quite good. By my reading this has also removed the use of transmute.

Assuming this looks good from the benchmark perspective as mentioned by @sunchao I think this is a great step forward. Thank you @GregBowyer

alamb · 2020-11-18T16:12:56Z

rust/parquet/src/data_type.rs

+    impl_from_raw!(f32, self => { Err(general_err!("Type cannot be converted to i64")) });
+    impl_from_raw!(f64, self => { Err(general_err!("Type cannot be converted to i64")) });


Suggested change

-    impl_from_raw!(f32, self => { Err(general_err!("Type cannot be converted to i64")) });
-    impl_from_raw!(f64, self => { Err(general_err!("Type cannot be converted to i64")) });
+    impl_from_raw!(f32, self => { Err(general_err!("Type f32 cannot be converted to i64")) });
+    impl_from_raw!(f64, self => { Err(general_err!("Type f64 cannot be converted to i64")) });

alamb · 2020-11-18T16:53:07Z

Here is a potential contribution to this effort: #8708 (a PR with the benchmarks ported -- fyi @GregBowyer).

To run:

cd arrow/rust/parquet
cargo bench

jorgecarleitao · 2020-11-19T18:54:27Z

I do not have time to review this, but having tried this myself once (and failed miserably), I am just leaving a big thank you note to @GregBowyer for this 😍

GregBowyer · 2020-11-24T02:58:29Z

Benchmarking this with the changes in alamb#2 shows it is slower than specialisation (I ran the benchmarks many times, as they are very noisy). I have solutions for the speed in encoding (boolean and int96 are still to be solved but shouldn't be hard); I will work on decoding shortly.

GregBowyer · 2020-12-02T02:02:41Z

I have been working on this w.r.t. performance, and I think I have most parts performing better than the original. I am running off clean benchmarks right now to validate.

There are a few ancillary tweaks I have made (I will call them out in the PR review) that gain some additional performance (I spotted some pipeline hazards in perf), but nothing crazy.

alamb · 2020-12-02T11:40:34Z

I have been working on this w.r.t. performance, and I think I have most parts performing better than the original. I am running off clean benchmarks right now to validate.

I am very excited to see it. Thank you so much @GregBowyer

nevi-me · 2020-12-03T02:01:15Z

.github/workflows/rust.yml

@@ -50,7 +50,10 @@ jobs:
    strategy:
      fail-fast: false
      matrix:
-        rust: [nightly-2020-11-24]


I know this is still in-flight, but the better strategy here is to run the Parquet tests with cargo +stable test as we also add the stable toolchain to compile arrow with. You might need to rebase to see that in the dev scripts, as this change was added last week.

I am ok with that, I am fighting the CI anyhow and any assistance is welcome.

I'll submit a PR on your branch

I think the safest option is to revert the CI changes, then we can add parquet to stable tests separately in #8821

Changes reverted; I will probably need help getting the small CI parts fixed up.

GregBowyer · 2020-12-12T00:51:13Z

OK, review comments have been addressed.

I separated out the benchmark code from this PR and rebased it to match the comment style in git.

Once CI can be made to function, I am good with the PR. I am thinking of following up with a round of performance tuning in compression and delta_fixed_bin.

codecov-io · 2020-12-12T01:07:25Z

Codecov Report

Merging #8698 (f8f9749) into master (edff65d) will increase coverage by 7.47%.
The diff coverage is 84.08%.

@@            Coverage Diff             @@
##           master    #8698      +/-   ##
==========================================
+ Coverage   76.81%   84.28%   +7.47%     
==========================================
  Files         181      194      +13     
  Lines       40985    47951    +6966     
==========================================
+ Hits        31483    40417    +8934     
+ Misses       9502     7534    -1968

Impacted Files	Coverage Δ
rust/parquet/src/record/triplet.rs	`93.24% <0.00%> (ø)`
rust/parquet/src/data_type.rs	`80.29% <75.64%> (-4.32%)`	⬇️
rust/parquet/src/encodings/decoding.rs	`92.49% <86.91%> (+1.03%)`	⬆️
rust/parquet/src/encodings/encoding.rs	`95.59% <91.02%> (+2.41%)`	⬆️
rust/parquet/src/encodings/rle.rs	`93.13% <93.33%> (+0.10%)`	⬆️
rust/parquet/src/arrow/arrow_reader.rs	`90.58% <100.00%> (ø)`
rust/parquet/src/arrow/converter.rs	`62.96% <100.00%> (+7.40%)`	⬆️
rust/parquet/src/column/writer.rs	`94.02% <100.00%> (+0.12%)`	⬆️
rust/parquet/src/file/statistics.rs	`93.80% <100.00%> (-0.60%)`	⬇️
rust/parquet/src/util/bit_util.rs	`92.39% <100.00%> (+0.14%)`	⬆️
... and 102 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

jorgecarleitao · 2020-12-13T09:36:13Z

Thanks a lot, @GregBowyer ! I think that this is almost done here.

Two comments:

It seems that there are some changes on the submodule testing committed to this PR that I am not sure are intended. Could you double check? Typically this happens when git add -u is called before git submodule update.
I have been updating the CI for this PR here, and in particular this patch.

jorgecarleitao · 2020-12-14T14:20:28Z

A small update: the CI runs successfully for everything except Clippy (issue unrelated to this PR) 🎉 🎉🎉

I filed #8909, which addresses the Clippy warning.

I see two options:

incorporate the patch with the CI changes mentioned above and merge this PR with those changes
merge this as is now and PR the change to the CI separately

Let me know what you prefer, @GregBowyer 😃

alamb · 2020-12-14T18:35:26Z

I suggest merging this PR as is and then putting in the CI change as a follow on PR (this one is already large and long outstanding). Just let us know @GregBowyer

Remove specialization from parquet-rs. This allows the codebase to be compiled with stable

GregBowyer · 2020-12-15T00:29:45Z

I am ok merging as is and putting a fixup in later.

I pushed commits that should not alter the testing submodule

alamb

I gave this one more once-over and it still looks good to me. Merging this in.

alamb · 2020-12-15T20:50:51Z

Filed https://issues.apache.org/jira/browse/ARROW-10929 to track migrating CI to stable rust

alamb · 2020-12-15T20:58:29Z

I did one last local test run after merging from master to make sure the tests passed. They did. So I'll also send a note over to the dev channel. Again thanks so much for this work @GregBowyer

cc @sunchao @andygrove @jorgecarleitao @vertexclique @jhorstmann @Dandandan

alamb · 2020-12-15T20:59:26Z

I filed https://issues.apache.org/jira/browse/ARROW-10931 to track follow on work to improve compressors

Dandandan · 2020-12-15T21:04:55Z

Wow, awesome work everyone 💯

Update the docs with respect to changes after #8698

Closes #8931 from alamb/alamb/ARROW-10933-update-docs

Authored-by: Andrew Lamb
Signed-off-by: Andy Grove

@jorgecarleitao

This is a cherry-pick of https://github.com/jorgecarleitao/arrow/commit/ca66d6d945e265dd2c83464bd80ff1dd7d231f7c by @jorgecarleitao

It runs all tests except the SIMD tests using `stable` -- the SIMD feature still requires nightly Rust, but the default features do not (after #8698)

Update: It also silences a few clippy lints which start complaining on stable -- I'll comment inline

Closes #8930 from alamb/ARROW-10929-stable-ci

Lead-authored-by: Andrew Lamb
Co-authored-by: Jorge C. Leitao
Signed-off-by: Jorge C. Leitao

…lization in parquet

https://issues.apache.org/jira/browse/ARROW-10636

This is a very initial attempt at removing the specialization features from the Rust Parquet implementation.

The specialisation is too complex to be covered by `min_specialization` and requires a bit of reworking in the crate.

Right now the code dispatches in sub-traits and methods on the Parquet type, and uses a combination of trait abuse, macros and transmutes to eliminate the feature.

I have broken this up into several commits ranging from the simplest removals (which could probably be taken fairly easily) to the most ugly and complex.

I am not stoked on the `transmute` abuse, and I think another take (or follow up) should be taken to remove as many as possible in the code.

The general trait for `DataType::T` has been made a private sealed trait to make it impossible to implement external to the Parquet crate. This is intentional, as I don't think many of the public interfaces are sensible for end users to be able to implement.

# TODO:

- [x] Purge the added `std::mem::transmute`s if possible
- [x] Refine and rationalise the `unimplemented!` implementations
- [x] Performance test?
- [x] Rebase & Relabel commits with JIRA number

Closes apache#8698 from GregBowyer/remove-rust-specialization

Authored-by: Greg Bowyer
Signed-off-by: Andrew Lamb

github-actions bot added Component: Rust Component: Parquet labels Nov 18, 2020

GregBowyer commented Nov 18, 2020

GregBowyer changed the title ~~Remove rust specialization~~ ARROW-10636: [Rust][Parquet] Remove rust specialization Nov 18, 2020

GregBowyer force-pushed the remove-rust-specialization branch from 94f58d0 to ba7d30b Compare November 18, 2020 03:42

nevi-me requested review from sunchao and nevi-me and removed request for sunchao November 18, 2020 05:44

alamb reviewed Nov 18, 2020

alamb mentioned this pull request Nov 18, 2020

ARROW-10647: [Rust] [Parquet] Port benchmarks from from parquet-rs to arrow repo #8708

Closed

alamb mentioned this pull request Nov 19, 2020

ARROW-10653: [Rust] Update toolchain nightly #8713

Closed

github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Nov 25, 2020

GregBowyer force-pushed the remove-rust-specialization branch from 6387b54 to 10fd951 Compare December 2, 2020 22:35

github-actions bot removed the needs-rebase A PR that needs to be rebased by the author label Dec 2, 2020

nevi-me reviewed Dec 3, 2020
