Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-10636: [Rust][Parquet] Switch to Rust Stable by removing specialization in parquet #8698

Closed

Conversation

GregBowyer
Copy link
Contributor

@GregBowyer GregBowyer commented Nov 18, 2020

https://issues.apache.org/jira/browse/ARROW-10636

This is a very initial attempt at removing the specialization features from the Rust Parquet implementation.

The specialisation is too complex to be covered by min_specialization and requires a bit of reworking in the crate.

Right now the code dispatches in sub-traits and methods on the Parquet type, and uses a combination of trait abuse, macros and transmutes to eliminate the feature.

I have broken this up into several commits ranging from the simplest removals (which could probably be taken fairly easily) to the most ugly and complex.

I am not stoked on the transmute abuse, and I think another take (or follow up) should be taken to remove as many as possible in the code.

The general trait for DataType::T has been made a private sealed trait to make it impossible to implement external to the Parquet crate, this is intentional as I dont think many of the public interfaces are sensible for end users to be able to implement.

TODO:

  • Purge the added std::mem::transmutes if possible
  • Refine and rationalise the unimplemented! implementations
  • Performance test?
  • Rebase & Relabel commits with JIRA number

@github-actions
Copy link

Thanks for opening a pull request!

Could you open an issue for this pull request on JIRA?
https://issues.apache.org/jira/browse/ARROW

Then could you also rename pull request title in the following format?

ARROW-${JIRA_ID}: [${COMPONENT}] ${SUMMARY}

See also:

@@ -95,6 +96,12 @@ impl From<Vec<u32>> for Int96 {
}
}

impl fmt::Display for Int96 {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(f, "{:?}", self.data())
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: Maybe as a followup this should print the number rather than its raw bytes?

@GregBowyer GregBowyer changed the title Remove rust specialization ARROW-10636: [Rust][Parquet] Remove rust specialization Nov 18, 2020
@GregBowyer GregBowyer force-pushed the remove-rust-specialization branch from 94f58d0 to ba7d30b Compare November 18, 2020 03:42
@github-actions
Copy link

@GregBowyer
Copy link
Contributor Author

I think I have a solution with std::any::Any for most of the transmutes

@nevi-me nevi-me requested review from sunchao and nevi-me and removed request for sunchao November 18, 2020 05:44
@sunchao
Copy link
Member

sunchao commented Nov 18, 2020

Thanks @GregBowyer for the PR. It's will be awesome if we can get this done. I think before merging this we should get some benchmark results showing the diff before & after. We use to have some benchmark in the old repo but haven't port them to arrow yet.

@GregBowyer
Copy link
Contributor Author

I think I pulled all the transmutes out so thats good

Thanks @GregBowyer for the PR. It's will be awesome if we can get this done. I think before merging this we should get some benchmark results showing the diff before & after. We use to have some benchmark in the old repo but haven't port them to arrow yet.

I will just pull them in tomorrow and get some baselines, I dont think it will be that hard.

I doubt there will be anything here for performance. I mean there are some more branches now, but I suspect the predict effortlessly, and on the flip side we are not chasing things around the I-Cache so I suspect its a wash in the end. However better checked than assumed.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked carefully at the first half of this PR and it looks quite good. By my reading this has also removed the use of transmute

Assuming this looks good from the benchmark perspective as mentioned by @sunchao I think this is a great step forward. Thank you @GregBowyer

rust/parquet/src/column/writer.rs Show resolved Hide resolved
rust/parquet/src/data_type.rs Outdated Show resolved Hide resolved
Comment on lines +582 to +784
impl_from_raw!(f32, self => { Err(general_err!("Type cannot be converted to i64")) });
impl_from_raw!(f64, self => { Err(general_err!("Type cannot be converted to i64")) });
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
impl_from_raw!(f32, self => { Err(general_err!("Type cannot be converted to i64")) });
impl_from_raw!(f64, self => { Err(general_err!("Type cannot be converted to i64")) });
impl_from_raw!(f32, self => { Err(general_err!("Type f32 cannot be converted to i64")) });
impl_from_raw!(f64, self => { Err(general_err!("Type f64 cannot be converted to i64")) });

rust/rust-toolchain Outdated Show resolved Hide resolved
@alamb
Copy link
Contributor

alamb commented Nov 18, 2020

Here is a potential contribution to this effort: #8708 (a PR with the bench marks ported -- fyi @GregBowyer ).

To run:

cd arrow/rust/parquet
cargo bench

@jorgecarleitao
Copy link
Member

I do not have time to review this, but having tried this myself once (and failed miserably), I am just leaving a big thank you note to @GregBowyer for this 😍

@GregBowyer
Copy link
Contributor Author

As per benchmarking this with the changes in alamb#2 this is slower than specialisation (I ran benchmarks a lot as its very noisy). I have solutions for the speed in encoding (boolean and int96 are to be solved but shouldn't be hard) I will work on decoding shortly.

@github-actions github-actions bot added the needs-rebase A PR that needs to be rebased by the author label Nov 25, 2020
@GregBowyer
Copy link
Contributor Author

I have been working on this w.r.t performance, I think I have most parts performing better than the original. I am running off clean benchmarks right now to validate.

There are a few ancillary tweaks I have made (I will call out in the PR review) that gain some additional performance (I spotted some pipeline hazards in perf) but nothing crazy

@alamb
Copy link
Contributor

alamb commented Dec 2, 2020

I have been working on this w.r.t performance, I think I have most parts performing better than the original. I am running off clean benchmarks right now to validate.

I am very excited to see it. Thank you so much @GregBowyer

@GregBowyer GregBowyer force-pushed the remove-rust-specialization branch from 6387b54 to 10fd951 Compare December 2, 2020 22:35
@github-actions github-actions bot removed the needs-rebase A PR that needs to be rebased by the author label Dec 2, 2020
@@ -50,7 +50,10 @@ jobs:
strategy:
fail-fast: false
matrix:
rust: [nightly-2020-11-24]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is still in-flight, but the better strategy here is to run the Parquet tests with cargo +stable test as we also add the stable toolchain to compile arrow with. You might need to rebase to see that in the dev scripts, as this change was added last week.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am ok with that, I am fighting the CI anyhow and any assistance is welcome.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll submit a PR on your branch

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the safest option is to revert the CI changes, then we can add parquet to stable tests separately in #8821

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changes reverted, I will probably need help getting the small CI parts fixed up

@GregBowyer GregBowyer force-pushed the remove-rust-specialization branch from 918dcbd to f8f9749 Compare December 12, 2020 00:50
@GregBowyer
Copy link
Contributor Author

Ok review comments been addressed.

I seperated out the benchmark code from this PR and rebased it to match comment style in git.

I think when CI can be made to function I am good with the PR, I am thinking of following up with a round of performance tuning in compression and delta_fixed_bin

@codecov-io
Copy link

Codecov Report

Merging #8698 (f8f9749) into master (edff65d) will increase coverage by 7.47%.
The diff coverage is 84.08%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #8698      +/-   ##
==========================================
+ Coverage   76.81%   84.28%   +7.47%     
==========================================
  Files         181      194      +13     
  Lines       40985    47951    +6966     
==========================================
+ Hits        31483    40417    +8934     
+ Misses       9502     7534    -1968     
Impacted Files Coverage Δ
rust/parquet/src/record/triplet.rs 93.24% <0.00%> (ø)
rust/parquet/src/data_type.rs 80.29% <75.64%> (-4.32%) ⬇️
rust/parquet/src/encodings/decoding.rs 92.49% <86.91%> (+1.03%) ⬆️
rust/parquet/src/encodings/encoding.rs 95.59% <91.02%> (+2.41%) ⬆️
rust/parquet/src/encodings/rle.rs 93.13% <93.33%> (+0.10%) ⬆️
rust/parquet/src/arrow/arrow_reader.rs 90.58% <100.00%> (ø)
rust/parquet/src/arrow/converter.rs 62.96% <100.00%> (+7.40%) ⬆️
rust/parquet/src/column/writer.rs 94.02% <100.00%> (+0.12%) ⬆️
rust/parquet/src/file/statistics.rs 93.80% <100.00%> (-0.60%) ⬇️
rust/parquet/src/util/bit_util.rs 92.39% <100.00%> (+0.14%) ⬆️
... and 102 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update edff65d...f8f9749. Read the comment docs.

@jorgecarleitao
Copy link
Member

Thanks a lot, @GregBowyer ! I think that this is almost done here.

Two comments:

  1. It seems that there are some changes on the submodule testing committed to this PR that I am not sure are intended. Could you double check? Typically this happens when git add -u is called before git submodule update.

  2. I have been updating the CI for this PR here, and in particular this patch.

@jorgecarleitao
Copy link
Member

A small update: the CI runs successfully for everything except Clippy (issue unrelated to this PR) 🎉 🎉🎉

I fielded #8909 that addresses the Clippy warning.

I see two options:

  1. incorporate the patch with the CI changes mentioned above and merge this PR with those changes
  2. merge this as is now and PR the change to the CI separately

Let me know what you prefer, @GregBowyer 😃

@alamb
Copy link
Contributor

alamb commented Dec 14, 2020

I suggest merging this PR as is and then putting in the CI change as a follow on PR (this one is already large and long outstanding). Just let us know @GregBowyer

Remove specialization from parquet-rs. This allows the codebase to be
compiled with stable
@GregBowyer GregBowyer force-pushed the remove-rust-specialization branch from aeda0f2 to 816b129 Compare December 15, 2020 00:29
@GregBowyer
Copy link
Contributor Author

I am ok merging as is and putting a fixup in later.

I pushed commits that should not alter the testing submodule

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I gave this one more once over and it (still looks good to me). Merging this in

@alamb
Copy link
Contributor

alamb commented Dec 15, 2020

Filed https://issues.apache.org/jira/browse/ARROW-10929 to track migrating CI to stable rust

@alamb alamb closed this in d202f1b Dec 15, 2020
@alamb
Copy link
Contributor

alamb commented Dec 15, 2020

I did one last local test run after merging from master to make sure the tests passed. They did. So :shipit: I'll also send a note over to the dev channel. Again thanks so much for this work @GregBowyer

cc @sunchao @andygrove @jorgecarleitao @vertexclique @jhorstmann @Dandandan

@alamb
Copy link
Contributor

alamb commented Dec 15, 2020

I filed https://issues.apache.org/jira/browse/ARROW-10931 to track follow on work to improve compressors

@Dandandan
Copy link
Contributor

Wow, awesome work everyone 💯

andygrove pushed a commit that referenced this pull request Dec 16, 2020
Update the docs with respect to changes after #8698

Closes #8931 from alamb/alamb/ARROW-10933-update-docs

Authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Andy Grove <[email protected]>
jorgecarleitao added a commit that referenced this pull request Dec 21, 2020
This is a cherry-pick of https://github.com/jorgecarleitao/arrow/commit/ca66d6d945e265dd2c83464bd80ff1dd7d231f7c by @jorgecarleitao

It runs all tests except the simd using `stable` -- The SIMD feature still require nightly rust, but the default features do not (after #8698)

Update: It also silences a few clippy lints which start complaining on stable -- I'll comment inline

Closes #8930 from alamb/ARROW-10929-stable-ci

Lead-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
alamb added a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021
Update the docs with respect to changes after apache/arrow#8698

Closes #8931 from alamb/alamb/ARROW-10933-update-docs

Authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Andy Grove <[email protected]>
alamb added a commit to apache/arrow-rs that referenced this pull request Apr 20, 2021
This is a cherry-pick of https://github.com/jorgecarleitao/arrow/commit/ca66d6d945e265dd2c83464bd80ff1dd7d231f7c by @jorgecarleitao

It runs all tests except the simd using `stable` -- The SIMD feature still require nightly rust, but the default features do not (after apache/arrow#8698)

Update: It also silences a few clippy lints which start complaining on stable -- I'll comment inline

Closes #8930 from alamb/ARROW-10929-stable-ci

Lead-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
…lization in parquet

https://issues.apache.org/jira/browse/ARROW-10636

This is a very initial attempt at removing the specialization features from the Rust Parquet implementation.

The specialisation is too complex to be covered by `min_specialization` and requires a bit of reworking in the crate.

Right now the code dispatches in sub-traits and methods on the Parquet type, and uses a combination of trait abuse, macros and transmutes to eliminate the feature.

I have broken this up into several commits ranging from the simplest removals (which could probably be taken fairly easily) to the most ugly and complex.

I am not stoked on the `transmute` abuse, and I think another take (or follow up) should be taken to remove as many as possible in the code.

The general trait for `DataType::T` has been made a private sealed trait to make it impossible to implement external to the Parquet crate, this is intentional as I dont think many of the public interfaces are sensible for end users to be able to implement.

# TODO:
- [x] Purge the added `std::mem::transmute`s if possible
- [x] Refine and rationalise the `unimplemented!` implementations
- [x] Performance test?
- [x] Rebase & Relabel commits with JIRA number

Closes apache#8698 from GregBowyer/remove-rust-specialization

Authored-by: Greg Bowyer <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
Update the docs with respect to changes after apache#8698

Closes apache#8931 from alamb/alamb/ARROW-10933-update-docs

Authored-by: Andrew Lamb <[email protected]>
Signed-off-by: Andy Grove <[email protected]>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
This is a cherry-pick of https://github.com/jorgecarleitao/arrow/commit/ca66d6d945e265dd2c83464bd80ff1dd7d231f7c by @jorgecarleitao

It runs all tests except the simd using `stable` -- The SIMD feature still require nightly rust, but the default features do not (after apache#8698)

Update: It also silences a few clippy lints which start complaining on stable -- I'll comment inline

Closes apache#8930 from alamb/ARROW-10929-stable-ci

Lead-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Jorge C. Leitao <[email protected]>
Signed-off-by: Jorge C. Leitao <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants