Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split out arrow-buffer crate (#2594) #2693

Merged
merged 6 commits into from
Sep 15, 2022

Conversation

tustvold
Copy link
Contributor

@tustvold tustvold commented Sep 9, 2022

Which issue does this PR close?

Part of #2594.

Rationale for this change

Begins the process of gradually exploding this crate, to help with compile times, codesize, etc...

What changes are included in this PR?

Splits the arrow buffer types into their own crate. This is mostly mechanical but there are some breaking changes

Are there any user-facing changes?

Yes, hopefully none that are controversial

@tustvold tustvold added the api-change Changes to the arrow API label Sep 9, 2022
@github-actions github-actions bot added the arrow Changes to the arrow crate label Sep 9, 2022

#[inline]
unsafe fn null_pointer<T: NativeType>() -> NonNull<T> {
NonNull::new_unchecked(ALIGNMENT as *mut T)
unsafe fn null_pointer() -> NonNull<u8> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is technically a breaking change, it appears to relate to early experiments by @jorgecarleitao r.e. arrow2. It isn't used anywhere, and so I think can just go. The removal is so that arrow-buffer doesn't depend on DataType

@@ -38,15 +38,15 @@ fn create_buffer(size: usize) -> Buffer {
}

fn bench_buffer_and(left: &Buffer, right: &Buffer) {
criterion::black_box((left & right).unwrap());
criterion::black_box(buffer_bin_and(left, 0, right, 0, left.len() * 8));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the other breaking change, I opted to remove the BitAnd, etc... implementations for Buffer for two reasons:

  • They were very easy to accidentally forget to apply the offset
  • They need an error enumeration, and defining an error type in arrow-buffer solely for this case seemed overkill

Some(right_bitmap) => {
// NOTE: right values and bitmaps are combined and stay at bit offset right.offset()
(right.values() & &right_bitmap.bits).ok().map(|b| b.not())
let and = buffer_bin_and(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic is not only simpler now, but avoids performing work on values past the offset

@tustvold tustvold mentioned this pull request Sep 9, 2022
Comment on lines +21 to +27
mod immutable;
pub use immutable::*;
mod mutable;
pub use mutable::*;
mod ops;
mod scalar;
pub use scalar::*;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i wonder if it makes sense to just collapse the /buffer/ folder to top level given arrow_buffer crate name gives that level of indirection already

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think as the crate still contains other public modules, like alloc, it would be confusing to lift the contents of buffer to the top-level

@tustvold tustvold requested a review from viirya September 10, 2022 20:46
unsafe {
if size == 0 {
null_pointer()
} else {
let size = size * size_of::<T>();
let size = size;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let size = size;

}

/// Allocates a cache-aligned memory region of `size` bytes with uninitialized values.
/// This is more performant than using [allocate_aligned_zeroed] when all bytes will have
/// an unknown or non-zero value and is semantically similar to `malloc`.
pub fn allocate_aligned<T: NativeType>(size: usize) -> NonNull<T> {
pub fn allocate_aligned(size: usize) -> NonNull<u8> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the caller needs to compute the allocation size by the type T they want?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, although nothing was actually using this API to allocate anything other than bytes

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it was also related to a time when there were fewer things that were ArrowNativeType (I can't remember but that may have changed over time)

unsafe fn null_pointer<T: NativeType>() -> NonNull<T> {
NonNull::new_unchecked(ALIGNMENT as *mut T)
unsafe fn null_pointer() -> NonNull<u8> {
NonNull::new_unchecked(ALIGNMENT as *mut u8)
}

/// Allocates a cache-aligned memory region of `size` bytes with uninitialized values.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like previously the doc is not totally correct as size will be multiplied by size of T.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But actually I don't see any allocate_aligned usage with types other than u8 so far.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, I believe Jorge opted to fork off arrow2 before getting further with integrating this

arrow-buffer/src/bytes.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. It is very exciting to see this happening

I suggest we should run some benchmarks if we haven't already done so to ensure this doesn't cause performance regressions (due to different cross crate inlining rules or something else silly)

I have also added a note to #2665 that we will have to update the release process as we add some new crates

}

/// Allocates a cache-aligned memory region of `size` bytes with uninitialized values.
/// This is more performant than using [allocate_aligned_zeroed] when all bytes will have
/// an unknown or non-zero value and is semantically similar to `malloc`.
pub fn allocate_aligned<T: NativeType>(size: usize) -> NonNull<T> {
pub fn allocate_aligned(size: usize) -> NonNull<u8> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe it was also related to a time when there were fewer things that were ArrowNativeType (I can't remember but that may have changed over time)

third_offset_in_bits: usize,
fourth: &Buffer,
fourth_offset_in_bits: usize,
pub fn bitwise_quaternary_op_helper<F>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should profile with this change? It seems like it should not be any different given that buffers and offsets are sized (and thus the compiler can check and elide the bounds checks)

arrow-buffer/src/lib.rs Outdated Show resolved Hide resolved
}
Ok(Bitmap::from(buffer_bin_and(
&self.bits,
0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this does the same things as & (uses offset zero) but I wonder if that is correct?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same comment applies to the other operators as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BitMap currently doesn't have an offset, this does mean that practically this implementation is likely being used incorrectly in places, but I opted to just preserve the existing behaviour. I agree it is a bit of a footgun #1802

@alamb
Copy link
Contributor

alamb commented Sep 14, 2022

BTW I plan to test this change out against DataFusion

@alamb
Copy link
Contributor

alamb commented Sep 14, 2022

cd /Users/alamb/Software/arrow-datafusion && RUST_BACKTRACE=1 CARGO_TARGET_DIR=/Users/alamb/Software/df-target  nice cargo test --all
    Updating git repository `https://github.com/tustvold/arrow-rs.git`
    Updating git submodule `https://github.com/apache/parquet-testing.git`
    Updating git submodule `https://github.com/apache/arrow-testing`
    Updating crates.io index
   Compiling parquet v22.0.0 (https://github.com/tustvold/arrow-rs.git?rev=f47a878#f47a8784)
   Compiling arrow-buffer v22.0.0 (https://github.com/tustvold/arrow-rs.git?rev=f47a878#f47a8784)
   Compiling arrow-flight v22.0.0 (https://github.com/tustvold/arrow-rs.git?rev=f47a878#f47a8784)
   Compiling arrow v22.0.0 (https://github.com/tustvold/arrow-rs.git?rev=f47a878#f47a8784)
   Compiling fuzz-utils v0.1.0 (/Users/alamb/Software/arrow-datafusion/datafusion/core/fuzz-utils)
   Compiling datafusion-common v12.0.0 (/Users/alamb/Software/arrow-datafusion/datafusion/common)
   Compiling datafusion-expr v12.0.0 (/Users/alamb/Software/arrow-datafusion/datafusion/expr)
   Compiling datafusion-row v12.0.0 (/Users/alamb/Software/arrow-datafusion/datafusion/row)
   Compiling datafusion-physical-expr v12.0.0 (/Users/alamb/Software/arrow-datafusion/datafusion/physical-expr)
   Compiling datafusion-sql v12.0.0 (/Users/alamb/Software/arrow-datafusion/datafusion/sql)
   Compiling datafusion-jit v12.0.0 (/Users/alamb/Software/arrow-datafusion/datafusion/jit)
   Compiling datafusion-optimizer v12.0.0 (/Users/alamb/Software/arrow-datafusion/datafusion/optimizer)
   Compiling datafusion v12.0.0 (/Users/alamb/Software/arrow-datafusion/datafusion/core)
   Compiling datafusion-proto v12.0.0 (/Users/alamb/Software/arrow-datafusion/datafusion/proto)
   Compiling datafusion-benchmarks v12.0.0 (/Users/alamb/Software/arrow-datafusion/benchmarks)
   Compiling datafusion-examples v12.0.0 (/Users/alamb/Software/arrow-datafusion/datafusion-examples)

Seems to work well 👍

diff --git a/Cargo.toml b/Cargo.toml
index b5a7989d9..713c4844b 100644
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -37,6 +37,6 @@ lto = true
 
 # TEMP
 [patch.crates-io]
-arrow = {  git = "https://github.com/apache/arrow-rs.git", rev="51466634f11b7d965ca3c912835c91e0f84a6c92"}
-parquet = {  git = "https://github.com/apache/arrow-rs.git", rev="51466634f11b7d965ca3c912835c91e0f84a6c92" }
-arrow-flight = {  git = "https://github.com/apache/arrow-rs.git", rev="51466634f11b7d965ca3c912835c91e0f84a6c92" }
+arrow = {  git = "https://github.com/tustvold/arrow-rs.git", rev="f47a878"}
+parquet = {  git = "https://github.com/tustvold/arrow-rs.git", rev="f47a878" }
+arrow-flight = {  git = "https://github.com/tustvold/arrow-rs.git", rev="f47a878" }

@tustvold
Copy link
Contributor Author

eq Float32              time:   [44.005 µs 44.131 µs 44.276 µs]
                        change: [+137.97% +138.88% +139.85%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild

eq scalar Float32       time:   [40.772 µs 40.865 µs 40.967 µs]
                        change: [+419.00% +420.49% +422.13%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild

Sigh... Time to play LLVM vectorisation bingo again...

@alamb
Copy link
Contributor

alamb commented Sep 14, 2022

Maybe it is time to sprinkle some #[inline] to get cross crate inlining

@tustvold
Copy link
Contributor Author

tustvold commented Sep 14, 2022

If only it were that simple 😢

Edit: Plan to come back to this tomorrow morning, LLVM is not playing ball

@jhorstmann
Copy link
Contributor

FWIW, I'm very in favor of this splitting up into smaller crates. We use a small abstraction layer over arrow buffers or plain vectors and this could reduce the compilation times for that usecase significantly.

Regarding benchmarks, do we actually run these with LTO? My current understanding is that without LTO the inline annotation is required for cross-crate inlining, but with LTO enabled inlining would work in more cases. I think most applications that then use these crates would be compiled with LTO.

@tustvold
Copy link
Contributor Author

At least in theory LTO is only needed for cross-crate inlining of non-inline annotated functions, and even then generic functions should be inlined anyway as a consequence of monomorphisation.

The simple fix is just to move collect_bool into the comparison kernels, but I'm curious to understand what is going on as it may have implications down the line

@@ -395,15 +395,15 @@ impl MutableBuffer {
/// as it eliminates the conditional `Iterator::next`
#[inline]
pub fn collect_bool<F: FnMut(usize) -> bool>(len: usize, mut f: F) -> Self {
let mut buffer = Self::new(bit_util::ceil(len, 8));
let mut buffer = Self::new(bit_util::ceil(len, 64) * 8);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes it faster than it was before

eq Float32              time:   [9.2538 µs 9.2627 µs 9.2720 µs]
                        change: [-50.278% -50.193% -50.080%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

eq scalar Float32       time:   [7.2924 µs 7.2962 µs 7.3004 µs]
                        change: [-7.5896% -7.5335% -7.4762%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

neq Float32             time:   [9.1950 µs 9.1992 µs 9.2043 µs]
                        change: [-50.651% -50.607% -50.563%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  4 (4.00%) high mild
  9 (9.00%) high severe

neq scalar Float32      time:   [7.2907 µs 7.2949 µs 7.2997 µs]
                        change: [-7.3374% -7.2597% -7.1685%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) high mild
  3 (3.00%) high severe

lt Float32              time:   [9.2100 µs 9.2180 µs 9.2269 µs]
                        change: [-50.470% -50.359% -50.234%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

lt scalar Float32       time:   [7.2604 µs 7.2638 µs 7.2684 µs]
                        change: [-8.4085% -8.2278% -7.9735%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

lt_eq Float32           time:   [9.2350 µs 9.2410 µs 9.2472 µs]
                        change: [-50.679% -50.603% -50.530%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  8 (8.00%) high mild
  4 (4.00%) high severe

lt_eq scalar Float32    time:   [7.2869 µs 7.2906 µs 7.2946 µs]
                        change: [-7.8178% -7.7122% -7.6141%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

gt Float32              time:   [9.1731 µs 9.1783 µs 9.1844 µs]
                        change: [-50.586% -50.501% -50.367%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe

gt scalar Float32       time:   [7.2970 µs 7.2996 µs 7.3026 µs]
                        change: [-7.6238% -7.4831% -7.2758%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

gt_eq Float32           time:   [9.1925 µs 9.1962 µs 9.2004 µs]
                        change: [-50.274% -50.223% -50.176%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  8 (8.00%) high mild
  4 (4.00%) high severe

And also makes it less reliant on the inlining whims of LLVM. This PR does still represent an ~10% performance hit versus this change applied to master, but I can live with that. We are talking 1 microsecond 😅

@tustvold
Copy link
Contributor Author

Docs failure is related to rust-lang/rust#101844, it runs fine on an older nightly

@tustvold tustvold merged commit fb01656 into apache:master Sep 15, 2022
@ursabot
Copy link

ursabot commented Sep 15, 2022

Benchmark runs are scheduled for baseline = 7594db6 and contender = fb01656. fb01656 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api-change Changes to the arrow API arrow Changes to the arrow crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants