perf: reduce copies and skip zeroing memory we read into #1238

wjones127 · 2023-09-06T17:30:33Z

When reading primitive and binary data, we currently copy the data out of the buffer we read into. We can skip that copy if the buffer is already correctly aligned.

Also, for the local filesystem we spend a lot of time zeroing memory that we immediately overwrite. This PR changes the read path to allocate the buffer without zeroing it and write the data directly into it. A check is added to make sure we don't expose any uninitialized data in the output buffer.
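
To illustrate the "no zeroing, but check before exposing" idea, here is a minimal sketch using only safe standard-library calls. It is not the code in this PR: the helper name `read_exact_no_zeroing` is hypothetical, and it assumes the reader is something like `std::fs::File`, for which `read_to_end` can, as far as I know, fill the vector's spare capacity without zero-filling it first.

```rust
use std::io::{self, Read};

/// Hypothetical helper (not from this PR): read exactly `len` bytes into a
/// freshly allocated buffer without zero-filling the buffer up front.
fn read_exact_no_zeroing<R: Read>(reader: R, len: usize) -> io::Result<Vec<u8>> {
    // Reserves capacity only. The length stays 0, so no uninitialized
    // bytes are ever visible through `buf`.
    let mut buf = Vec::with_capacity(len);

    // `take` caps the read at `len` bytes; `read_to_end` appends into the
    // spare capacity and grows `buf.len()` only by what was actually read.
    let n = reader.take(len as u64).read_to_end(&mut buf)?;

    // The check: refuse to hand back a buffer that was only partly filled.
    if n != len {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            format!("expected {len} bytes, read {n}"),
        ));
    }
    Ok(buf)
}
```

A caller would pass an open file (or any other `Read`) plus the expected byte count. A short read surfaces as an `UnexpectedEof` error instead of a buffer with an unfilled tail.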

wjones127 · 2023-09-06T18:55:45Z

Benchmark results (using benchmarks in #1235):

------------------------------------------------------------------------------------- benchmark 'query_ann': 6 tests ------------------------------------------------------------------------------------
Name (time in ms)                              Min                Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_knn_search (NOW)                       6.0523 (1.0)       9.0947 (1.39)     6.3772 (1.01)     0.3069 (3.72)     6.3305 (1.00)     0.0881 (1.0)          6;11  156.8097 (0.99)        148           1
test_ivf_pq_index_search (NOW)              6.0839 (1.01)      6.5353 (1.0)      6.3159 (1.0)      0.0854 (1.03)     6.3201 (1.0)      0.1029 (1.17)         42;5  158.3313 (1.0)         155           1
test_flat_index_search (NOW)                6.1278 (1.01)      6.6472 (1.02)     6.3311 (1.00)     0.0826 (1.0)      6.3308 (1.00)     0.1036 (1.18)         38;3  157.9492 (1.00)        151           1
test_ivf_pq_index_search (0001_baselin)     8.9123 (1.47)      9.6842 (1.48)     9.2520 (1.46)     0.1316 (1.59)     9.2495 (1.46)     0.1609 (1.83)         27;2  108.0852 (0.68)        106           1
test_knn_search (0001_baselin)              8.9669 (1.48)     11.8117 (1.81)     9.2748 (1.47)     0.2820 (3.42)     9.2483 (1.46)     0.1804 (2.05)          5;2  107.8189 (0.68)        111           1
test_flat_index_search (0001_baselin)       9.0008 (1.49)      9.6396 (1.48)     9.2249 (1.46)     0.1062 (1.29)     9.2165 (1.46)     0.1418 (1.61)         33;2  108.4024 (0.68)        112           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------- benchmark 'scan_single_column': 10 tests --------------------------------------------------------------------------------------
Name (time in ms)                                     Min                 Max                Mean             StdDev              Median                IQR            Outliers      OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_scan_integer[i32] (NOW)                      18.9536 (1.0)       22.7854 (1.0)       19.5155 (1.0)       0.5818 (2.43)      19.4337 (1.0)       0.4006 (1.75)          2;2  51.2413 (1.0)          50           1
test_scan_integer[i32] (0001_baselin)             19.1967 (1.01)      24.4416 (1.07)      19.7049 (1.01)      0.7084 (2.96)      19.5683 (1.01)      0.2842 (1.24)          1;1  50.7488 (0.99)         51           1
test_scan_integer[f64] (NOW)                      21.9462 (1.16)      22.8395 (1.00)      22.3748 (1.15)      0.2393 (1.0)       22.3879 (1.15)      0.2908 (1.27)         18;0  44.6931 (0.87)         46           1
test_scan_integer[dictionary] (NOW)               23.6696 (1.25)      37.1745 (1.63)      25.6355 (1.31)      4.6738 (19.53)     23.8518 (1.23)      0.7093 (3.11)          1;1  39.0083 (0.76)          8           1
test_scan_integer[dictionary] (0001_baselin)      23.7726 (1.25)      25.3193 (1.11)      24.6085 (1.26)      0.3772 (1.58)      24.5648 (1.26)      0.4285 (1.88)         12;2  40.6363 (0.79)         41           1
test_scan_integer[string] (0001_baselin)          26.2319 (1.38)      31.7903 (1.40)      26.9231 (1.38)      0.9194 (3.84)      26.7149 (1.37)      0.2283 (1.0)           1;4  37.1428 (0.72)         33           1
test_scan_integer[string] (NOW)                   26.2530 (1.39)      28.0359 (1.23)      26.8354 (1.38)      0.3446 (1.44)      26.8405 (1.38)      0.4436 (1.94)          7;1  37.2642 (0.73)         32           1
test_scan_integer[f64] (0001_baselin)             26.6362 (1.41)      28.7058 (1.26)      27.2493 (1.40)      0.3979 (1.66)      27.1836 (1.40)      0.4645 (2.03)          8;2  36.6982 (0.72)         42           1
test_scan_integer[vector] (NOW)                   73.8875 (3.90)      83.7876 (3.68)      78.4069 (4.02)      3.5699 (14.92)     77.5117 (3.99)      6.4593 (28.29)         4;0  12.7540 (0.25)         10           1
test_scan_integer[vector] (0001_baselin)         101.1503 (5.34)     147.7436 (6.48)     118.7385 (6.08)     17.7679 (74.24)    116.6773 (6.00)     20.1421 (88.21)         1;0   8.4219 (0.16)          5           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------- benchmark 'scan_table': 2 tests -----------------------------------------------------------------------------------
Name (time in ms)                       Min                 Max                Mean             StdDev              Median                IQR            Outliers     OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_scan_table (NOW)              104.1456 (1.0)      115.6711 (1.0)      107.6676 (1.0)       3.6844 (1.0)      106.3717 (1.0)       3.6442 (1.0)           1;1  9.2878 (1.0)           8           1
test_scan_table (0001_baselin)     140.6907 (1.35)     170.1554 (1.47)     152.3343 (1.41)     12.5070 (3.39)     148.7810 (1.40)     22.0398 (6.05)          2;0  6.5645 (0.71)          7           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

changhiskhan · 2023-09-06T19:10:14Z

rust/src/encodings/binary.rs

@@ -230,9 +231,22 @@ impl<'a, T: ByteArrayType> BinaryDecoder<'a, T> {
                .null_bit_buffer(Some(null_buf.into()));
        }

+        // TODO: replace this with safe method once arrow-rs 47.0.0 comes out.


When is 47.0.0 coming out?

wjones127 · 2023-09-06

Maybe a week or two? And then we have to wait another two weeks for the next DataFusion release cycle, so I think we can do this in about four weeks.

changhiskhan · 2023-09-06T19:12:06Z

rust/src/encodings/binary.rs

+        // Zero-copy conversion from bytes
+        // Safety: the bytes are owned by the `data` value, so the pointer
+        // will be valid for the lifetime of the Arc we are passing in.
+        let buf = unsafe {


We don't need the alignment check here because this is already byte_width aligned?
Also, technically bytes can be empty, right? Would that cause any issues with NonNull::new(bytes.as_ptr() as _).unwrap()?

wjones127 · 2023-09-06

Yeah, for bytes I think we just need it to be aligned for u8, which is true of any allocation.

eddyxu · 2023-09-06

Our SIMD code uses unaligned loads now, so the buffers don't need to be word-aligned (i.e., 8 bytes). Misaligned loads add about 10% overhead in the SIMD computation, and much less in overall end-to-end latency.

wjones127 · 2023-09-06

Yeah, I align the primitive buffers because arrow-rs errors otherwise.
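
The "skip the copy only when the buffer is aligned" rule discussed in this thread can be sketched as follows. This is an illustration, not the PR's implementation: `bytes_as_i32s` is a hypothetical name, `i32` stands in for any primitive type, and it uses the standard library's `slice::align_to` instead of arrow-rs buffers.

```rust
use std::borrow::Cow;

/// Hypothetical helper (not from this PR): view `bytes` as `i32` values,
/// borrowing when the slice is suitably aligned and copying otherwise.
fn bytes_as_i32s(bytes: &[u8]) -> Cow<'_, [i32]> {
    // SAFETY: every bit pattern is a valid `i32`, and `align_to` only
    // places correctly aligned, whole elements in the middle slice.
    let (prefix, aligned, suffix) = unsafe { bytes.align_to::<i32>() };

    if prefix.is_empty() && suffix.is_empty() {
        // The whole input is aligned and a multiple of 4 bytes: zero-copy.
        Cow::Borrowed(aligned)
    } else {
        // Misaligned (or ragged) input: fall back to copying.
        // Any trailing bytes that do not fill a whole `i32` are ignored.
        Cow::Owned(
            bytes
                .chunks_exact(4)
                .map(|c| i32::from_ne_bytes([c[0], c[1], c[2], c[3]]))
                .collect(),
        )
    }
}
```

Both paths return the same values, so callers do not need to know which one was taken. For plain `u8` data no such check is needed, since every byte address is aligned for `u8`.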

perf: reduce copies and skip zeroing memory we read into

9bc423e

wjones127 marked this pull request as ready for review September 6, 2023 19:00

wjones127 requested a review from eddyxu September 6, 2023 19:08

changhiskhan reviewed Sep 6, 2023

eddyxu approved these changes Sep 6, 2023

wjones127 merged commit 6e03add into main Sep 6, 2023

wjones127 deleted the wjones127/reduce-copies branch September 6, 2023 19:35
