perf: reduce copies and skip zeroing memory we read into #1238

Merged: 1 commit, Sep 6, 2023

@wjones127 commented Sep 6, 2023

When reading primitive and binary data, we currently copy the bytes into a new buffer. We can skip that copy when the buffer we read into is already correctly aligned.
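As an illustration of the aligned fast path, here is a minimal sketch using only the standard library. It is not the code from this PR: the function name `bytes_as_u32s` is hypothetical, and the element type is fixed to `u32` to keep the safety argument short. It borrows the input when the pointer is already aligned and falls back to a copy otherwise.

```rust
use std::borrow::Cow;
use std::mem::{align_of, size_of};

/// Hypothetical helper: view `bytes` as native-endian `u32` values,
/// borrowing when the buffer is already aligned and copying otherwise.
fn bytes_as_u32s(bytes: &[u8]) -> Cow<'_, [u32]> {
    assert_eq!(
        bytes.len() % size_of::<u32>(),
        0,
        "length must be a multiple of 4"
    );
    let len = bytes.len() / size_of::<u32>();

    if bytes.as_ptr() as usize % align_of::<u32>() == 0 {
        // Fast path: the pointer is aligned for u32, so reinterpret in place.
        // SAFETY: the pointer is non-null and aligned, `len` u32 values fit
        // inside `bytes`, every bit pattern is a valid u32, and the returned
        // borrow has the same lifetime as `bytes`.
        let values = unsafe { std::slice::from_raw_parts(bytes.as_ptr() as *const u32, len) };
        Cow::Borrowed(values)
    } else {
        // Slow path: misaligned input, so decode into a freshly allocated Vec.
        let values = bytes
            .chunks_exact(size_of::<u32>())
            .map(|c| u32::from_ne_bytes([c[0], c[1], c[2], c[3]]))
            .collect();
        Cow::Owned(values)
    }
}
```

Returning a `Cow` lets callers treat both paths uniformly while only paying for an allocation on the misaligned path.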

Also, for the local filesystem we spend a lot of time zeroing memory that we immediately overwrite. This PR changes the read path to allocate the buffer without zeroing it and then write the data directly into it. A check is added to make sure we never expose uninitialized data in the output buffer.
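The idea of skipping the zero-fill while still guarding against short reads can be sketched with safe standard-library calls. This is an assumption-laden illustration, not the PR's implementation: `read_exact_no_zeroing` is a hypothetical name, and whether the standard library avoids initializing the spare capacity internally depends on the reader type.

```rust
use std::io::{self, Read};

/// Hypothetical helper: read exactly `len` bytes without first filling
/// the destination with zeros (as `vec![0u8; len]` would).
fn read_exact_no_zeroing<R: Read>(reader: R, len: usize) -> io::Result<Vec<u8>> {
    // `with_capacity` reserves memory without writing to it. The vector's
    // length stays 0, so no uninitialized byte is observable through it.
    let mut buf = Vec::with_capacity(len);

    // `take` caps the read at `len` bytes; `read_to_end` appends into the
    // spare capacity and extends the length only by the bytes actually read.
    let n = reader.take(len as u64).read_to_end(&mut buf)?;

    // The check: a short read must become an error instead of a buffer
    // that claims more data than was written.
    if n != len {
        return Err(io::Error::new(
            io::ErrorKind::UnexpectedEof,
            format!("expected {} bytes, got {}", len, n),
        ));
    }
    Ok(buf)
}
```

Any `Read` implementor works here, for example a `std::fs::File` positioned at the start of the range to read.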

@wjones127
Contributor Author

Benchmark results (using benchmarks in #1235):

------------------------------------------------------------------------------------- benchmark 'query_ann': 6 tests ------------------------------------------------------------------------------------
Name (time in ms)                              Min                Max              Mean            StdDev            Median               IQR            Outliers       OPS            Rounds  Iterations
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_knn_search (NOW)                       6.0523 (1.0)       9.0947 (1.39)     6.3772 (1.01)     0.3069 (3.72)     6.3305 (1.00)     0.0881 (1.0)          6;11  156.8097 (0.99)        148           1
test_ivf_pq_index_search (NOW)              6.0839 (1.01)      6.5353 (1.0)      6.3159 (1.0)      0.0854 (1.03)     6.3201 (1.0)      0.1029 (1.17)         42;5  158.3313 (1.0)         155           1
test_flat_index_search (NOW)                6.1278 (1.01)      6.6472 (1.02)     6.3311 (1.00)     0.0826 (1.0)      6.3308 (1.00)     0.1036 (1.18)         38;3  157.9492 (1.00)        151           1
test_ivf_pq_index_search (0001_baselin)     8.9123 (1.47)      9.6842 (1.48)     9.2520 (1.46)     0.1316 (1.59)     9.2495 (1.46)     0.1609 (1.83)         27;2  108.0852 (0.68)        106           1
test_knn_search (0001_baselin)              8.9669 (1.48)     11.8117 (1.81)     9.2748 (1.47)     0.2820 (3.42)     9.2483 (1.46)     0.1804 (2.05)          5;2  107.8189 (0.68)        111           1
test_flat_index_search (0001_baselin)       9.0008 (1.49)      9.6396 (1.48)     9.2249 (1.46)     0.1062 (1.29)     9.2165 (1.46)     0.1418 (1.61)         33;2  108.4024 (0.68)        112           1
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------- benchmark 'scan_single_column': 10 tests --------------------------------------------------------------------------------------
Name (time in ms)                                     Min                 Max                Mean             StdDev              Median                IQR            Outliers      OPS            Rounds  Iterations
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_scan_integer[i32] (NOW)                      18.9536 (1.0)       22.7854 (1.0)       19.5155 (1.0)       0.5818 (2.43)      19.4337 (1.0)       0.4006 (1.75)          2;2  51.2413 (1.0)          50           1
test_scan_integer[i32] (0001_baselin)             19.1967 (1.01)      24.4416 (1.07)      19.7049 (1.01)      0.7084 (2.96)      19.5683 (1.01)      0.2842 (1.24)          1;1  50.7488 (0.99)         51           1
test_scan_integer[f64] (NOW)                      21.9462 (1.16)      22.8395 (1.00)      22.3748 (1.15)      0.2393 (1.0)       22.3879 (1.15)      0.2908 (1.27)         18;0  44.6931 (0.87)         46           1
test_scan_integer[dictionary] (NOW)               23.6696 (1.25)      37.1745 (1.63)      25.6355 (1.31)      4.6738 (19.53)     23.8518 (1.23)      0.7093 (3.11)          1;1  39.0083 (0.76)          8           1
test_scan_integer[dictionary] (0001_baselin)      23.7726 (1.25)      25.3193 (1.11)      24.6085 (1.26)      0.3772 (1.58)      24.5648 (1.26)      0.4285 (1.88)         12;2  40.6363 (0.79)         41           1
test_scan_integer[string] (0001_baselin)          26.2319 (1.38)      31.7903 (1.40)      26.9231 (1.38)      0.9194 (3.84)      26.7149 (1.37)      0.2283 (1.0)           1;4  37.1428 (0.72)         33           1
test_scan_integer[string] (NOW)                   26.2530 (1.39)      28.0359 (1.23)      26.8354 (1.38)      0.3446 (1.44)      26.8405 (1.38)      0.4436 (1.94)          7;1  37.2642 (0.73)         32           1
test_scan_integer[f64] (0001_baselin)             26.6362 (1.41)      28.7058 (1.26)      27.2493 (1.40)      0.3979 (1.66)      27.1836 (1.40)      0.4645 (2.03)          8;2  36.6982 (0.72)         42           1
test_scan_integer[vector] (NOW)                   73.8875 (3.90)      83.7876 (3.68)      78.4069 (4.02)      3.5699 (14.92)     77.5117 (3.99)      6.4593 (28.29)         4;0  12.7540 (0.25)         10           1
test_scan_integer[vector] (0001_baselin)         101.1503 (5.34)     147.7436 (6.48)     118.7385 (6.08)     17.7679 (74.24)    116.6773 (6.00)     20.1421 (88.21)         1;0   8.4219 (0.16)          5           1
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

----------------------------------------------------------------------------------- benchmark 'scan_table': 2 tests -----------------------------------------------------------------------------------
Name (time in ms)                       Min                 Max                Mean             StdDev              Median                IQR            Outliers     OPS            Rounds  Iterations
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
test_scan_table (NOW)              104.1456 (1.0)      115.6711 (1.0)      107.6676 (1.0)       3.6844 (1.0)      106.3717 (1.0)       3.6442 (1.0)           1;1  9.2878 (1.0)           8           1
test_scan_table (0001_baselin)     140.6907 (1.35)     170.1554 (1.47)     152.3343 (1.41)     12.5070 (3.39)     148.7810 (1.40)     22.0398 (6.05)          2;0  6.5645 (0.71)          7           1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

@wjones127 marked this pull request as ready for review September 6, 2023 19:00
@wjones127 requested a review from eddyxu September 6, 2023 19:08
@@ -230,9 +231,22 @@ impl<'a, T: ByteArrayType> BinaryDecoder<'a, T> {
.null_bit_buffer(Some(null_buf.into()));
}

// TODO: replace this with safe method once arrow-rs 47.0.0 comes out.
Contributor

when is 47.0.0 coming out?

Contributor Author

Maybe a week or two? And then we have to wait another two weeks for the next Datafusion release cycle. So we can do this in like 4 weeks I think.

// Zero-copy conversion from bytes
// Safety: the bytes are owned by the `data` value, so the pointer
// will be valid for the lifetime of the Arc we are passing in.
let buf = unsafe {
Contributor

We don't need the alignment check here because this is already byte_width aligned?
Also, technically bytes can be empty, right? Would that cause any issues with NonNull::new(bytes.as_ptr() as _).unwrap()?

Contributor Author

Yeah, for bytes I think we just need it to be aligned with u8, which is true of any allocation.
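On the empty-slice question, a small sketch of the relevant language guarantee may help. A Rust slice's data pointer is never null, even when the slice has length 0 (it may be dangling, but it is non-null and aligned), so `NonNull::new(..).unwrap()` on `as_ptr()` does not panic. The helper name `non_null_start` is hypothetical, and this says nothing about how any downstream buffer constructor treats a dangling pointer with length 0.

```rust
use std::mem::align_of;
use std::ptr::NonNull;

/// Hypothetical helper mirroring the pattern from the review comment.
fn non_null_start(bytes: &[u8]) -> NonNull<u8> {
    // Slice pointers are non-null by construction, including for empty
    // slices, so this `expect` cannot fail.
    NonNull::new(bytes.as_ptr() as *mut u8).expect("slice pointers are never null")
}

/// `u8` has an alignment of 1, so every address is aligned for it.
fn u8_alignment() -> usize {
    align_of::<u8>()
}
```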

Contributor

Our SIMD code uses unaligned loads now, so the data does not need to be word-aligned (i.e., to 8 bytes). Misaligned loads add about 10% overhead in the SIMD computation, and much less in overall end-to-end latency.
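For readers unfamiliar with unaligned loads, here is a scalar sketch of the same idea using `read_unaligned` from the standard library. It is only an analogue of what SIMD intrinsics such as `_mm256_loadu_ps` do, not the project's SIMD code, and `sum_f32_unaligned` is a hypothetical name.

```rust
use std::mem::size_of;

/// Hypothetical helper: sum the native-endian `f32` values stored in
/// `bytes`, which may start at any address.
fn sum_f32_unaligned(bytes: &[u8]) -> f32 {
    assert_eq!(
        bytes.len() % size_of::<f32>(),
        0,
        "length must be a multiple of 4"
    );
    let n = bytes.len() / size_of::<f32>();
    let base = bytes.as_ptr() as *const f32;

    let mut total = 0.0f32;
    for i in 0..n {
        // SAFETY: `i < n`, so the four bytes at `base + i` lie inside
        // `bytes`; `read_unaligned` has no alignment requirement; and any
        // bit pattern is a valid f32.
        total += unsafe { base.add(i).read_unaligned() };
    }
    total
}
```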

Contributor Author

Yeah I align the primitive buffers because arrow-rs errors otherwise

@wjones127 merged commit 6e03add into main Sep 6, 2023
@wjones127 deleted the wjones127/reduce-copies branch September 6, 2023 19:35
3 participants