ARROW-12411: [Rust] Create RecordBatches from Iterators #7

Merged
alamb merged 1 commit into apache:master from alamb/record_batch_from_iter
Apr 27, 2021

Conversation

alamb
Contributor

@alamb alamb commented Apr 19, 2021

(Here is a PR to the new arrow crate to see how this process is working!)
Closes #210

Rationale / Use case:

While writing tests (both in IOx and in DataFusion) where I need a single RecordBatch, I often find myself doing something like this (copied directly from IOx source code):

let schema = Arc::new(Schema::new(vec![
    ArrowField::new("float_field", ArrowDataType::Float64, true),
    ArrowField::new("time", ArrowDataType::Int64, true),
]));

let float_array: ArrayRef = Arc::new(Float64Array::from(vec![10.1, 20.1, 30.1, 40.1]));
let timestamp_array: ArrayRef = Arc::new(Int64Array::from(vec![1000, 2000, 3000, 4000]));

let batch = RecordBatch::try_new(schema, vec![float_array, timestamp_array])
    .expect("created new record batch");

This is annoying because I have to redundantly (and verbosely) encode the information that float_field is a Float64 in both the Schema and the Float64Array.

I would much rather construct RecordBatches in a more Rust-like style that avoids the redundancy and reduces the amount of typing.

Proposed Change

As suggested in the comments from @returnString, @nevi-me, and @jorgecarleitao on the draft PR apache/arrow#10063, add try_from_iter and try_iter_with_null functions:

let record_batch = RecordBatch::try_from_iter(vec![
  ("a", Arc::new(a) as ArrayRef),
  ("b", Arc::new(b) as ArrayRef)
]).expect("valid conversion");
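To make the idea concrete, here is a minimal sketch of how a try_from_iter-style constructor can derive the schema from the arrays themselves, so the type of each field is stated only once. It deliberately does not use the arrow crate: `Column`, `DataType`, `Field`, `Batch`, and `BatchError` are hypothetical stand-ins for illustration. The rules that a field is nullable only when its column contains a null, and that an empty input is an error, are assumptions of this sketch rather than something stated in this PR.

```rust
/// Hypothetical stand-in for an Arrow array: a typed column with optional slots.
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Int64(Vec<Option<i64>>),
    Float64(Vec<Option<f64>>),
}

/// Hypothetical stand-in for Arrow's data type enum.
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType {
    Int64,
    Float64,
}

impl Column {
    fn data_type(&self) -> DataType {
        match self {
            Column::Int64(_) => DataType::Int64,
            Column::Float64(_) => DataType::Float64,
        }
    }

    fn len(&self) -> usize {
        match self {
            Column::Int64(v) => v.len(),
            Column::Float64(v) => v.len(),
        }
    }

    fn null_count(&self) -> usize {
        match self {
            Column::Int64(v) => v.iter().filter(|x| x.is_none()).count(),
            Column::Float64(v) => v.iter().filter(|x| x.is_none()).count(),
        }
    }
}

/// Hypothetical stand-in for a schema field.
#[derive(Debug, Clone, PartialEq)]
struct Field {
    name: String,
    data_type: DataType,
    nullable: bool,
}

/// Hypothetical stand-in for a record batch: fields plus their columns.
#[derive(Debug, PartialEq)]
struct Batch {
    fields: Vec<Field>,
    columns: Vec<Column>,
}

#[derive(Debug, PartialEq)]
enum BatchError {
    Empty,
    LengthMismatch { column: String, expected: usize, found: usize },
}

impl Batch {
    /// Builds a batch from `(name, column)` pairs, inferring each field's
    /// type from its column instead of requiring a separate schema.
    fn try_from_iter<I, F>(value: I) -> Result<Self, BatchError>
    where
        I: IntoIterator<Item = (F, Column)>,
        F: AsRef<str>,
    {
        let mut fields = Vec::new();
        let mut columns = Vec::new();
        let mut expected_len: Option<usize> = None;

        for (name, column) in value {
            // The first column fixes the row count; later ones must match it.
            let found = column.len();
            let expected = *expected_len.get_or_insert(found);
            if found != expected {
                return Err(BatchError::LengthMismatch {
                    column: name.as_ref().to_string(),
                    expected,
                    found,
                });
            }
            fields.push(Field {
                name: name.as_ref().to_string(),
                data_type: column.data_type(),
                // Assumption of this sketch: nullable only if a null is present.
                nullable: column.null_count() > 0,
            });
            columns.push(column);
        }

        if columns.is_empty() {
            return Err(BatchError::Empty);
        }
        Ok(Batch { fields, columns })
    }
}
```

With this shape, a test only has to name each column once, for example `Batch::try_from_iter(vec![("time", Column::Int64(vec![Some(1000), Some(2000)]))])`, and the field type follows from the column.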

TryFrom Implementation

Note: I would really like to add a TryFrom implementation so I could write:

let record_batch: RecordBatch = vec![
  ("a", Arc::new(a) as ArrayRef),
  ("b", Arc::new(b) as ArrayRef)
].try_into().expect("valid conversion");

However, when I tried to do so (with the following):

// Does not compile: this overlaps with the blanket TryFrom impl in `core`.
impl<I, F> TryFrom<I> for RecordBatch
where
    I: IntoIterator<Item = (F, ArrayRef)>,
    F: AsRef<str>,
{
    type Error = ArrowError;

    fn try_from(value: I) -> std::result::Result<Self, Self::Error> {
        Self::try_from_iter(value)
    }
}

I got the following compiler error:

     = note: conflicting implementation in crate `core`:
            - impl<T, U> TryFrom<U> for T
              where U: Into<T>;

This appears to be a limitation of the Rust type system / compiler; see rust-lang/rust#50133. Any help or suggestions from reviewers would be most appreciated.
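One possible workaround, sketched here under assumptions rather than taken from this PR, is to implement TryFrom for a concrete container such as Vec instead of for every IntoIterator. The generic version overlaps with the blanket impl because some other crate could define a type that is both an IntoIterator and Into<RecordBatch>. With a concrete Vec, only the crate that owns the target type could ever write the matching From impl, so the compiler can rule the overlap out. `Batch`, `Column`, `BatchError`, and `build_example` below are hypothetical stand-ins so that the example compiles without the arrow crate.

```rust
use std::convert::{TryFrom, TryInto};

/// Hypothetical stand-in for `ArrayRef`.
type Column = Vec<i64>;

/// Hypothetical stand-in for `RecordBatch`.
#[derive(Debug, PartialEq)]
struct Batch {
    names: Vec<String>,
    columns: Vec<Column>,
}

/// Hypothetical stand-in for `ArrowError`.
#[derive(Debug, PartialEq)]
struct BatchError(String);

impl Batch {
    /// Generic constructor: accepts any iterator of `(name, column)` pairs.
    fn try_from_iter<I, F>(value: I) -> Result<Self, BatchError>
    where
        I: IntoIterator<Item = (F, Column)>,
        F: AsRef<str>,
    {
        let mut names = Vec::new();
        let mut columns: Vec<Column> = Vec::new();
        for (name, column) in value {
            // Every column must have the same length as the first one.
            if let Some(first) = columns.first() {
                if first.len() != column.len() {
                    return Err(BatchError(format!(
                        "column '{}' has length {}, expected {}",
                        name.as_ref(),
                        column.len(),
                        first.len()
                    )));
                }
            }
            names.push(name.as_ref().to_string());
            columns.push(column);
        }
        Ok(Batch { names, columns })
    }
}

// TryFrom for one concrete container type. Unlike the fully generic
// `impl<I, F> TryFrom<I>`, this does not overlap with the blanket impl.
impl<F: AsRef<str>> TryFrom<Vec<(F, Column)>> for Batch {
    type Error = BatchError;

    fn try_from(value: Vec<(F, Column)>) -> Result<Self, Self::Error> {
        Batch::try_from_iter(value)
    }
}

/// Usage: the `try_into()` spelling from the description above.
fn build_example() -> Result<Batch, BatchError> {
    let a: Column = vec![1, 2];
    let b: Column = vec![3, 4];
    vec![("a", a), ("b", b)].try_into()
}
```

The trade-off is that each supported container (Vec, arrays, slices) needs its own impl, while try_from_iter stays available for arbitrary iterators.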

@codecov-commenter

codecov-commenter commented Apr 19, 2021

Codecov Report

Merging #7 (3933a28) into master (3f13806) will increase coverage by 0.01%.
The diff coverage is 81.34%.

❗ Current head 3933a28 differs from pull request most recent head f235bec. Consider uploading reports for the commit f235bec to get more accurate results

@@            Coverage Diff             @@
##           master       #7      +/-   ##
==========================================
+ Coverage   82.47%   82.49%   +0.01%     
==========================================
  Files         162      162              
  Lines       43414    43454      +40     
==========================================
+ Hits        35805    35846      +41     
+ Misses       7609     7608       -1     
Impacted Files Coverage Δ
arrow-flight/examples/server.rs 0.00% <0.00%> (ø)
arrow-flight/src/arrow.flight.protocol.rs 0.00% <0.00%> (ø)
arrow-flight/src/utils.rs 0.00% <0.00%> (ø)
arrow-pyarrow-integration-testing/src/lib.rs 0.00% <0.00%> (ø)
arrow/src/alloc/types.rs 0.00% <0.00%> (ø)
arrow/src/array/builder.rs 85.29% <ø> (ø)
arrow/src/array/iterator.rs 95.80% <ø> (ø)
arrow/src/array/null.rs 86.66% <ø> (ø)
arrow/src/array/ord.rs 59.28% <ø> (ø)
arrow/src/array/raw_pointer.rs 100.00% <ø> (ø)
... and 94 more

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5b3298a...f235bec. Read the comment docs.

@jorgecarleitao jorgecarleitao added the arrow Changes to the arrow crate label Apr 20, 2021
@alamb alamb force-pushed the alamb/record_batch_from_iter branch from 5d3c337 to f235bec Compare April 20, 2021 10:05
@alamb
Contributor Author

alamb commented Apr 20, 2021

I rewrote master to remove the history of the other languages (and 50MB of history). This means this PR needs to be "rebased" against the current master.

The best way I found to do this was to find the relevant commits (via git log) and then cherry-pick them one at a time. Specifically, for this PR:

git reset --hard apache/master
git cherry-pick 5d3c337710b89871a46f6e623bad8c9e05419ed2
git push -f alamb

alamb pushed a commit that referenced this pull request Apr 20, 2021
This PR enables tests for `ARROW_COMPUTE`, `ARROW_DATASET`, `ARROW_FILESYSTEM`, `ARROW_HDFS`, `ARROW_ORC`, and `ARROW_IPC` (default on). #7131 enabled a minimal set of tests as a starting point.

I confirmed that these tests pass locally with the current master. In the current TravisCI environment, we cannot see this result due to a lot of error messages in `arrow-utility-test`.

```
$ git log | head -1
commit e2e6db2
% ctest
...
      Start  1: arrow-array-test
 1/51 Test  #1: arrow-array-test .....................   Passed    4.62 sec
      Start  2: arrow-buffer-test
 2/51 Test  #2: arrow-buffer-test ....................   Passed    0.14 sec
      Start  3: arrow-extension-type-test
 3/51 Test  #3: arrow-extension-type-test ............   Passed    0.12 sec
      Start  4: arrow-misc-test
 4/51 Test  #4: arrow-misc-test ......................   Passed    0.14 sec
      Start  5: arrow-public-api-test
 5/51 Test  #5: arrow-public-api-test ................   Passed    0.12 sec
      Start  6: arrow-scalar-test
 6/51 Test  #6: arrow-scalar-test ....................   Passed    0.13 sec
      Start  7: arrow-type-test
 7/51 Test  #7: arrow-type-test ......................   Passed    0.14 sec
      Start  8: arrow-table-test
 8/51 Test  #8: arrow-table-test .....................   Passed    0.13 sec
      Start  9: arrow-tensor-test
 9/51 Test  #9: arrow-tensor-test ....................   Passed    0.13 sec
      Start 10: arrow-sparse-tensor-test
10/51 Test #10: arrow-sparse-tensor-test .............   Passed    0.16 sec
      Start 11: arrow-stl-test
11/51 Test #11: arrow-stl-test .......................   Passed    0.12 sec
      Start 12: arrow-concatenate-test
12/51 Test #12: arrow-concatenate-test ...............   Passed    0.53 sec
      Start 13: arrow-diff-test
13/51 Test #13: arrow-diff-test ......................   Passed    1.45 sec
      Start 14: arrow-c-bridge-test
14/51 Test #14: arrow-c-bridge-test ..................   Passed    0.18 sec
      Start 15: arrow-io-buffered-test
15/51 Test #15: arrow-io-buffered-test ...............   Passed    0.20 sec
      Start 16: arrow-io-compressed-test
16/51 Test #16: arrow-io-compressed-test .............   Passed    3.48 sec
      Start 17: arrow-io-file-test
17/51 Test #17: arrow-io-file-test ...................   Passed    0.74 sec
      Start 18: arrow-io-hdfs-test
18/51 Test #18: arrow-io-hdfs-test ...................   Passed    0.12 sec
      Start 19: arrow-io-memory-test
19/51 Test #19: arrow-io-memory-test .................   Passed    2.77 sec
      Start 20: arrow-utility-test
20/51 Test #20: arrow-utility-test ...................***Failed    5.65 sec
      Start 21: arrow-threading-utility-test
21/51 Test #21: arrow-threading-utility-test .........   Passed    1.34 sec
      Start 22: arrow-compute-compute-test
22/51 Test #22: arrow-compute-compute-test ...........   Passed    0.13 sec
      Start 23: arrow-compute-boolean-test
23/51 Test #23: arrow-compute-boolean-test ...........   Passed    0.15 sec
      Start 24: arrow-compute-cast-test
24/51 Test #24: arrow-compute-cast-test ..............   Passed    0.22 sec
      Start 25: arrow-compute-hash-test
25/51 Test #25: arrow-compute-hash-test ..............   Passed    2.61 sec
      Start 26: arrow-compute-isin-test
26/51 Test #26: arrow-compute-isin-test ..............   Passed    0.81 sec
      Start 27: arrow-compute-match-test
27/51 Test #27: arrow-compute-match-test .............   Passed    0.40 sec
      Start 28: arrow-compute-sort-to-indices-test
28/51 Test #28: arrow-compute-sort-to-indices-test ...   Passed    3.33 sec
      Start 29: arrow-compute-nth-to-indices-test
29/51 Test #29: arrow-compute-nth-to-indices-test ....   Passed    1.51 sec
      Start 30: arrow-compute-util-internal-test
30/51 Test #30: arrow-compute-util-internal-test .....   Passed    0.13 sec
      Start 31: arrow-compute-add-test
31/51 Test #31: arrow-compute-add-test ...............   Passed    0.12 sec
      Start 32: arrow-compute-aggregate-test
32/51 Test #32: arrow-compute-aggregate-test .........   Passed   14.70 sec
      Start 33: arrow-compute-compare-test
33/51 Test #33: arrow-compute-compare-test ...........   Passed    7.96 sec
      Start 34: arrow-compute-take-test
34/51 Test #34: arrow-compute-take-test ..............   Passed    4.80 sec
      Start 35: arrow-compute-filter-test
35/51 Test #35: arrow-compute-filter-test ............   Passed    8.23 sec
      Start 36: arrow-dataset-dataset-test
36/51 Test #36: arrow-dataset-dataset-test ...........   Passed    0.25 sec
      Start 37: arrow-dataset-discovery-test
37/51 Test #37: arrow-dataset-discovery-test .........   Passed    0.13 sec
      Start 38: arrow-dataset-file-ipc-test
38/51 Test #38: arrow-dataset-file-ipc-test ..........   Passed    0.21 sec
      Start 39: arrow-dataset-file-test
39/51 Test #39: arrow-dataset-file-test ..............   Passed    0.12 sec
      Start 40: arrow-dataset-filter-test
40/51 Test #40: arrow-dataset-filter-test ............   Passed    0.16 sec
      Start 41: arrow-dataset-partition-test
41/51 Test #41: arrow-dataset-partition-test .........   Passed    0.13 sec
      Start 42: arrow-dataset-scanner-test
42/51 Test #42: arrow-dataset-scanner-test ...........   Passed    0.20 sec
      Start 43: arrow-filesystem-test
43/51 Test #43: arrow-filesystem-test ................   Passed    1.62 sec
      Start 44: arrow-hdfs-test
44/51 Test #44: arrow-hdfs-test ......................   Passed    0.13 sec
      Start 45: arrow-feather-test
45/51 Test #45: arrow-feather-test ...................   Passed    0.91 sec
      Start 46: arrow-ipc-read-write-test
46/51 Test #46: arrow-ipc-read-write-test ............   Passed    5.77 sec
      Start 47: arrow-ipc-json-simple-test
47/51 Test #47: arrow-ipc-json-simple-test ...........   Passed    0.16 sec
      Start 48: arrow-ipc-json-test
48/51 Test #48: arrow-ipc-json-test ..................   Passed    0.27 sec
      Start 49: arrow-json-integration-test
49/51 Test #49: arrow-json-integration-test ..........   Passed    0.13 sec
      Start 50: arrow-json-test
50/51 Test #50: arrow-json-test ......................   Passed    0.26 sec
      Start 51: arrow-orc-adapter-test
51/51 Test #51: arrow-orc-adapter-test ...............   Passed    1.92 sec

98% tests passed, 1 tests failed out of 51

Label Time Summary:
arrow-tests      =  27.38 sec (27 tests)
arrow_compute    =  45.11 sec (14 tests)
arrow_dataset    =   1.21 sec (7 tests)
arrow_ipc        =   6.20 sec (3 tests)
unittest         =  79.91 sec (51 tests)

Total Test time (real) =  79.99 sec

The following tests FAILED:
	 20 - arrow-utility-test (Failed)
Errors while running CTest
```

Closes #7142 from kiszk/ARROW-8754

Authored-by: Kazuaki Ishizaki <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
@alamb alamb force-pushed the alamb/record_batch_from_iter branch from f235bec to 00c95ff Compare April 20, 2021 10:21
@jorgecarleitao
Member

We had to perform a small re-write of master. The commits may look a bit odd, but it should not cause conflicts. Could you kindly rebase this against the latest master to make it easier to review?

@alamb alamb force-pushed the alamb/record_batch_from_iter branch from 00c95ff to fea4615 Compare April 24, 2021 10:23
@alamb alamb requested review from nevi-me and jorgecarleitao April 26, 2021 21:47
@alamb
Contributor Author

alamb commented Apr 26, 2021

fyi @returnString

Contributor

@returnString returnString left a comment

Nice, looking forward to less ceremony in test cases 👍

Member

@jorgecarleitao jorgecarleitao left a comment

Sorry, I thought I had approved this, but it was prob. on the old repo. Thanks for the ping.

This looks great, including great documentation and tests! 👍

@jorgecarleitao
Member

FYI, I think this closes #210. Could you add it to the description so GitHub picks it up?

@alamb alamb merged commit 51513c1 into apache:master Apr 27, 2021
@alamb alamb deleted the alamb/record_batch_from_iter branch April 27, 2021 10:44
@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Jul 14, 2021
Labels
arrow Changes to the arrow crate enhancement Any new improvement worthy of a entry in the changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add Builder interface for adding Arrays to record batches
4 participants