Have an ffi c example that actually reads data #203

Merged: 45 commits, Jul 1, 2024

Conversation

@nicklan (Collaborator) commented May 13, 2024

Flesh out what a fuller FFI API for data reading would look like. We can certainly iterate on this.

This adds a lot more to the example for using the C FFI (sorry for another big PR; it's mostly example code, at least):

  • Actually read the data
  • Include partition columns
  • Print out the table
  • Use clang-format to get consistent C style
  • Add all the gcc warnings, make them errors, and fix them

Unfortunately the GLib Arrow API doesn't seem to allow passing options to control how we print things, so this will print a max of 4 rows of the table. To print more, we'd need to write our own printing function.

This also adds engine_funcs.rs, which currently only exposes read_parquet_files. This is more of a POC, but seems like potentially the right way forward for exposing the various functions available to Rust when it has an Engine.
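
The "call the engine back once per file" flow that the example relies on can be sketched generically. This is an illustrative pattern only, with hypothetical names, not the actual delta-kernel-rs FFI signatures: a kernel-side function takes an engine context pointer plus an extern "C" callback and invokes it once per file to read.

```rust
use std::os::raw::c_void;

// Hypothetical callback type: kernel calls engine back with (context, path bytes).
type PerFileCallback = extern "C" fn(context: *mut c_void, path_ptr: *const u8, path_len: usize);

// Stand-in for a kernel-side scan: invoke the engine's callback once per file.
fn visit_scan_files(files: &[&str], context: *mut c_void, callback: PerFileCallback) {
    for file in files {
        callback(context, file.as_ptr(), file.len());
    }
}

// Engine-side callback: recover the typed context and record the path to read.
extern "C" fn record_file(context: *mut c_void, path_ptr: *const u8, path_len: usize) {
    let paths = unsafe { &mut *(context as *mut Vec<String>) };
    let bytes = unsafe { std::slice::from_raw_parts(path_ptr, path_len) };
    paths.push(String::from_utf8_lossy(bytes).into_owned());
}

fn main() {
    let mut collected: Vec<String> = Vec::new();
    visit_scan_files(
        &["letter=a/part-00000.parquet", "letter=e/part-00001.parquet"],
        &mut collected as *mut Vec<String> as *mut c_void,
        record_file,
    );
    assert_eq!(collected.len(), 2);
}
```

The context pointer is what lets the engine accumulate state (here, a list of paths; in the real example, Arrow batches) across callbacks without any globals.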

A couple of example runs:

Reading the basic_partitioned table:

Reading table at file:///home/nick/databricks/delta-kernel-rs/acceptance/tests/dat/out/reader_tests/generated/basic_partitioned/delta/
version: 1

Building schema
Making a list of length 3 with id 0
Asked to visit string named letter belonging to list 0.
Asked to visit long named number belonging to list 0.
Asked to visit double named a_float belonging to list 0.
Schema returned in list 0
Done building schema
Schema:
├─ letter: string
├─ number: long
└─ a_float: double

Starting table scan

Building list of partition columns
Done iterating partition columns
Partition columns are:
  - letter

Iterating scan data

Scan iterator found some data to read
  Of this data, here is a selection vector
    sel[0] = 0
    sel[1] = 1
    sel[2] = 1
    sel[3] = 1
Asking kernel to call us back for each scan row (file to read)
Called back to read file: letter=a/part-00000-b17b8597-4c08-4867-b882-2249d604cbc2.c000.snappy.parquet
  No selection vector for this file
  partition 'letter' here: a
  Reading parquet file at file:///home/nick/databricks/delta-kernel-rs/acceptance/tests/dat/out/reader_tests/generated/basic_partitioned/delta/letter=a/part-00000-b17b8597-4c08-4867-b882-2249d604cbc2.c000.snappy.parquet
  Converting read data to arrow
  Adding partition column 'letter' with value 'a' at column 2
  Added batch to arrow context, have 1 batches in context now
  Done reading parquet file
Called back to read file: letter=e/part-00000-90f8c03f-e005-4714-a8ac-55bcf6d94d7d.c000.snappy.parquet
  No selection vector for this file
  partition 'letter' here: e
  Reading parquet file at file:///home/nick/databricks/delta-kernel-rs/acceptance/tests/dat/out/reader_tests/generated/basic_partitioned/delta/letter=e/part-00000-90f8c03f-e005-4714-a8ac-55bcf6d94d7d.c000.snappy.parquet
  Converting read data to arrow
  Adding partition column 'letter' with value 'e' at column 2
  Added batch to arrow context, have 2 batches in context now
  Done reading parquet file
Called back to read file: letter=f/part-00000-4da5e756-3c84-4b40-b483-ebcf3c4ed6ce.c000.snappy.parquet
  No selection vector for this file
  partition 'letter' here: f
  Reading parquet file at file:///home/nick/databricks/delta-kernel-rs/acceptance/tests/dat/out/reader_tests/generated/basic_partitioned/delta/letter=f/part-00000-4da5e756-3c84-4b40-b483-ebcf3c4ed6ce.c000.snappy.parquet
  Converting read data to arrow
  Adding partition column 'letter' with value 'f' at column 2
  Added batch to arrow context, have 3 batches in context now
  Done reading parquet file

Scan iterator found some data to read
  Of this data, here is a selection vector
    sel[0] = 0
    sel[1] = 0
    sel[2] = 0
    sel[3] = 1
    sel[4] = 1
    sel[5] = 1
Asking kernel to call us back for each scan row (file to read)
Called back to read file: letter=a/part-00000-8fe836ee-ec38-4388-91a1-5c3eff176e84.c000.snappy.parquet
  No selection vector for this file
  partition 'letter' here: a
  Reading parquet file at file:///home/nick/databricks/delta-kernel-rs/acceptance/tests/dat/out/reader_tests/generated/basic_partitioned/delta/letter=a/part-00000-8fe836ee-ec38-4388-91a1-5c3eff176e84.c000.snappy.parquet
  Converting read data to arrow
  Adding partition column 'letter' with value 'a' at column 2
  Added batch to arrow context, have 4 batches in context now
  Done reading parquet file
Called back to read file: letter=b/part-00000-2465ef7c-268e-43be-98fe-c9c37a2a6482.c000.snappy.parquet
  No selection vector for this file
  partition 'letter' here: b
  Reading parquet file at file:///home/nick/databricks/delta-kernel-rs/acceptance/tests/dat/out/reader_tests/generated/basic_partitioned/delta/letter=b/part-00000-2465ef7c-268e-43be-98fe-c9c37a2a6482.c000.snappy.parquet
  Converting read data to arrow
  Adding partition column 'letter' with value 'b' at column 2
  Added batch to arrow context, have 5 batches in context now
  Done reading parquet file
Called back to read file: letter=c/part-00000-28c8b160-ee9c-407d-93f5-ff7d8e2a8d3f.c000.snappy.parquet
  No selection vector for this file
  partition 'letter' here: c
  Reading parquet file at file:///home/nick/databricks/delta-kernel-rs/acceptance/tests/dat/out/reader_tests/generated/basic_partitioned/delta/letter=c/part-00000-28c8b160-ee9c-407d-93f5-ff7d8e2a8d3f.c000.snappy.parquet
  Converting read data to arrow
  Adding partition column 'letter' with value 'c' at column 2
  Added batch to arrow context, have 6 batches in context now
  Done reading parquet file
Scan data iterator done
All done reading table data

Table Data:
-----------

number: int64
a_float: double
letter: string
----
number:
  [
    [
      4
    ],
    [
      5
    ],
  ...,
    [
      2
    ],
    [
      3
    ]
  ]
a_float:
  [
    [
      4.4
    ],
    [
      5.5
    ],
  ...,
    [
      2.2
    ],
    [
      3.3
    ]
  ]
letter:
  [
    [
      "a"
    ],
    [
      "e"
    ],
  ...,
    [
      "b"
    ],
    [
      "c"
    ]
  ]

Reading the nested_types table

Reading table at file:///home/nick/databricks/delta-kernel-rs/acceptance/tests/dat/out/reader_tests/generated/nested_types/delta/
version: 0

Building schema
Making a list of length 4 with id 0
Asked to visit integer named pk belonging to list 0.
Making a list of length 2 with id 1
Asked to visit double named float64 belonging to list 1.
Asked to visit boolean named bool belonging to list 1.
Asked to visit struct named struct belonging to list 0. Children are in 1.
Making a list of length 1 with id 2
Asked to visit short named array_element belonging to list 2.
Asked to visit array named array (contains null: true) belonging to list 0. Types are in 2.
Making a list of length 2 with id 3
Asked to visit string named map_key belonging to list 3.
Asked to visit integer named map_value belonging to list 3.
Asked to visit map named map (contains null: true) belonging to list 0. Types are in 3.
Schema returned in list 0
Done building schema
Schema:
├─ pk: integer
├─ struct: struct
│  ├─ float64: double
│  └─ bool: boolean
├─ array (contains null: true): array
│  └─ array_element: short
└─ map (contains null: true): map
   ├─ map_key: string
   └─ map_value: integer

Starting table scan

Building list of partition columns
Done iterating partition columns
Table has no partition columns

Iterating scan data

Scan iterator found some data to read
  Of this data, here is a selection vector
    sel[0] = 0
    sel[1] = 0
    sel[2] = 0
    sel[3] = 1
Asking kernel to call us back for each scan row (file to read)
Called back to read file: part-00000-3b8ca3f0-6444-4ed5-8961-c605cba95bf1-c000.snappy.parquet
  No selection vector for this file
  Reading parquet file at file:///home/nick/databricks/delta-kernel-rs/acceptance/tests/dat/out/reader_tests/generated/nested_types/delta/part-00000-3b8ca3f0-6444-4ed5-8961-c605cba95bf1-c000.snappy.parquet
  Converting read data to arrow
  Added batch to arrow context, have 1 batches in context now
  Done reading parquet file
Scan data iterator done
All done reading table data

Table Data:
-----------

pk: int32
struct: struct<float64: double, bool: bool>
  child 0, float64: double
  child 1, bool: bool
array: list<element: int16>
  child 0, element: int16
map: map<string, int32>
  child 0, entries: struct<key: string not null, value: int32> not null
      child 0, key: string not null
      child 1, value: int32
----
pk:
  [
    [
      0,
      1,
      2,
      3,
      4
    ]
  ]
struct:
  [
    -- is_valid: all not null
    -- child 0 type: double
      [
        0,
        1,
        2,
        3,
        4
      ]
    -- child 1 type: bool
      [
        true,
        false,
        true,
        false,
        true
      ]
  ]
array:
  [
    [
      [
        0
      ],
      [
        0,
        1
      ],
      [
        0,
        1,
        2
      ],
      [
        0,
        1,
        2,
        3
      ],
      [
        0,
        1,
        2,
        3,
        4
      ]
    ]
  ]
map:
  [
    [
      keys:
      []
      values:
      [],
      keys:
      [
        "0"
      ]
      values:
      [
        0
      ],
      keys:
      [
        "0",
        "1"
      ]
      values:
      [
        0,
        1
      ],
      keys:
      [
        "0",
        "1",
        "2"
      ]
      values:
      [
        0,
        1,
        2
      ],
      keys:
      [
        "0",
        "1",
        "2",
        "3"
      ]
      values:
      [
        0,
        1,
        2,
        3
      ]
    ]
  ]

@nicklan nicklan marked this pull request as ready for review May 14, 2024 23:43
@nicklan nicklan requested review from scovich and zachschuermann May 14, 2024 23:43
@nicklan nicklan requested a review from roeap June 7, 2024 20:53
@nicklan (Collaborator, Author) commented Jun 7, 2024

@scovich if you have time, I think this is ready for review (but no worries if not!)

@@ -113,7 +114,7 @@ impl TryFromStringSlice for String {
///
/// The slice must be a valid (non-null) pointer, and must point to the indicated number of
/// valid utf8 bytes.
unsafe fn try_from_slice(slice: KernelStringSlice) -> DeltaResult<String> {
unsafe fn try_from_slice(slice: &KernelStringSlice) -> DeltaResult<String> {
Collaborator Author:

@scovich this is obviously a significant change. I can't think of a way it causes an issue, though, since we basically only require the lifetime of the slice pointer to be "this function", which is the same as before. We call slice.into(), which allocates, so as soon as this returns the pointer can become invalid, and that's fine.

This is more flexible because we can have a struct with a KernelStringSlice, and then can turn it into a String, without having to consume the whole struct.

@scovich (Collaborator) commented Jun 7, 2024:

Quick drive-by response (will try to look more later): How often do we need to consume a kernel string slice from a struct without consuming the struct? And in such cases, should we just create a new slice object since it's anyway just a borrowed reference to a(n externally owned) string?

However, that got me thinking. Currently, KernelStringSlice is neither Copy nor Clone, in an attempt to encourage at least semi-sane semantics. However, it can still be "moved" from, which includes returning it from a function or assigning it to a heap-allocated struct field, so I'm not sure how much protection we really get? It seems like any copy or clone can only outlive the original if it gets moved-from -- e.g. returned from a function -- but the same is already true for the original instance?? Maybe this is another case where the compiler can't save us and we have to rely on idiomatic code patterns instead?

Collaborator:

However, IIRC we moved to passing by value instead of by reference, because it was causing really weird lifetime issues. So I'd be very cautious moving back to references without understanding that previous issue better. I don't remember details any more, sadly...

Collaborator:

Update: Thinking more, the lifetime safety issues mentioned above seem orthogonal to whether try_from_slice takes a value or a reference? And if we look at it in isolation, taking a reference is actually the right approach, because creating a String really doesn't consume the backing string the slice references. Similar to how the language has impl From<&str> for String?

Collaborator Author:

Yeah, I think that captures nicely what I was more vaguely thinking. Basically, because try_from_slice doesn't actually consume the slice, it makes sense to take it as a reference.

I've added to the comment of this function that it doesn't consume the slice. LMK if you have further concerns about this change though.
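
The `impl From<&str> for String` analogy can be shown directly: constructing an owned String from a borrowed slice allocates a copy and leaves the borrowed view untouched, which is exactly why a by-reference signature fits a conversion that only reads its input. A minimal sketch (plain `&str` standing in for `&KernelStringSlice`):

```rust
fn main() {
    let backing = String::from("letter=a/part-00000.parquet");
    let slice: &str = &backing;       // borrowed view, like &KernelStringSlice
    let owned = String::from(slice);  // allocates a copy; does not consume `slice`

    // The borrowed view (and its backing string) remain usable afterwards.
    assert_eq!(slice, owned);
    assert_eq!(backing, owned);
}
```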

ffi/examples/read-table/arrow.h (outdated; resolved)
Comment on lines 74 to 75
GArrowStringArrayBuilder* builder = garrow_string_array_builder_new();
for (gint64 i = 0; i < rows; i++) {
Collaborator:

GArrow doesn't give a way to convert a literal into a column?

Collaborator Author:

I don't think so; I did look for one, actually. The docs don't seem to have anything like that.

The Rust Arrow library also doesn't have one, and uses std::iter::repeat.
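
The iter::repeat approach is easy to see in plain Rust, independent of any Arrow builder: a constant column is just one literal value repeated once per row, which is exactly what adding a partition-column value to a batch amounts to. A minimal sketch:

```rust
fn main() {
    // Simulate adding a partition column: the literal partition value,
    // repeated once per row in the batch.
    let rows = 4;
    let partition_value = "a";
    let column: Vec<&str> = std::iter::repeat(partition_value).take(rows).collect();
    assert_eq!(column, vec!["a", "a", "a", "a"]);
}
```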

ffi/examples/read-table/arrow.h (outdated; resolved)
Collaborator:

This seems like a lot of meaty code for a .h file... is it difficult to set up a multi-object compile for this example? Or just not worth the trouble?

Collaborator Author:

moved out

g_object_unref(builder);
if (ret == NULL) {
printf("Error in building boolean array");
if (error != NULL) {
Collaborator:

Didn't we already prove error is not null at L158?

Collaborator Author:

No, because we called garrow_array_builder_finish with &error after that, so now it might not be null. I don't like this way of doing error handling, but that's how GLib does it.

Comment on lines 82 to 84
// TODO: calling `into_raw` in the visitor causes this to segfault
// perhaps the callback needs to indicate if it took ownership or not
// unsafe { BoxHandle::drop_handle(data_handle) };
Collaborator:

Seems like we should figure out this ownership question ASAP? Risk of memory leak vs. risk of sig11 is not a great trade-off to be making.

We generally have a policy that engine chooses when to drop a handle, once created, and we've been steadily eliminating other kernel code that drops or consumes handles on behalf of engine. Is there a reason this code should be different? Or can we just stick to the policy that the visitor is responsible to ensure the handle gets dropped when the engine no longer needs it? (most likely the visitor would just drop it, but in theory somebody could choose to accumulate handles for some reason)

Collaborator Author:

Yeah, so here's my thinking: this data is allocated by the engine (or at least by the Engine::[parquet/json]_handler), so it makes sense that it would be "owned" by the engine, and we wouldn't try to free it. Furthermore, exactly what the example program does is probably "normal": you get called back with some data read from a file, and you're going to put it into the set of results and pass it along. You really don't want this to be freed by the kernel (usually).

So what I've done is:

  1. Make the type be Handle<ExclusiveEngineData>. To me this indicates that the engine owns this data, and that the kernel can't make any promises about it regarding thread safety.
  2. Add a free_engine_data function for the case where an engine actually does want the kernel to do the freeing
  3. Add to the comment for this function that the engine owns this data and needs to call the above if it wants the kernel to free things

This might get more complex if we need an ExternEngineData, but I think for that we'd have some kind of "into_raw" that engines could use to get back their data, and the kernel would free just the "rust bits" of it.
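
The ownership rule being described (kernel allocates, engine owns the handle, engine explicitly asks the kernel to free it exactly once) is the standard Box::into_raw / Box::from_raw pattern. A generic sketch with illustrative names, not the kernel's actual API:

```rust
// Hypothetical payload standing in for engine data behind a handle.
struct EngineData {
    rows: usize,
}

// "Kernel" side: allocate and release ownership to the engine as a raw pointer.
fn create_engine_data(rows: usize) -> *mut EngineData {
    Box::into_raw(Box::new(EngineData { rows }))
}

// "Engine" asks "kernel" to reclaim the allocation; rebuilding the Box frees it.
// Safety: `handle` must come from create_engine_data and be freed exactly once.
unsafe fn free_engine_data(handle: *mut EngineData) {
    drop(Box::from_raw(handle));
}

fn main() {
    let handle = create_engine_data(5);
    // The engine is free to stash the handle (e.g. accumulate batches) first.
    let rows = unsafe { (*handle).rows };
    assert_eq!(rows, 5);
    unsafe { free_engine_data(handle) };
}
```

Skipping the free call leaks; calling it twice is undefined behavior, which is why "engine drops the handle when done, kernel stays out of it" is the only policy that composes.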

Collaborator:

Not sure I quite followed point 2? Doesn't engine always need to call free_engine_data in order to clean up the rust bits? (but 100% agree that kernel should otherwise stay out of it)

ffi/src/engine_funcs.rs (outdated; resolved)
ffi/src/engine_funcs.rs (outdated; resolved)
pub unsafe extern "C" fn snapshot_table_root(
snapshot: Handle<SharedSnapshot>,
allocate_fn: AllocateStringFn,
) -> NullableCvoid {
Collaborator:

intentionally opaque?

Collaborator Author:

Not quite sure I know what you mean. AllocateStringFn returns a void* so the engine can allocate whatever it likes, and then we just return that.

I find that style of usage easier from c-land, where you can do something like:

  char* table_root = snapshot_table_root(snapshot, allocate_string);

Collaborator:

Gotcha. The one annoyance is that we lose some type safety -- it becomes difficult to distinguish one void* from another -- especially when there are callbacks involved.

I'm not sure which annoyance is worse, but I generally favor safety over convenience. Thoughts?

Collaborator Author:

I'm not sure how much it matters. AllocateStringFn could be a callback that takes a slice and doesn't return anything, but then we'd have to have a "context" that we include too, and that would have to be a void*.

I'm not coming up with any other way that doesn't involve some kind of gross global string thing on the C side that the allocate_fn could write into, but perhaps I'm missing something :)

Collaborator:

Sorry, I missed that this was engine returning something for its own use (not for kernel's use). Agree that void* is the way to go!
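
The allocate-callback shape under discussion can be sketched in Rust, with the "engine" side also written in Rust for testability. The types and function names here are illustrative stand-ins, not the real FFI signatures: the kernel passes the string as (ptr, len), the engine-supplied callback allocates its own copy and returns an opaque pointer, and the kernel hands that pointer straight back.

```rust
use std::os::raw::c_void;

// Hypothetical callback type: engine allocates a copy of the string however it likes.
type AllocateStringFn = extern "C" fn(ptr: *const u8, len: usize) -> *mut c_void;

// "Engine" side: copy the bytes into an owned String, return it as an opaque pointer.
extern "C" fn allocate_string(ptr: *const u8, len: usize) -> *mut c_void {
    let bytes = unsafe { std::slice::from_raw_parts(ptr, len) };
    let owned = String::from_utf8_lossy(bytes).into_owned();
    Box::into_raw(Box::new(owned)) as *mut c_void
}

// "Kernel" side: produce the table root and let the engine allocate the result.
fn snapshot_table_root_sketch(allocate_fn: AllocateStringFn) -> *mut c_void {
    let root = "file:///tmp/my_table/";
    allocate_fn(root.as_ptr(), root.len())
}

fn main() {
    let raw = snapshot_table_root_sketch(allocate_string);
    // Only the engine knows the concrete type behind the void*, so only it
    // can (and must) recover and eventually free the allocation.
    let root = unsafe { Box::from_raw(raw as *mut String) };
    assert_eq!(*root, "file:///tmp/my_table/");
}
```

This is why the void* is tolerable here: the kernel never interprets the pointer, it is purely a value the engine round-trips to itself.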

/// # Safety
/// Caller is responsible for calling with a valid `ExternEngineHandle` and `FileMeta`
#[no_mangle]
pub unsafe extern "C" fn read_parquet_file(
Collaborator Author:

note: I've renamed this to read_parquet_file and not read_parquet_files, because it takes just a single file as an argument.

We will probably want to add a read_parquet_files at some point, but because passing iterators over FFI is so much more complex, I think it's worth having this "simple" version.

@nicklan nicklan requested a review from scovich June 22, 2024 01:12
Comment on lines 167 to 169
if (error == NULL) {
GArrowArray* ret = garrow_array_builder_finish((GArrowArrayBuilder*)builder, &error);
g_object_unref(builder);
Collaborator:

Rescuing #203 (comment), I would suggest making the control flow a bit more obvious, and also reducing indentation:

if (error != NULL) {
  return NULL;
}

if (ret == NULL) {
  ...
}
return (GArrowBooleanArray*)ret;

.visit_date = visit_date,
.visit_timestamp = visit_timestamp,
.visit_timestamp_ntz = visit_timestamp_ntz };
EngineSchemaVisitor visitor = {
Collaborator:

aside: We had discussed in the past how adding new non-nullable struct members could silently break things; and even doing a runtime check would be annoying because the problem wouldn't manifest until the code ran.

One idiomatic alternative might be an initializer function that takes all the arguments and stuffs them into a struct instance that gets returned. The arguments are non-defaulted, so any change in arg count would produce compilation errors as desired. We could also stop people from initializing the struct directly if we made it opaque+malloc+free, but I'm not sure that's worth the trouble. Seems like giving people a way to do it safely is enough for those who care about safety?

Collaborator Author:

Yeah, that's a nice idea. Created #268
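
The safety argument behind the initializer-function idea can be sketched in Rust terms (the visitor type here is hypothetical, not the real EngineSchemaVisitor): if callers must go through a constructor that takes every field as a separate, non-defaulted argument, adding a new field breaks every call site at compile time instead of leaving a silently zero-initialized member.

```rust
// Hypothetical two-callback visitor; fn pointers stand in for C callbacks.
struct SchemaVisitor {
    visit_string: fn(&str) -> String,
    visit_long: fn(&str) -> String,
}

impl SchemaVisitor {
    // Every field is a required argument: adding a third visitor field makes
    // all existing `new` call sites fail to compile, which is the point.
    fn new(visit_string: fn(&str) -> String, visit_long: fn(&str) -> String) -> Self {
        SchemaVisitor { visit_string, visit_long }
    }
}

fn main() {
    fn on_string(name: &str) -> String { format!("string field: {name}") }
    fn on_long(name: &str) -> String { format!("long field: {name}") }
    let v = SchemaVisitor::new(on_string, on_long);
    assert_eq!((v.visit_string)("letter"), "string field: letter");
    assert_eq!((v.visit_long)("number"), "long field: number");
}
```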

@scovich (Collaborator) left a comment:

LGTM, couple nits. Sorry for the split review, github weirdness there.

let physical_schema = unsafe { physical_schema.clone_as_arc() };
let res = read_parquet_files_impl(extern_engine.clone(), file, physical_schema);
res.into_extern_result(&extern_engine.as_ref())
let path = unsafe { String::try_from_slice(&file.path) };
Collaborator:

aside: why is it called try_from_slice if it doesn't return a Result? Should we call it from_slice instead?

Collaborator Author:

It does return a Result; I just don't unwrap it here. Note that read_parquet_file_impl takes a DeltaResult<String> and uses ? on that arg. That lets us have all the "error" cases handled by the single return from read_parquet_file_impl two lines below.
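
The error-plumbing pattern described here is easy to sketch: the caller forwards an un-unwrapped Result, and the `_impl` function applies `?` to it, so every failure funnels through one return point. The names below are illustrative (mirroring try_from_slice / read_parquet_file_impl), not the actual kernel code:

```rust
type DeltaResult<T> = Result<T, String>;

// Stand-in for try_from_slice: may fail on invalid utf8.
fn try_from_slice(raw: &[u8]) -> DeltaResult<String> {
    String::from_utf8(raw.to_vec()).map_err(|e| e.to_string())
}

// Takes the Result itself as an argument; `?` is the single error exit.
fn read_parquet_file_impl(path: DeltaResult<String>) -> DeltaResult<String> {
    let path = path?;
    Ok(format!("reading {path}"))
}

fn main() {
    let ok = read_parquet_file_impl(try_from_slice(b"part-00000.parquet"));
    assert_eq!(ok.unwrap(), "reading part-00000.parquet");

    // Invalid utf8 flows through the same single return point as an Err.
    let bad = read_parquet_file_impl(try_from_slice(&[0xff, 0xfe]));
    assert!(bad.is_err());
}
```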

@hntd187 (Collaborator) left a comment:

Portability aside, LGTM. We should eventually make a working all-OS example, but that's for another day.

@scovich (Collaborator) commented Jul 1, 2024

@nicklan -- Anything blocking merge on this PR?

@nicklan nicklan merged commit 43b2f25 into delta-io:main Jul 1, 2024
9 checks passed
@nicklan (Collaborator, Author) commented Jul 1, 2024

@nicklan -- Anything blocking merge on this PR?

Nope, merged! Thanks for the reviews, I know it was a lot of code.
