-
Notifications
You must be signed in to change notification settings - Fork 56
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Have an ffi c example that actually reads data #203
Conversation
@scovich if you have time, I think this is ready for review (but no worries if not!) |
@@ -113,7 +114,7 @@ impl TryFromStringSlice for String { | |||
/// | |||
/// The slice must be a valid (non-null) pointer, and must point to the indicated number of | |||
/// valid utf8 bytes. | |||
unsafe fn try_from_slice(slice: KernelStringSlice) -> DeltaResult<String> { | |||
unsafe fn try_from_slice(slice: &KernelStringSlice) -> DeltaResult<String> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@scovich this is obviously a significant change. I can't think of a way that it causes an issue though, since we basically only require the lifetime of the slice pointer to be "this function", which was the same as before. We call slice.into()
which allocates, so as soon as this returns, the pointer can become invalid and that's fine.
This is more flexible because we can have a struct with a KernelStringSlice, and then can turn it into a String, without having to consume the whole struct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick drive-by response (will try to look more later): How often do we need to consume a kernel string slice from a struct without consuming the struct? And in such cases, should we just create a new slice object since it's anyway just a borrowed reference to a(n externally owned) string?
However, that got me thinking. Currently, KernelStringSlice
is neither Copy
nor Clone
, in an attempt to encourage at least semi-sane semantics. However, it can still be "moved" from, which includes returning it from a function or assigning it to a heap-allocated struct field, so I'm not sure how much protection we really get? It seems like any copy or clone can only outlive the original if it gets moved-from -- e.g. returned from a function -- but the same is already true for the original instance?? Maybe this is another case where the compiler can't save us and we have to rely on idiomatic code patterns instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However, IIRC we moved to passing by value instead of by reference, because it was causing really weird lifetime issues. So I'd be very cautious moving back to references without understanding that previous issue better. I don't remember details any more, sadly...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: Thinking more, the lifetime safety issues mentioned above seem orthogonal to whether try_from_slice
takes a value or a reference? And if we look at it in isolation, taking a reference is actually the right approach, because creating a String
really doesn't consume the backing string the slice references. Similar to how the language has impl From<&str> for String?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think that captures nicely what I was more vaguely thinking. Basically, because try_from_slice
doesn't actually consume the slice, it makes sense to take it as a reference.
I've added to the comment of this function that it doesn't consume the slice. LMK if you have further concerns about this change though.
ffi/examples/read-table/arrow.h
Outdated
GArrowStringArrayBuilder* builder = garrow_string_array_builder_new(); | ||
for (gint64 i = 0; i < rows; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GArrow doesn't give a way to convert a literal into a column?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so, I did look for one actually. The docs don't seem to have something like that:
Also the rust lib also doesn't have that, and uses std::iter::repeat
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a lot of meaty code for a .h file... is it difficult to set up a multi-object compile for this example? Or just not worth the trouble?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
moved out
ffi/examples/read-table/arrow.h
Outdated
g_object_unref(builder); | ||
if (ret == NULL) { | ||
printf("Error in building boolean array"); | ||
if (error != NULL) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't we already prove error is not null at L158?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, because we called garrow_array_builder_finish
with &error
after that, so now it might not be null. I do not like this way of doing error handling, but that's how glib does it
ffi/src/engine_funcs.rs
Outdated
// TODO: calling `into_raw` in the visitor causes this to segfault | ||
// perhaps the callback needs to indicate if it took ownership or not | ||
// unsafe { BoxHandle::drop_handle(data_handle) }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we should figure out this ownership question ASAP? Risk of memory leak vs. risk of sig11 is not a great trade-off to be making.
We generally have a policy that engine chooses when to drop a handle, once created, and we've been steadily eliminating other kernel code that drops or consumes handles on behalf of engine. Is there a reason this code should be different? Or can we just stick to the policy that the visitor is responsible to ensure the handle gets dropped when the engine no longer needs it? (most likely the visitor would just drop it, but in theory somebody could choose to accumulate handles for some reason)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, so here's my thinking: This data is allocated by the engine (or at least by the Engine::[parquet/json]_handler
). So it makes sense that it would be "owned" by the engine, and we wouldn't try and free it. Furthermore, exactly what the example program is doing is probably "normal". That is, you get called back with some data read from a file, your going to put it into the set of results and pass it along. You really don't want this to be freed by the kernel (usually).
So what I've done is:
- Make the type be
Handle<ExclusiveEngineData>
. This to me indicates that the engine owns this data, and that the kernel can't make any promises about it regarding thread safety. - Add a
free_engine_data
function for the case an engine actually does want the kernel to do the freeing - Add to the comment for this function that the engine owns this data and needs to call the above if it wants kernel to free things
This might get more complex if we need an ExternEngineData
, but I think for that we'd have some kind of "into_raw" that engines could use to get back their data and kernel would free just the "rust bits" of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I quite followed 2/? Doesn't engine always need to call free_engine_data
in order to clean up the rust bits? (but 100% agree that kernel should otherwise stay out of it)
pub unsafe extern "C" fn snapshot_table_root( | ||
snapshot: Handle<SharedSnapshot>, | ||
allocate_fn: AllocateStringFn, | ||
) -> NullableCvoid { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
intentionally opaque?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not quiet sure I know what you mean. AllocateStringFn
returns a void*
so the engine can allocate whatever they like, and then we just return that.
I find that style of usage easier from c-land, where you can do something like:
char* table_root = snapshot_table_root(snapshot, allocate_string);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. The one annoyance is that we lose some type safety -- it becomes difficult to distinguish one void* from another -- especially when there are callbacks involved.
I'm not sure which annoyance is worse, but I generally favor safety over convenience. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure how much it matters. AllocateStringFn
could be a callback that takes a slice and doesn't return anything. Now we to have a "context" that we include too, but then that would have to be a void*
.
I'm not coming up with any other way that doesn't involve some kind of gross global string thing on the c side that the allocate_fn
could write into, but perhaps I'm missing something :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I missed that this was engine returning something for its own use (not for kernel's use). Agree that void* is the way to go!
@@ -113,7 +114,7 @@ impl TryFromStringSlice for String { | |||
/// | |||
/// The slice must be a valid (non-null) pointer, and must point to the indicated number of | |||
/// valid utf8 bytes. | |||
unsafe fn try_from_slice(slice: KernelStringSlice) -> DeltaResult<String> { | |||
unsafe fn try_from_slice(slice: &KernelStringSlice) -> DeltaResult<String> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: Thinking more, the lifetime safety issues mentioned above seem orthogonal to whether try_from_slice
takes a value or a reference? And if we look at it in isolation, taking a reference is actually the right approach, because creating a String
really doesn't consume the backing string the slice references. Similar to how the language has impl From<&str> for String?
/// # Safety | ||
/// Caller is responsible for calling with a valid `ExternEngineHandle` and `FileMeta` | ||
#[no_mangle] | ||
pub unsafe extern "C" fn read_parquet_file( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note: I've renamed this to read_parquet_file
and not read_parquet_files
, because it just takes a single file as argument.
we will probably want to add a read_parquet_files
at some point, but because passing iterators is so much more complex over ffi, I think it's worth it to have this "simple" version.
ffi/examples/read-table/arrow.c
Outdated
if (error == NULL) { | ||
GArrowArray* ret = garrow_array_builder_finish((GArrowArrayBuilder*)builder, &error); | ||
g_object_unref(builder); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rescuing #203 (comment), I would suggest making the control flow a bit more obvious, and also reducing indentation:
if (error != NULL) {
return NULL;
}
if (ret == NULL) {
...
}
return (GArrowBooleanArray*)ret;
.visit_date = visit_date, | ||
.visit_timestamp = visit_timestamp, | ||
.visit_timestamp_ntz = visit_timestamp_ntz }; | ||
EngineSchemaVisitor visitor = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aside: We had discussed in the past how adding new non-nullable struct members could silently break things; and even doing a runtime check would be annoying because the problem wouldn't manifest until the code ran.
One idomatic alternative might be an initializer function that takes all the arguments and stuffs them into a struct instance that gets returned. The arguments are non-defaulted and so any change in arg count would produce compilation errors as desired. We can also stop people from initializing the struct directly if we make it opaque+malloc+free, but I'm not sure that's worth the trouble. Seems like giving people a way to do it safely is enough for those who care about safety?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's a nice idea. Created #268
pub unsafe extern "C" fn snapshot_table_root( | ||
snapshot: Handle<SharedSnapshot>, | ||
allocate_fn: AllocateStringFn, | ||
) -> NullableCvoid { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. The one annoyance is that we lose some type safety -- it becomes difficult to distinguish one void* from another -- especially when there are callbacks involved.
I'm not sure which annoyance is worse, but I generally favor safety over convenience. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, couple nits. Sorry for the split review, github weirdness there.
let physical_schema = unsafe { physical_schema.clone_as_arc() }; | ||
let res = read_parquet_files_impl(extern_engine.clone(), file, physical_schema); | ||
res.into_extern_result(&extern_engine.as_ref()) | ||
let path = unsafe { String::try_from_slice(&file.path) }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aside: why is it called try_from_slice
if it doesn't return a Result
? Should we call it from_slice
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does return a result. I just don't unwrap the result here. Note that read_parquet_file_impl
takes a DeltaResult<String>
and it uses ?
on that arg. That let's us have all the "error" cases handled by the single return from read_parquet_file_impl
two lines below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Portability aside LGMT, we should eventually make a working all OS example but for another day
@nicklan -- Anything blocking merge on this PR? |
Nope, merged! Thanks for the reviews, I know it was a lot of code. |
Flesh out what a more full FFI API for data reading would look like. We can certainly iterate on this.
Lots more into the example for using the c-ffi (sorry for another big PR. It's mostly example code at least):
clang-format
to get consistent c styleUnfortunately the glib arrow api doesn't seem to allow passing options to how we print things, so this will print a max of 4 rows out the table. To print more we'd need to write our own printing function.
This also adds
engine_funcs.rs
which currently only exposesread_parquet_files
. This is more of a POC, but seems like potentially the right way forward for exposing the various functions available to rust when it has anEngine
.A couple of example runs:
Reading the
basic_partitioned
table:Reading the
nested_types
table