Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Difficulty using the stream_arrow API #418

Open
ianthetechie opened this issue Dec 30, 2024 · 3 comments
Open

Difficulty using the stream_arrow API #418

ianthetechie opened this issue Dec 30, 2024 · 3 comments

Comments

@ianthetechie
Copy link

The stream_arrow API on Statement is rather clunky to use, since it requires a SchemaRef argument. I'm not quite sure why query_arrow doesn't require this, but stream_arrow does.

Additionally, it's not documented what the differences between the APIs are. I would presume that the stream_arrow API is going to be more memory efficient (as best I can tell at a glance, query_arrow loads everything into memory and doesn't release till it's done), but this isn't explicitly documented anywhere.

@ianthetechie
Copy link
Author

What I'm currently doing to hack around manually specifying a schema is this:

fn fetch_schema_for_query(db: &Connection, sql: &str) -> duckdb::Result<SchemaRef> {
    // Append "LIMIT 0" so DuckDB returns the same columns but no rows.
    let schema_sql = format!("{} LIMIT 0", sql);

    let mut statement = db.prepare(&schema_sql)?;
    let arrow_result = statement.query_arrow([])?;

    Ok(arrow_result.get_schema())
}

It's extremely hackish in my opinion, but it does work. I feel like I'm either missing something obvious in the API or else it's just not very ergonomic. Any ideas?

I can also confirm (for any googlers landing here) that the stream_arrow API uses significantly less RAM while (in my use case) having roughly equivalent throughput, so it's worth considering if your dataset might not fit in RAM / you'd like to do other things at the same time.

@ianthetechie
Copy link
Author

It seems that statement.execute([]) has the same effect as the above and is probably less fragile. Though both approaches do seem to end up using some amount of RAM that's proportional with the result set size. Any insights as to why that is?

@nicosuave
Copy link

The stream_arrow API on Statement is rather clunky to use, since it requires a SchemaRef argument. I'm not quite sure why query_arrow doesn't require this, but stream_arrow does.

Additionally, it's not documented what the differences between the APIs are. I would presume that the stream_arrow API is going to be more memory efficient (as best I can tell at a glance, query_arrow loads everything into memory and doesn't release till it's done), but this isn't explicitly documented anywhere.

You might want to look over the PR that adds the API.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants
@nicosuave @ianthetechie and others