Difficulty using the stream_arrow API #418

ianthetechie · 2024-12-30T04:39:13Z

The stream_arrow API on Statement is rather clunky to use, since it requires a SchemaRef argument. I'm not quite sure why query_arrow doesn't require this, but stream_arrow does.

Additionally, it's not documented what the differences between the APIs are. I would presume that the stream_arrow API is going to be more memory efficient (as best I can tell at a glance, query_arrow loads everything into memory and doesn't release till it's done), but this isn't explicitly documented anywhere.

The text was updated successfully, but these errors were encountered:

ianthetechie · 2024-12-30T06:38:50Z

What I'm currently doing to hack around manually specifying a schema is this:

fn fetch_schema_for_query(db: &Connection, sql: &str) -> duckdb::Result<SchemaRef> {
    // Append "LIMIT 0" so DuckDB returns the same columns but no rows.
    let schema_sql = format!("{} LIMIT 0", sql);

    let mut statement = db.prepare(&schema_sql)?;
    let arrow_result = statement.query_arrow([])?;

    Ok(arrow_result.get_schema())
}

It's extremely hackish in my opinion, but it does work. I feel like I'm either missing something obvious in the API or else it's just not very ergonomic. Any ideas?

I can also confirm (for any googlers landing here) that the stream_arrow API uses significantly less RAM while (in my use case) having roughly equivalent throughput, so it's worth considering if your dataset might not fit in RAM / you'd like to do other things at the same time.

ianthetechie · 2024-12-31T01:27:32Z

It seems that statement.execute([]) has the same effect as the above and is probably less fragile. Though both approaches do seem to end up using some amount of RAM that's proportional with the result set size. Any insights as to why that is?

nicosuave · 2025-01-11T17:36:37Z

The stream_arrow API on Statement is rather clunky to use, since it requires a SchemaRef argument. I'm not quite sure why query_arrow doesn't require this, but stream_arrow does.

Additionally, it's not documented what the differences between the APIs are. I would presume that the stream_arrow API is going to be more memory efficient (as best I can tell at a glance, query_arrow loads everything into memory and doesn't release till it's done), but this isn't explicitly documented anywhere.

You might want to look over the PR that adds the API.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Difficulty using the stream_arrow API #418

Difficulty using the stream_arrow API #418

ianthetechie commented Dec 30, 2024

ianthetechie commented Dec 30, 2024

ianthetechie commented Dec 31, 2024

nicosuave commented Jan 11, 2025

Difficulty using the stream_arrow API #418

Difficulty using the stream_arrow API #418

Comments

ianthetechie commented Dec 30, 2024

ianthetechie commented Dec 30, 2024

ianthetechie commented Dec 31, 2024

nicosuave commented Jan 11, 2025