Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support sending schemas for empty streams #3594

Merged
merged 8 commits into from
Jan 26, 2023

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jan 24, 2023

Which issue does this PR close?

Closes #3591

Rationale for this change

Some Flight clients expect to see Schema messages even if there is no data (e.g. an empty result set)

What changes are included in this PR?

  1. Add with_schema to FlightDataEncoderBuilder to specify schema when creating streams
  2. Allow users to retrieve the decoded schema in the FlightDataDecoder
  3. Tests

Are there any user-facing changes?

  1. FlightDataDecoder now has schema() that returns the actual schema rather than got_schema() which returns a boolean

  2. FlightDataEncoderBuilder::with_schema

@alamb alamb added the api-change Changes to the arrow API label Jan 24, 2023
@github-actions github-actions bot added the arrow-flight Changes to the arrow-flight crate label Jan 24, 2023
Copy link
Contributor

@tustvold tustvold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some minor nits

arrow-flight/src/encode.rs Outdated Show resolved Hide resolved
arrow-flight/src/decode.rs Outdated Show resolved Hide resolved
arrow-flight/src/decode.rs Outdated Show resolved Hide resolved
arrow-flight/src/decode.rs Show resolved Hide resolved
arrow-flight/src/encode.rs Outdated Show resolved Hide resolved
/// is not specified, an encoded Schema message will be sent when
/// the first [`RecordBatch`], if any, is encoded. Some clients
/// expect a Schema message even if there is no data sent.
pub fn with_schema(mut self, schema: SchemaRef) -> Self {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if it will be more consistent to always require specifying schema instead of two mixed approaches (specifying it by this, getting from first RecordBatch).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was under the impression that the Flight protocol did not require sending a schema , however the format docs don't seem to specify one way or the other; https://arrow.apache.org/docs/format/Flight.html

Requiring a schema might make the API easier to use correctly with other Flight clients (I am pretty sure the issue we hit upstream in IOx was interoperability with the golang client).

Downstream in DataFusion we almost always have a https://docs.rs/datafusion/16.1.0/datafusion/physical_plan/type.SendableRecordBatchStream.html which has a schema attached.

So in summary, I could go either way. Does anyone else have any thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I hit an issue downstream https://github.com/influxdata/influxdb_iox/pull/6711 that makes me think @viirya 's idea to always require a schema is the best (otherwise it is easy to use this interface incorrectly). I'll make another PR shortly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is what requiring a schema to encode the stream would look like: #3609

Co-authored-by: Raphael Taylor-Davies <[email protected]>
@tustvold tustvold removed the api-change Changes to the arrow API label Jan 26, 2023
@tustvold tustvold merged commit 902a17d into apache:master Jan 26, 2023
@ursabot
Copy link

ursabot commented Jan 26, 2023

Benchmark runs are scheduled for baseline = 1afefbb and contender = 902a17d. 902a17d is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow-flight Changes to the arrow-flight crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

FlightDataEncoder Optionally send Schema even when no record batches
4 participants