Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds optional serde support to datafusion-proto #2892

Merged
merged 5 commits into from
Jul 17, 2022

Conversation

tustvold
Copy link
Contributor

Which issue does this PR close?

Closes #2889

Rationale for this change

This provides a way to serialize the datafusion-proto messages to any serde serializer, e.g. JSON, based on the protobuf JSON specification.

What changes are included in this PR?

Uses pbjson to auto-generate the serde implementations for the protobuf messages. Full disclosure: I am the primary author and maintainer of this crate.

Are there any user-facing changes?

No

macro_rules! roundtrip_expr_test {
($initial_struct:ident, $ctx:ident) => {
let proto: protobuf::LogicalExprNode = (&$initial_struct).try_into().unwrap();
fn roundtrip_expr_test<T, E>(initial_struct: T, ctx: SessionContext)
Copy link
Contributor Author

@tustvold tustvold Jul 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drive by refactor to eliminate an unnecessary macro, as they complicate development, refactoring, debugging, etc...

@codecov-commenter
Copy link

codecov-commenter commented Jul 13, 2022

Codecov Report

Merging #2892 (2d6c87a) into master (eed77a2) will decrease coverage by 0.02%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #2892      +/-   ##
==========================================
- Coverage   85.34%   85.32%   -0.03%     
==========================================
  Files         276      273       -3     
  Lines       49290    49351      +61     
==========================================
+ Hits        42069    42110      +41     
- Misses       7221     7241      +20     
Impacted Files Coverage Δ
datafusion/proto/src/bytes/mod.rs 82.75% <ø> (ø)
datafusion/proto/src/lib.rs 93.32% <100.00%> (+0.10%) ⬆️
datafusion/optimizer/src/optimizer.rs 82.69% <0.00%> (-5.55%) ⬇️
datafusion/expr/src/expr.rs 83.60% <0.00%> (-2.50%) ⬇️
datafusion/expr/src/window_frame.rs 92.43% <0.00%> (-0.85%) ⬇️
datafusion/core/src/physical_plan/metrics/value.rs 86.93% <0.00%> (-0.51%) ⬇️
datafusion/expr/src/binary_rule.rs 84.65% <0.00%> (-0.39%) ⬇️
datafusion/expr/src/logical_plan/builder.rs 89.64% <0.00%> (-0.22%) ⬇️
datafusion/sql/src/planner.rs 81.31% <0.00%> (-0.07%) ⬇️
datafusion/optimizer/src/projection_push_down.rs 98.06% <0.00%> (-0.03%) ⬇️
... and 23 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eed77a2...2d6c87a. Read the comment docs.

E: Debug,
{
let proto: protobuf::LogicalExprNode = (&initial_struct).try_into().unwrap();
let round_trip: Expr = parse_expr(&proto, &ctx).unwrap();
Copy link
Contributor

@avantgardnerio avantgardnerio Jul 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making this PR! I'm sure it's a step in the correct direction, but after cloning it, I'm having trouble figuring out how to extend it to include LogicalPlans as well. There appears to be no equivalent parse_plan()?

If I'm reading this correctly, it looks like the protobuf generator generated an entirely separate copy of all the Expr & LogicalPlan enums, and we created some fairly large functions to convert back and forth between them like parse_expr? If so the next step would be to write parse_plan?

It does look from other tests like we are able to roundtrip plans to protobuf, so maybe I am missing something. When I tried:

        let protobuf =
            protobuf::LogicalPlanNode::try_from_logical_plan(&topk_plan, &extension_codec)?;
        let string = serde_json::to_string(&protobuf).unwrap();

I get

     |         let string = serde_json::to_string(&protobuf).unwrap();
     |                      --------------------- ^^^^^^^^^ the trait `serde::ser::Serialize` is not implemented for `LogicalPlanNode`
     |                      |
     |                      required by a bound introduced by this call

Sorry, I'm a bit lost. Any help is appreciated :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I was running my tests without the "serde" feature 🤦

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I got all that sorted, and was able to do what I was trying to use this PR for: tustvold#63

I was hoping (naively) to see something like:

{Projection: { field: "#employee_csv.id",
  Filter: {expr: {left: "#employee_csv.state", op: "IN", right: {
    Subquery: {
      TableScan: {name: "employee_csv", projection=["state"]},
    TableScan: {name: "employee_csv" projection=["id", "state"]
}}

Unfortunately (though serializable and deserializable) the result is not as human readable as I hoped, and probably not something I'd want to check into a test assertion: https://gist.github.com/avantgardnerio/35a04950ca768fdfe6579aea08126b74

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above being said, I can see other uses for JSON serialization that extend beyond my narrow use-case that could make this PR valuable enough to merge on its own merits.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Running that raw JSON through this formatter produces some encouraging output:

{
  "projection": {
    "input": {
      "sort": {
        "input": {
          "projection": {
            "input": {
              "listingScan": {
                "tableName": "alltypes_plain",
                "paths": ["file:///home/bgardner/workspace/arrow-datafusion/parquet-testing/data/alltypes_plain.parquet"],
                "fileExtension": ".parquet",
                "schema": {
                  "columns": [
                    {"name": "id", "arrowType": {"INT32": {}}, "nullable": true},
                    {"name": "bool_col", "arrowType": {"BOOL": {}}, "nullable": true},
                    {"name": "tinyint_col", "arrowType": {"INT32": {}}, "nullable": true},
                    {"name": "smallint_col", "arrowType": {"INT32": {}}, "nullable": true},
                    {"name": "int_col", "arrowType": {"INT32": {}}, "nullable": true},
                    {"name": "bigint_col", "arrowType": {"INT64": {}}, "nullable": true},
                    {"name": "float_col", "arrowType": {"FLOAT32": {}}, "nullable": true},
                    {"name": "double_col", "arrowType": {"FLOAT64": {}}, "nullable": true},
                    {"name": "date_string_col", "arrowType": {"BINARY": {}}, "nullable": true},
                    {"name": "string_col", "arrowType": {"BINARY": {}}, "nullable": true},
                    {"name": "timestamp_col", "arrowType": {"TIMESTAMP": {"timeUnit": "Nanosecond"}}, "nullable": true}
                  ]
                },
                "collectStat": true,
                "targetPartitions": 16,
                "parquet": {"enablePruning": true}
              }
            },
            "expr": [
              {"column": {"name": "id", "relation": {"relation": "alltypes_plain"}}},
              {"column": {"name": "int_col", "relation": {"relation": "alltypes_plain"}}},
              {"column": {"name": "double_col", "relation": {"relation": "alltypes_plain"}}}
            ]
          }
        },
        "expr": [
          {"sort": {"expr": {"column": {"name": "int_col", "relation": {"relation": "alltypes_plain"}}}, "asc": true}},
          {"sort": {"expr": {"column": {"name": "double_col", "relation": {"relation": "alltypes_plain"}}}, "asc": true}}
        ]
      }
    },
    "expr": [{"column": {"name": "id", "relation": {"relation": "alltypes_plain"}}}]
  }
}

Copy link
Contributor

@avantgardnerio avantgardnerio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After messing around a bit, I'm convinced this is just what we need. It does look like the DefaultExtensionCodec is not exposed in other packages? So maybe it makes sense to expose methods like logical_plan_to_json() for similar reasons as logical_plan_to_bytes() - making it easier to use from outside the proto crate...

Copy link
Contributor

@avantgardnerio avantgardnerio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please take a look at tustvold#64 before merging into Apache. I think those couple tweaks are important.

Copy link
Member

@andygrove andygrove left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

* Misc suggestions

* Update datafusion/proto/Cargo.toml

Co-authored-by: Raphael Taylor-Davies <[email protected]>

Co-authored-by: Raphael Taylor-Davies <[email protected]>
serde = { version = "1.0", optional = true }
serde_json = { version = "1.0", optional = true }
pbjson = { version = "0.3", optional = true }
pbjson-types = { version = "0.3", optional = true }

[dev-dependencies]
doc-comment = "0.3"
tokio = "1.18"

[build-dependencies]
tonic-build = { version = "0.7" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
tonic-build = { version = "0.7" }
pbjson-build = { version = "0.3", optional = true }


[dev-dependencies]
doc-comment = "0.3"
tokio = "1.18"

[build-dependencies]
tonic-build = { version = "0.7" }
pbjson-build = { version = "0.3", optional = true }
prost-build = { version = "0.7" }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was actually no reason for this to depend on tonic-build, and not just prost-build. This is likely an orphan from when ballista was extracted

Copy link
Contributor

@avantgardnerio avantgardnerio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cargo fix sugestions

@@ -33,18 +33,24 @@ name = "datafusion_proto"
path = "src/lib.rs"

[features]
default = []
json = ["pbjson", "pbjson-build", "serde", "serde_json"]

[dependencies]
arrow = { version = "18.0.0" }
datafusion = { path = "../core", version = "10.0.0" }
datafusion-common = { path = "../common", version = "10.0.0" }
datafusion-expr = { path = "../expr", version = "10.0.0" }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
datafusion-expr = { path = "../expr", version = "10.0.0" }
datafusion-expr = { path = "../expr", version = "10.0.0" }
pbjson = { version = "0.3", optional = true }
pbjson-types = { version = "0.3", optional = true }

serde = { version = "1.0", optional = true }
serde_json = { version = "1.0", optional = true }
pbjson = { version = "0.3", optional = true }
pbjson-types = { version = "0.3", optional = true }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pbjson-types = { version = "0.3", optional = true }


serde = { version = "1.0", optional = true }
serde_json = { version = "1.0", optional = true }
pbjson = { version = "0.3", optional = true }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pbjson = { version = "0.3", optional = true }

@tustvold tustvold merged commit c67161b into apache:master Jul 17, 2022
@ursabot
Copy link

ursabot commented Jul 17, 2022

Benchmark runs are scheduled for baseline = c528986 and contender = c67161b. c67161b is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

JSON version of display_indent()
5 participants