Unable to write high-precision decimal values to Delta table using serde_json/JsonWriter #1778

Closed
ryanaston opened this issue Oct 26, 2023 · 2 comments
Labels
bug Something isn't working

Comments

@ryanaston

Environment

Delta-rs version:
0.15.0 (also tried 0.16.1)

Binding:
Rust

Environment:

  • Cloud provider: any
  • OS: any
  • Other: n/a

Bug

What happened:
Writes began failing when attempting to insert high-precision decimal values into a Delta table using the JsonWriter with a Vec<serde_json::Value>. I discovered that serde_json was deserializing these values as strings in scientific notation, which could not be parsed into the Arrow DecimalType:

Generic DeltaTable error: Failed to convert into Arrow schema: Parser error: can't parse the string value 3.9178294781e-6 to decimal

Some digging uncovered the serde_json feature flag "arbitrary_precision", which retains the value in its full form as a string. However, this too cannot be decoded to an Arrow DecimalType:

Generic DeltaTable error: Failed to convert into Arrow schema: Json error: whilst decoding field 'decimal_col': expected decimal got {"$serde_json::private::Number": "0.0000039178294781"}
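For the first error above, one possible stopgap is to expand the scientific-notation string into plain decimal notation before it reaches the decimal parser. The sketch below is only an illustration using the standard library: `expand_scientific` is a hypothetical helper, not part of delta-rs, arrow-rs, or serde_json, and it assumes the input is a well-formed number with a modest exponent.

```rust
/// Hypothetical helper (not a delta-rs or arrow API): rewrite a number such as
/// "3.9178294781e-6" in plain decimal notation without going through f64.
/// Returns None if the input is not a simple decimal with an optional exponent.
fn expand_scientific(s: &str) -> Option<String> {
    // Split off the exponent; inputs without one are returned unchanged.
    let (mantissa, exp) = match s.split_once(|c: char| c == 'e' || c == 'E') {
        Some((m, e)) => (m, e.parse::<i32>().ok()?),
        None => return Some(s.to_string()),
    };
    let (sign, mantissa) = match mantissa.strip_prefix('-') {
        Some(rest) => ("-", rest),
        None => ("", mantissa),
    };
    let (int_part, frac_part) = mantissa.split_once('.').unwrap_or((mantissa, ""));
    if int_part.is_empty() && frac_part.is_empty() {
        return None;
    }
    if !int_part.chars().chain(frac_part.chars()).all(|c| c.is_ascii_digit()) {
        return None;
    }
    let digits = format!("{int_part}{frac_part}");
    // Index of the decimal point within `digits` after applying the exponent.
    let point = int_part.len() as i64 + exp as i64;
    let out = if point <= 0 {
        // The point moves left past all digits: pad with leading zeros.
        format!("0.{}{}", "0".repeat((-point) as usize), digits)
    } else if point as usize >= digits.len() {
        // The point moves right past all digits: pad with trailing zeros.
        format!("{}{}", digits, "0".repeat(point as usize - digits.len()))
    } else {
        format!("{}.{}", &digits[..point as usize], &digits[point as usize..])
    };
    Some(format!("{sign}{out}"))
}

fn main() {
    // The string taken from the error message in this report.
    let plain = expand_scientific("3.9178294781e-6").unwrap();
    println!("{plain}");
}
```

Because the digits are only shifted as text, nothing is rounded. This does not address values that were already truncated when they were parsed into an f64.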

What you expected to happen:
High-precision decimal values should be written accurately and successfully to a Delta table.

How to reproduce it:

Cargo.toml

[package]
name = "decimal_issue"
version = "0.1.0"
edition = "2021"

[dependencies]
deltalake = "0.15.0"
serde = "1"
serde_json = { version = "1", features = ["arbitrary_precision"] } # remove feature to see original behavior
tokio = "1.33.0"

src/main.rs

use deltalake::{operations::create::CreateBuilder, writer::{JsonWriter, DeltaWriter}};

#[tokio::main]
async fn main() {
    // Parse the input rows; the column holds a 16-decimal-place value.
    let data = serde_json::from_str::<Vec<serde_json::Value>>(
        r#"[{"decimal_col": 0.0000039178294781}]"#,
    )
    .unwrap();
    // Create an in-memory table with a single nullable decimal(38,16) column.
    let table = CreateBuilder::new()
        .with_location("memory://")
        .with_column(
            "decimal_col",
            deltalake::SchemaDataType::primitive("decimal(38,16)".to_string()),
            true,
            None,
        )
        .await
        .unwrap();
    let mut writer = JsonWriter::for_table(&table).unwrap();

    match writer.write(data).await {
        Ok(_) => {},
        Err(err) => {
            eprintln!("{}", err);
            std::process::exit(1);
        }
    }
}

cargo run

More details:
Lower precision decimals (5 or less) do not have this issue.

Using arrow_json to parse the value into a RecordBatch and then writing it with RecordBatchWriter, instead of using serde_json with JsonWriter, works. The problem is that other Delta log interactions, such as create_checkpoint, use serde_json behind the scenes, so the same issue occurs when the stats are read from the logs to be written to Parquet checkpoints.

@ryanaston ryanaston added the bug Something isn't working label Oct 26, 2023
@ryanaston
Author

Update:

There are several concerns going on here. First, there are shortcomings in arrow causing issues with arbitrary_precision and scientific notation. I have opened two feature requests in the arrow-rs project to address these:

  1. Decimal enhancements in arrow-cast apache/arrow-rs#5068
  2. Support for serde_json arbitrary_precision in arrow-json TapeSerializer apache/arrow-rs#5069

Second, delta-rs is using f64 as a stand-in for decimals, causing precision loss. I know Rust does not have a native decimal type, but this seems like a big oversight. For now I've added the BigDecimal crate to a fork of this library. If this seems like the right direction for delta-rs broadly I'm happy to clean it up and submit a PR to this repo.
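To illustrate the precision-loss point, the sketch below compares a round trip through f64 with parsing the same text into an unscaled 128-bit integer plus a fixed scale, which is the general representation a 128-bit decimal type uses. `parse_decimal` is a hypothetical helper written only with the standard library; it is not the BigDecimal-based change in the fork and not a delta-rs API.

```rust
/// Hypothetical helper: parse a plain decimal string into an unscaled i128
/// at the given scale (number of digits after the decimal point).
/// Returns None on invalid input, on overflow, or if rounding would be needed.
fn parse_decimal(s: &str, scale: u32) -> Option<i128> {
    let (negative, body) = match s.strip_prefix('-') {
        Some(rest) => (true, rest),
        None => (false, s),
    };
    let (int_part, frac_part) = body.split_once('.').unwrap_or((body, ""));
    if int_part.is_empty() && frac_part.is_empty() {
        return None;
    }
    if frac_part.len() > scale as usize {
        return None; // more fractional digits than the scale allows
    }
    let mut value: i128 = 0;
    for c in int_part.chars().chain(frac_part.chars()) {
        let digit = c.to_digit(10)? as i128;
        value = value.checked_mul(10)?.checked_add(digit)?;
    }
    // Pad with zeros so the result is expressed at exactly `scale` digits.
    for _ in frac_part.len()..scale as usize {
        value = value.checked_mul(10)?;
    }
    Some(if negative { -value } else { value })
}

fn main() {
    // 36 significant digits: far more than an f64 can represent exactly.
    let text = "12345678901234567890.1234567890123456";

    // Going through f64 does not reproduce the original text.
    let as_f64: f64 = text.parse().unwrap();
    println!("via f64:  {as_f64}");

    // The unscaled integer keeps every digit.
    let unscaled = parse_decimal(text, 16).unwrap();
    println!("unscaled: {unscaled} (scale 16)");

    // The value from this report at decimal(38,16).
    println!("{:?}", parse_decimal("0.0000039178294781", 16));
}
```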

@roeap
Collaborator

roeap commented Jan 28, 2024

@ryanaston - of course we are always happy about PRs.

In this case the challenge may be that we need to stay true to the Delta protocol, which supports a precision / scale of at most 38.

However there may be a bug regarding writing decimal values through the json writer anyways. Does this error only apply to high-precision decimals, or decimals in general?
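As a rough illustration of the precision limit of 38 mentioned above: every value with at most 38 decimal digits fits in a signed 128-bit integer, so a range check on the unscaled value is enough to tell whether a number fits a given precision. `fits_precision` is a hypothetical helper for this sketch, not a delta-rs or arrow function.

```rust
/// Hypothetical helper: true if `unscaled` has at most `precision` decimal
/// digits and `precision` does not exceed the limit of 38.
fn fits_precision(unscaled: i128, precision: u32) -> bool {
    // 10^38 still fits in a u128, so the power below cannot overflow.
    precision <= 38 && unscaled.unsigned_abs() < 10u128.pow(precision)
}

fn main() {
    // 0.0000039178294781 at scale 16 has the unscaled value 39178294781.
    println!("{}", fits_precision(39_178_294_781, 38));
    // 10^38 has 39 digits, one more than the limit allows.
    println!("{}", fits_precision(10i128.pow(38), 38));
}
```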
