What happened:
Writes began failing when attempting to insert high-precision decimal values into a Delta table using the JsonWriter with a Vec<serde_json::Value>. It turned out that serde_json (by default) parses these values into f64 and re-serializes them in scientific notation, which cannot be parsed into the Arrow DecimalType:
Generic DeltaTable error: Failed to convert into Arrow schema: Parser error: can't parse the string value 3.9178294781e-6 to decimal
Some digging uncovered the serde_json feature flag "arbitrary_precision", which retains the value in its full form, stored as a string; however, this too cannot be decoded to an Arrow DecimalType:
Generic DeltaTable error: Failed to convert into Arrow schema: Json error: whilst decoding field 'decimal_col': expected decimal got {"$serde_json::private::Number": "0.0000039178294781"}
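To make both behaviors concrete, here is a minimal standalone sketch (a hypothetical demo.rs, separate from the repro below) that round-trips the value through serde_json:

fn main() {
    // Without the "arbitrary_precision" feature, serde_json parses the
    // number into an f64 and re-serializes it in scientific notation,
    // the exact string the Arrow parser rejects.
    let v: serde_json::Value =
        serde_json::from_str("0.0000039178294781").unwrap();
    println!("{}", serde_json::to_string(&v).unwrap()); // prints 3.9178294781e-6

    // With "arbitrary_precision" enabled, the same program prints the
    // literal text back ("0.0000039178294781"), but the number is held in
    // serde_json's internal Number struct, which is what the Arrow JSON
    // decoder reports as {"$serde_json::private::Number": ...}.
}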
What you expected to happen:
High-precision decimal values should be written accurately and successfully to a Delta table.
How to reproduce it:
Cargo.toml
[package]
name = "decimal_issue"
version = "0.1.0"
edition = "2021"

[dependencies]
deltalake = "0.15.0"
serde = "1"
serde_json = { version = "1", features = ["arbitrary_precision"] } # remove this feature to see the original behavior
tokio = { version = "1.33.0", features = ["macros", "rt-multi-thread"] } # needed for #[tokio::main]
src/main.rs
use deltalake::{operations::create::CreateBuilder, writer::{DeltaWriter, JsonWriter}};

#[tokio::main]
async fn main() {
    let data = serde_json::from_str::<Vec<serde_json::Value>>(
        r#"[{"decimal_col": 0.0000039178294781}]"#,
    )
    .unwrap();
    let table = CreateBuilder::new()
        .with_location("memory://")
        .with_column(
            "decimal_col",
            deltalake::SchemaDataType::primitive("decimal(38,16)".to_string()),
            true,
            None,
        )
        .await
        .unwrap();
    let mut writer = JsonWriter::for_table(&table).unwrap();
    match writer.write(data).await {
        Ok(_) => {}
        Err(err) => {
            eprintln!("{}", err);
            std::process::exit(1);
        }
    }
}
cargo run
More details:
Lower-precision decimals (5 or less) do not have this issue.
Using arrow_json to parse the value into a RecordBatch and then writing with RecordBatchWriter (instead of serde_json with JsonWriter) works. However, other Delta log interactions such as create_checkpoint use serde_json behind the scenes, so when the stats are read from the logs to be written to Parquet checkpoints, the same issue occurs.
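For reference, here is a sketch of that workaround (illustrative only: it assumes deltalake's re-exported arrow, the same decimal(38,16) column as the repro, and the helper name is made up):

use std::sync::Arc;
use deltalake::arrow::datatypes::{DataType, Field, Schema};
use deltalake::arrow::json::ReaderBuilder;
use deltalake::writer::{DeltaWriter, RecordBatchWriter};

// arrow_json parses the decimal straight from the raw JSON text, so the
// value never round-trips through an f64.
async fn write_via_record_batch(
    table: &mut deltalake::DeltaTable,
) -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new(
        "decimal_col",
        DataType::Decimal128(38, 16),
        true,
    )]));
    let json = r#"{"decimal_col": 0.0000039178294781}"#;
    let mut reader = ReaderBuilder::new(schema).build(std::io::Cursor::new(json))?;
    let batch = reader.next().expect("one batch")?;

    let mut writer = RecordBatchWriter::for_table(table)?;
    writer.write(batch).await?;
    writer.flush_and_commit(table).await?;
    Ok(())
}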
There are several concerns here. First, there are shortcomings in arrow that cause the issues with arbitrary_precision and scientific notation. I have opened two feature requests in the arrow-rs project to address these:
Second, delta-rs is using f64 as a stand-in for decimals, which causes precision loss. I know Rust does not have a native decimal type, but this seems like a significant oversight. For now I've added the bigdecimal crate to a fork of this library. If this seems like the right direction for delta-rs broadly, I'm happy to clean it up and submit a PR to this repo. A sketch of the idea follows.
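Illustrative only, not the fork's actual code (it uses the bigdecimal crate and its re-exported ToPrimitive trait): parse the raw text losslessly, rescale to the column's scale, and take the i128 mantissa that a Decimal128 array stores.

use std::str::FromStr;
use bigdecimal::{BigDecimal, ToPrimitive};

// Convert raw JSON text (scientific notation included) into the i128
// mantissa for a Decimal128 column, with no f64 round-trip.
fn to_decimal128(raw: &str, scale: i64) -> Option<i128> {
    let value = BigDecimal::from_str(raw).ok()?; // exact decimal parse
    let (mantissa, _) = value.with_scale(scale).into_bigint_and_exponent();
    mantissa.to_i128() // fits in i128 for precision <= 38
}

fn main() {
    // 0.0000039178294781 at scale 16 has mantissa 39178294781
    assert_eq!(to_decimal128("3.9178294781e-6", 16), Some(39_178_294_781));
}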
However, there may be a bug in writing decimal values through the JSON writer regardless. Does this error apply only to high-precision decimals, or to decimals in general?
Environment
Delta-rs version: 0.15.0 (also tried 0.16.1)
Binding: Rust