Unable to interface with data written from Spark Databricks #1651
Comments
When you say encoding of the partition, do you mean the file paths or the partition values in the log? FWIW, the file paths shouldn't be consequential as long as they can be read and recognized. The partition values are taken from the log, not the directory structure.
I was referring to the file paths. I'll do more digging and see how the partitions in the logs compare and update here with more findings.
To put another way, are you actually getting an error or failure? Or are you just confused by what the file paths look like?
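As a rough illustration of the point above that partition values come from the log rather than the directory names, the sketch below lists the `partitionValues` recorded in the `add` actions of a table's JSON commit files. The function name `logged_partition_values` is hypothetical, and the sketch assumes a local table whose log has not been checkpointed; it ignores `remove` actions and Parquet checkpoints, so it is a debugging aid rather than a faithful log replay.

```python
import json
from pathlib import Path


def logged_partition_values(table_path):
    """Yield (file path, partitionValues) for each add action in the JSON commits.

    Hypothetical helper: reads only _delta_log/*.json and does not replay
    remove actions or checkpoints.
    """
    log_dir = Path(table_path) / "_delta_log"
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)  # one JSON action per line
            if "add" in action:
                add = action["add"]
                yield add["path"], add.get("partitionValues", {})
```

Calling `list(logged_partition_values("./dt"))` on a locally written table would then show each data file's path next to the partition values stored for it, which makes it easy to compare the two.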
I still seem to be running into issues reading from delta tables partitioned by datetime:
# write_table.py
import datetime
from deltalake import write_deltalake
import pyarrow as pa
data = pa.table({
    "id": pa.array([425], type=pa.int32()),
    "data": pa.array(["python-module-test-write"]),
    "t": pa.array([datetime.datetime(2023, 9, 15)]),
})
write_deltalake(
    table_or_uri="./dt",
    mode="append",
    data=data,
    partition_by=["t"],
)
# read_table.py
from deltalake import DeltaTable
dt = DeltaTable(table_uri="./dt")
dataset = dt.to_pyarrow_dataset()
print(dataset.count_rows())
> python read_table.py
Traceback (most recent call last):
File "/Users/crathbone/offline-spark/simple/read_table.py", line 4, in <module>
dataset = dt.to_pyarrow_dataset()
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/site-packages/deltalake/table.py", line 540, in to_pyarrow_dataset
for file, part_expression in self._table.dataset_partitions(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/scalar.pxi", line 88, in pyarrow.lib.Scalar.cast
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: error parsing '2023-09-15%2000%3A00%3A00.000000' as scalar of type timestamp[us]
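The string in the error message is still percent-encoded (`%20` for the space, `%3A` for each colon). As a minimal sketch using only the standard library, and not a description of how deltalake handles this internally, decoding it first yields a value that parses as a timestamp:

```python
from datetime import datetime
from urllib.parse import unquote

# Value copied from the ArrowInvalid message in the traceback.
raw = "2023-09-15%2000%3A00%3A00.000000"

# Percent-decode, then parse with an explicit format string.
decoded = unquote(raw)
parsed = datetime.strptime(decoded, "%Y-%m-%d %H:%M:%S.%f")

print(decoded)
print(parsed)
```

This suggests the failure comes from the encoded form reaching the timestamp cast, not from the underlying value being invalid.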
Environment
Delta-rs version: 0.10.2
Binding: python / rust
Environment:
Bug
What happened:
When attempting to interface with Databricks, we get inconsistent encoding of the partition value in file paths, which prevents us from interfacing across clients.
After the fix from #1613 we're getting closer, but the output is still not consistent with Databricks.
Python
When writing from Python, the partitions are formatted as:
partition_date=2023-09-15%2000%3A00%3A00.000000
Rust
When writing from Rust, we see it as:
partition_date=2023-09-15 00:00:00
Databricks
When writing from Spark Databricks to Delta Lake we see partial encoding:
partition_date=2023-09-15 00%3A00%3A00
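To make the difference between the three layouts concrete, the sketch below reproduces each form with `urllib.parse.quote` and checks that they all decode to the same timestamp. It assumes the Databricks form percent-encodes the colons but leaves the space as-is (that is, `2023-09-15 00%3A00%3A00`); the `safe` arguments are chosen only to mimic the observed strings and are not taken from any of the writers' source code.

```python
from datetime import datetime
from urllib.parse import quote, unquote

value = "2023-09-15 00:00:00"

# Python writer: everything outside the unreserved set is encoded.
python_style = quote(value + ".000000", safe="")
# Rust writer: no encoding applied to the path segment.
rust_style = value
# Databricks (assumed): colons encoded, space kept.
databricks_style = quote(value, safe=" ")

for name, encoded in [
    ("python", python_style),
    ("rust", rust_style),
    ("databricks", databricks_style),
]:
    # All three decode to the same instant despite the different paths.
    print(name, encoded, datetime.fromisoformat(unquote(encoded)))
```

The paths differ only in how much of the value is escaped, so a reader that decodes the path segment, or ignores it in favor of the log, would see the same partition value in all three cases.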
What you expected to happen:
Would expect to have consistent encoding across platforms.
How to reproduce it:
Write to Azure using Databricks, see partition layout.
Sample Python run locally: