Rust engine doesn't correctly serialize path for partitions on timestamp on Windows #2382

Closed
thomasfrederikhoeck opened this issue Apr 4, 2024 · 5 comments · Fixed by #2994
Labels
bug Something isn't working

Comments

@thomasfrederikhoeck
Contributor

thomasfrederikhoeck commented Apr 4, 2024

Environment

Delta-rs version: main

Binding: python

Environment:

  • Cloud provider:
  • OS: Windows
  • Other:

Bug

What happened:
When using the rust engine, timestamps are serialized with a colon (:) in the file path. This does not work on Windows:
OSError: Generic LocalFileSystem error: Unable to open file C:\projects\delta-rs\mytable\time=2021-01-02 03:04:06.000003\part-00001-2be14fa0-e4f4-4fc0-bf61-6779b08cf550-c000.snappy.parquet#1: The filename, directory name, or volume label syntax is incorrect. (os error 123)

What you expected to happen:

That time was serialized like in pyarrow: time=2021-01-01%2003%3A04%3A06.000003
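
For reference, the pyarrow-style directory name above matches plain percent-encoding of the partition value. A minimal sketch using only the Python standard library; `partition_dir` is a hypothetical helper for illustration, not part of deltalake or pyarrow:

```python
from urllib.parse import quote


def partition_dir(column: str, value: str) -> str:
    """Build a hive-style `column=value` directory name (hypothetical helper).

    Every character outside letters, digits and `_.-~` is percent-encoded,
    so the space and the colons in a timestamp become %20 and %3A.
    """
    return f"{column}={quote(value, safe='')}"


print(partition_dir("time", "2021-01-01 03:04:06.000003"))
# time=2021-01-01%2003%3A04%3A06.000003
```

The resulting name contains no colon, so it is a valid directory name on Windows.
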
How to reproduce it:
Run the following on Windows:

from datetime import datetime

import pandas as pd
from deltalake import write_deltalake

dates = pd.date_range(datetime(2021,1,1,3,4,6,3),datetime(2021,1,3,3,4,6))
df = pd.DataFrame({"time":dates, "a":[i for i in range(len(dates))]})
write_deltalake("mytable",df, partition_by="time", mode="overwrite",engine="rust")

More details:

@ion-elgreco
Collaborator

In Python we use the FileSystemHandler from src/filesystem.rs, which always normalizes the path:

    fn normalize_path(&self, path: String) -> PyResult<String> {
        let suffix = if path.ends_with('/') { "/" } else { "" };
        let path = Path::parse(path).unwrap();
        Ok(format!("{path}{suffix}"))
    }

Encode
In theory object stores support any UTF-8 character sequence, however, certain character sequences cause compatibility problems with some applications and protocols. Additionally some filesystems may impose character restrictions, see [LocalFileSystem]. As such the naming guidelines for [S3], [GCS] and [Azure Blob Storage] all recommend sticking to a limited character subset.

A string containing potentially problematic path segments can therefore be encoded to a [Path] using [Path::from] or [Path::from_iter]. This will percent encode any problematic segments according to [RFC 1738].

@thomasfrederikhoeck
Contributor Author

thomasfrederikhoeck commented Apr 4, 2024

Isn't that quote saying that Path::from should be used? Just tried - didn't work. This doesn't mention colon and Windows so I don't know if it is missing in object store? https://docs.rs/object_store/latest/object_store/local/struct.LocalFileSystem.html#path-semantics-1

Windows forbids certain ASCII characters, e.g. < or |
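
To make the Windows restriction concrete: besides `<` and `|`, Windows also rejects `>`, `:`, `"`, `?` and `*` in file and directory names, plus the separators `/` and `\`. A rough sketch that flags them in a single path segment; `windows_forbidden_chars` is a made-up helper name, and other rules (control characters, reserved device names such as CON, trailing dots) are not covered:

```python
# Printable characters Windows does not allow inside a file or directory name.
WINDOWS_FORBIDDEN = set('<>:"/\\|?*')


def windows_forbidden_chars(segment: str) -> list[str]:
    """Return the forbidden characters found in one path segment, sorted."""
    return sorted(set(segment) & WINDOWS_FORBIDDEN)


# Partition directory as written by the rust engine in this issue.
print(windows_forbidden_chars("time=2021-01-02 03:04:06.000003"))  # [':']
# The same value percent-encoded.
print(windows_forbidden_chars("time=2021-01-02%2003%3A04%3A06.000003"))  # []
```

The check is meant for one segment at a time, since the colon after a drive letter (`C:`) is the one place where Windows does accept it.
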

@thomasfrederikhoeck
Contributor Author

@ion-elgreco I just tried out some different examples. It looks like Path::from handles illegal Windows characters such as < and | while Path::parse doesn't. But it appears that : is not handled in either case.

use object_store::path::Path;

fn main() {
    // The same timestamp partition, without and with the characters < and |.
    let inputs = [
        r"C:\table\time=2021-01-02 03:04:06.000003\file.parquet",
        r"C:\table\time=2021-01-02 03:04:06.000003\<file|.parquet",
    ];

    for input in inputs {
        // Path::from percent-encodes the characters it considers problematic.
        println!("{:?}", Path::from(input));
        // Path::parse returns a Result and keeps an accepted string as-is.
        println!("{:?}", Path::parse(input));
    }
}

Output:

Path { raw: "C:%5Ctable%5Ctime=2021-01-02 03:04:06.000003%5Cfile.parquet" }
Ok(Path { raw: "C:\\table\\time=2021-01-02 03:04:06.000003\\file.parquet" })
Path { raw: "C:%5Ctable%5Ctime=2021-01-02 03:04:06.000003%5C%3Cfile%7C.parquet" }
Ok(Path { raw: "C:\\table\\time=2021-01-02 03:04:06.000003\\<file|.parquet" })

@thomasfrederikhoeck
Contributor Author

Closed upstream in apache/arrow-rs#5830, so once a new object store is released and used here it should be fixed ✌🏻

@thomasfrederikhoeck
Contributor Author

thomasfrederikhoeck commented Nov 14, 2024

Just ran this on Windows using main and it works:

import pandas as pd
from deltalake import write_deltalake
from datetime import datetime

dates = pd.date_range(datetime(2021,1,1,3,4,6,3),datetime(2021,1,3,3,4,6))
df = pd.DataFrame({"time":dates, "a":[i for i in range(len(dates))]})

# Write with different engines

write_deltalake("mytable",df, partition_by="time", mode="overwrite",engine="pyarrow")

write_deltalake("mytable",df, partition_by="time", mode="overwrite",engine="rust")

and the paths serialize correctly:

image
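
If anyone wants to double-check without a screenshot, the directory names can be decoded back into timestamps. A small sketch with the standard library only; `parse_partition_dir` is a hypothetical helper, and the timestamp format is assumed from the example in this issue:

```python
from datetime import datetime
from urllib.parse import unquote


def parse_partition_dir(name: str) -> tuple[str, datetime]:
    """Split a `column=value` directory name and decode the percent-encoded value."""
    column, _, raw = name.partition("=")
    # Assumed format: "YYYY-MM-DD HH:MM:SS.ffffff", as in the paths above.
    return column, datetime.strptime(unquote(raw), "%Y-%m-%d %H:%M:%S.%f")


column, ts = parse_partition_dir("time=2021-01-01%2003%3A04%3A06.000003")
print(column, ts)  # time 2021-01-01 03:04:06.000003
```
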
