Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Colon (:) in in object_store::path::{Path} is not handled on Windows #5592

Closed
thomasfrederikhoeck opened this issue Apr 4, 2024 · 10 comments · Fixed by #5830
Closed

Colon (:) in in object_store::path::{Path} is not handled on Windows #5592

thomasfrederikhoeck opened this issue Apr 4, 2024 · 10 comments · Fixed by #5830
Labels
enhancement Any new improvement worthy of a entry in the changelog good first issue Good for newcomers help wanted

Comments

@thomasfrederikhoeck
Copy link

Describe the bug
When creating a Path on Windows colon (:) is not sanitized like < and |.

To Reproduce

use object_store::path::{Path};

fn main() {
    let path: String = r"C:\table\time=2021-01-02 03:04:06.000003\file.parquet".to_string();
    // let path: Result<Path, object_store::path::Error> = Path::parse(path);
    let path_from = Path::from(path);
    println!( "{:?}", path_from);

    let path: String = r"C:\table\time=2021-01-02 03:04:06.000003\file.parquet".to_string();
    let path_parse: Result<Path, object_store::path::Error> = Path::parse(path);
    println!( "{:?}", path_parse);

    let path: String = r"C:\table\time=2021-01-02 03:04:06.000003\<file|.parquet".to_string();
    // let path: Result<Path, object_store::path::Error> = Path::parse(path);
    let path_from = Path::from(path);
    println!( "{:?}", path_from);

    let path: String = r"C:\table\time=2021-01-02 03:04:06.000003\<file|.parquet".to_string();
    let path_parse: Result<Path, object_store::path::Error> = Path::parse(path);
    println!( "{:?}", path_parse);
}
Path { raw: "C:%5Ctable%5Ctime=2021-01-02 03:04:06.000003%5Cfile.parquet" }
Ok(Path { raw: "C:\\table\\time=2021-01-02 03:04:06.000003\\file.parquet" })
Path { raw: "C:%5Ctable%5Ctime=2021-01-02 03:04:06.000003%5C%3Cfile%7C.parquet" }
Ok(Path { raw: "C:\\table\\time=2021-01-02 03:04:06.000003\\<file|.parquet" })

See SO for forbidden chars https://stackoverflow.com/questions/1976007/what-characters-are-forbidden-in-windows-and-linux-directory-names
Expected behavior
I would have expected the output to be

Path { raw: "C:%5Ctable%5Ctime=2021-01-02 03%3A04%3A06.000003%5Cfile.parquet" }
Ok(Path { raw: "C:\\table\\time=2021-01-02 03:04:06.000003\\file.parquet" })
Path { raw: "C:%5Ctable%5Ctime=2021-01-02 03%3A04%3A06.000003%5C%3Cfile%7C.parquet" }
Ok(Path { raw: "C:\\table\\time=2021-01-02 03:04:06.000003\\<file|.parquet" })

Additional context
This came up here delta-io/delta-rs#2382

@thomasfrederikhoeck thomasfrederikhoeck changed the title Colon (:) in in object_store::path::{Path} is not handled in Windows Colon (:) in in object_store::path::{Path} is not handled on Windows Apr 4, 2024
@tustvold tustvold added enhancement Any new improvement worthy of a entry in the changelog and removed bug labels Apr 5, 2024
@tustvold
Copy link
Contributor

tustvold commented Apr 5, 2024

This would be a breaking change but probably makes sense on balance

@thomasfrederikhoeck
Copy link
Author

@tustvold I guess it would be breaking but would anyone rely on a path that can't exists? Or does this sanitizing happen for the same set of chars across all platforms?

@tustvold
Copy link
Contributor

tustvold commented Apr 5, 2024

It is the same set of chars across all platforms, although I suppose we could just add some special logic for windows within LocalFilesystem...

@thomasfrederikhoeck
Copy link
Author

thomasfrederikhoeck commented Apr 8, 2024

@tustvold if you think that solution is good I can give it a go. Might be a little out of my leauge.

@tustvold
Copy link
Contributor

tustvold commented Apr 8, 2024

I think probably we should add specific handling within LocalFilesystem on Windows. Just because Windows filesystems are a complete mess, I don't think that shouldn't leak out. This would also avoid a breaking change

Perhaps we could explicitly escape them or something on Windows platforms? Is there standard way the windows ecosystem works around this limitation?

@thomasfrederikhoeck
Copy link
Author

@tustvold I did a lot of googling and from what I can tell there is no default way. I think it is a good route to go to just explicitly escape them. Maybe just starting with : since I expect it to be much more heavily used for data given it is used for timestamps compared to the other chars.

@hesampakdaman
Copy link
Contributor

It is the same set of chars across all platforms, although I suppose we could just add some special logic for windows within LocalFilesystem...

I was going to suggest to have several platform specific INVALID consts 😂

modified   object_store/src/path/parts.rs
@@ -73,6 +73,7 @@ impl<'a> PathPart<'a> {
 }
 
 /// Characters we want to encode.
+#[cfg(target_family="windows")]
 const INVALID: &AsciiSet = &CONTROLS
     // The delimiter we are reserving for internal hierarchy
     .add(DELIMITER_BYTE)
@@ -97,8 +98,38 @@ const INVALID: &AsciiSet = &CONTROLS
     .add(b'\r')
     .add(b'\n')
     .add(b'*')
+    .add(b':')
     .add(b'?');

I think probably we should add specific handling within LocalFilesystem on Windows. Just because Windows filesystems are a complete mess, I don't think that shouldn't leak out. This would also avoid a breaking change

I didn't have a thorough look but does it suffice to have this special logic in path_to_filesystem?

    /// Return an absolute filesystem path of the given file location
    pub fn path_to_filesystem(&self, location: &Path) -> Result<PathBuf> {
        ensure!(
            is_valid_file_path(location),
            InvalidPathSnafu {
                path: location.as_ref()
            }
        );
        self.config.prefix_to_filesystem(location)
    }

@tustvold
Copy link
Contributor

It would be a breaking change, not to mention extremely disruptive, to add : to the Invalid set, as this set applies to all paths regardless of store. Given Windows LocalFilesystem is the only store with this problem, and colons are very common, I think the least disruptive change would be to add some fallback logic in LocalFilesystem on Windows that escapes them, potentially disabled by default.

@hesampakdaman
Copy link
Contributor

Thanks @tustvold! I'll gladly help out but might need some guidance to drive it home. I've created a draft, appreciate if you could take a look to see if I'm on the right track :)

@thomasfrederikhoeck
Copy link
Author

I agree with @tustvold that it should windows-only but I think it should be true be default given that it can never be a valid path on windows.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog good first issue Good for newcomers help wanted
Projects
None yet
3 participants