Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: basic fsspec writing #1016

Merged
merged 78 commits into from
Nov 15, 2023
Merged

feat: basic fsspec writing #1016

merged 78 commits into from
Nov 15, 2023

Conversation

lobis
Copy link
Collaborator

@lobis lobis commented Nov 6, 2023

This PR adds writing of root files via fsspec which allows writing files to some interesting backends such as ssh, memory, s3, etc. Some are not yet supported such as http(s). In principle all the fsspec backends that support "wb" mode should work fine.

If the backend requires some special option, it can be passed in the uproot.recreate call such as uproot.recreate("ssh:///tmp/file.root", host="localhost", user="user").

This is a basic implementation. File is writen on close.

The implementation is as follows:

  • If the path to write is a file-like object, behaviour is as before.
  • If it is a string, it will check if it has some scheme. If it doesn't behavour is as before.
  • If it does have some scheme, it will use fsspec to create a file-like object and produce a sink from it, in a similar way to passing a file-like object to the uproot.recreate method. The sink will hold a reference to the object and is responsible for closing it.

Minor features:

  • When writing to a local path, the parent directories of the file will be created if they don't exist instead of throwing an exception.

As mentioned this is a basic writer using fsspec. When PR becomes a requirement I will refactor the sink class and the surrounding code to make this cleaner.

@lobis lobis marked this pull request as ready for review November 14, 2023 00:53
@lobis lobis requested review from jpivarski and nsmith- November 14, 2023 00:53
Copy link
Member

@nsmith- nsmith- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we know uproot will rewrite many parts of the file as it progresses, I wonder if it makes sense to universally append a simplecache:: to any protocol that isn't already a local one?

tests/test_0692_fsspec_writing.py Show resolved Hide resolved
@lobis lobis requested a review from nsmith- November 14, 2023 22:17
Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It all looks good! I see that it's tested for local files, in-memory files, and ssh, which are three cases that make sense (depending on how remote the ssh server is).

I'll need to know more about what a "chain of filesystems" would do. Is it something we'd even be able to use, given our API?

I think it's ready to merge. I don't know why the one case below isn't raising an error, and it's an odd way to make a note-to-self about what to do next.

src/uproot/writing/writable.py Show resolved Hide resolved
tests/test_0692_fsspec_writing.py Show resolved Hide resolved
@lobis
Copy link
Collaborator Author

lobis commented Nov 15, 2023

I'll need to know more about what a "chain of filesystems" would do. Is it something we'd even be able to use, given our API?

I guess it has some niche uses, for example, opening a root file inside a zip file available via http. It should also work for writing. But in general I think only the local caching for writing will be used "frequently".

@lobis lobis merged commit 28b0f7b into main Nov 15, 2023
@lobis lobis deleted the fsspec-writing branch November 15, 2023 20:44
@matthewfeickert
Copy link
Member

matthewfeickert commented Nov 16, 2023

👋 I just wanted to raise that this PR breaks pyhf's nightly HEAD of dependency tests. For example, installing uproot from 28b0f7b causes pyhf's test_export_root_histogram test in tests/test_export.py to fail. Here's the relevant bit

def test_export_root_histogram(mocker, tmp_path):
    """
    Test that pyhf.writexml._export_root_histogram writes out a histogram
    in the manner that uproot is expecting and verifies this by reading
    the serialized file
    """
    mocker.patch("pyhf.writexml._ROOT_DATA_FILE", {})
    pyhf.writexml._export_root_histogram("hist", [0, 1, 2, 3, 4, 5, 6, 7, 8])

    with uproot.recreate(tmp_path.joinpath("test_export_root_histogram.root")) as file:
        file["hist"] = pyhf.writexml._ROOT_DATA_FILE["hist"]

which then fails with a TypeError and the following traceback

Collapsed for space:
========================================================================================== FAILURES ==========================================================================================
_________________________________________________________________________________ test_export_root_histogram _________________________________________________________________________________

mocker = <pytest_mock.plugin.MockerFixture object at 0x7ff50e8d4cd0>, tmp_path = PosixPath('/tmp/pytest-of-feickert/pytest-11/test_export_root_histogram0')

    def test_export_root_histogram(mocker, tmp_path):
        """
        Test that pyhf.writexml._export_root_histogram writes out a histogram
        in the manner that uproot is expecting and verifies this by reading
        the serialized file
        """
        mocker.patch("pyhf.writexml._ROOT_DATA_FILE", {})
        pyhf.writexml._export_root_histogram("hist", [0, 1, 2, 3, 4, 5, 6, 7, 8])
    
>       with uproot.recreate(tmp_path.joinpath("test_export_root_histogram.root")) as file:

mocker     = <pytest_mock.plugin.MockerFixture object at 0x7ff50e8d4cd0>
tmp_path   = PosixPath('/tmp/pytest-of-feickert/pytest-11/test_export_root_histogram0')

tests/test_export.py:420: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../.pyenv/versions/pyhf-dev-cpu/lib/python3.11/site-packages/uproot/writing/writable.py:148: in recreate
    sink = _sink_from_path(file_path, **storage_options)
        file_path  = PosixPath('/tmp/pytest-of-feickert/pytest-11/test_export_root_histogram0/test_export_root_histogram.root')
        options    = {}
        storage_options = {}
../../../.pyenv/versions/pyhf-dev-cpu/lib/python3.11/site-packages/uproot/writing/writable.py:80: in _sink_from_path
    return uproot.sink.file.FileSink.from_object(file_path_or_object)
        file_path_or_object = PosixPath('/tmp/pytest-of-feickert/pytest-11/test_export_root_histogram0/test_export_root_histogram.root')
        storage_options = {}
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

cls = <class 'uproot.sink.file.FileSink'>, obj = PosixPath('/tmp/pytest-of-feickert/pytest-11/test_export_root_histogram0/test_export_root_histogram.root')

    @classmethod
    def from_object(cls, obj) -> FileSink:
        """
        Args:
            obj (file-like object): An object with ``read``, ``write``, ``seek``,
                ``tell``, and ``flush`` methods.
    
        Creates a :doc:`uproot.sink.file.FileSink` from a file-like object, such
        as ``io.BytesIO``. The object must be readable, writable, and seekable
        with ``"r+b"`` mode semantics.
        """
        if (
            callable(getattr(obj, "read", None))
            and callable(getattr(obj, "write", None))
            and callable(getattr(obj, "seek", None))
            and callable(getattr(obj, "tell", None))
            and callable(getattr(obj, "flush", None))
            and (not hasattr(obj, "readable") or obj.readable())
            and (not hasattr(obj, "writable") or obj.writable())
            and (not hasattr(obj, "seekable") or obj.seekable())
        ):
            self = cls(None)
            self._file = obj
        else:
>           raise TypeError(
                """writable file can only be created from a file path or an object
    
    * that has 'read', 'write', 'seek', and 'tell' methods
    * is 'readable() and writable() and seekable()'"""
            )
E           TypeError: writable file can only be created from a file path or an object
E           
E               * that has 'read', 'write', 'seek', and 'tell' methods
E               * is 'readable() and writable() and seekable()'

cls        = <class 'uproot.sink.file.FileSink'>
obj        = PosixPath('/tmp/pytest-of-feickert/pytest-11/test_export_root_histogram0/test_export_root_histogram.root')

../../../.pyenv/versions/pyhf-dev-cpu/lib/python3.11/site-packages/uproot/sink/file.py:55: TypeError
================================================================================== short test summary info ===================================================================================
FAILED tests/test_export.py::test_export_root_histogram - TypeError: writable file can only be created from a file path or an object

I can raise an Issue if that's helpful, but I thought I'd start here in case this is relevant to ongoing discussion.

@alexander-held
Copy link
Member

alexander-held commented Nov 16, 2023

Hi, I ran into the same thing in cabinetry, here is a minimal reproducer:

import pathlib
import uproot

uproot.recreate(pathlib.Path("f.root"))

I will open an issue for reference. -> #1029

@jpivarski
Copy link
Member

Looking at the stack trace, it must be because pyhf was relying on passing a file-like object obj that satisfies

    callable(getattr(obj, "read", None))
and callable(getattr(obj, "write", None))
and callable(getattr(obj, "seek", None))
and callable(getattr(obj, "tell", None))
and callable(getattr(obj, "flush", None))
and (not hasattr(obj, "readable") or obj.readable())
and (not hasattr(obj, "writable") or obj.writable())
and (not hasattr(obj, "seekable") or obj.seekable())

and this has somehow changed. (It wasn't supposed to.) @lobis?


I just tried a test and I don't see what's wrong with it.

import os
import numpy as np
import uproot

class FileLikeObject:
    def __init__(self):
        self.data = np.zeros(0, dtype=np.uint8)
        self.pos = 0
    def tobytes(self) -> bytes:
        return self.data.tobytes()
    def ensure(self, size: int) -> None:
        if size > len(self.data):
            data = np.zeros(size, dtype=np.uint8)
            data[:len(self.data)] = self.data
            self.data = data
    def read(self, num_bytes: int) -> bytes:
        self.pos += num_bytes
        self.ensure(self.pos)
        return self.data[self.pos - num_bytes : self.pos].tobytes()
    def write(self, data: bytes) -> None:
        self.pos += len(data)
        self.ensure(self.pos)
        self.data[self.pos - len(data) : self.pos] = np.frombuffer(data, np.uint8)
    def seek(self, offset: int, whence: int = os.SEEK_SET) -> None:
        assert whence in (os.SEEK_SET, os.SEEK_CUR, os.SEEK_END)
        if whence == os.SEEK_SET:
            self.pos = offset
        elif whence == os.SEEK_CUR:
            self.pos += offset
        elif whence == os.SEEK_END:
            self.pos = len(self.data) + offset
        self.ensure(self.pos)
    def tell(self) -> int:
        return self.pos
    def flush(self) -> None:
        pass
    # def readable(self) -> bool:
    #     return True
    # def writable(self) -> bool:
    #     return True
    # def seekable(self) -> bool:
    #     return True

file = FileLikeObject()
writable_file = uproot.recreate(file)
writable_file["hist"] = np.histogram(np.random.normal(0, 1, 1000), bins=100, range=(-5, 5))

file.tobytes()   # lots of bytes

readable_file = uproot.open(file)
readable_file["hist"].to_hist()

image

It doesn't matter whether the readable, writable, seekable methods exist or not.

@alexander-held
Copy link
Member

alexander-held commented Nov 16, 2023

Looking at the stack trace, it must be because pyhf was relying on passing a file-like object obj that satisfies

I think this happened via the tmp_path fixture in pytest creating a pathlib.Path (-like?) object (#1029).

@jpivarski
Copy link
Member

I need to read all of my email before responding to anything...

@lobis
Copy link
Collaborator Author

lobis commented Nov 16, 2023

(sorry I didn't realise there was activity here after closing the issue).

There is a check to check if the argument is a path or an object. It was doing the check poorly (checking if it's a string) and failed when a path object was being passed. Now it tests for object (with a new helper function), otherwise assumes path-like object (and transforms to string).

#1031 is ready to be merged with the fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants