-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Better python creation, open, and storage API [EAR-1316] #91
Conversation
EAR-1316 Create ergonomic configuration for icechunk python
Currently we send configuration via a dictionary. We should create a type safe rust workflow for configuring datasets and storage from python |
use pyo3::{prelude::*, types::PyType}; | ||
|
||
#[pyclass(name = "Storage")] | ||
pub enum PyStorage { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have to basically reimplement this from the rust Zarr config to expose it to python, but its typesafe so any updates to one will have to happen to the other
|
||
@pytest.mark.xfail(reason="Not implemented") | ||
def test_store_repr(self, store: IcechunkStore) -> None: | ||
super().test_store_repr(store) | ||
|
||
async def test_not_writable_store_raises(self, store_kwargs: dict[str, Any]) -> None: | ||
create_kwargs = {**store_kwargs, "mode": "r"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Having trouble creating then reopening in this fixture, will come back later, not very important
icechunk-python/src/lib.rs
Outdated
} else { | ||
icechunk::zarr::AccessMode::ReadWrite | ||
}; | ||
let cached_storage = storage.into_cached(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
python stores will always be cached for now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
} | ||
|
||
#[pyfunction] | ||
fn pyicechunk_store_create<'py>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these cant be class methods because of async for now, not sure why, but this works just fine
if access_mode.overwrite: | ||
store = await cls.create(storage, mode, *args, **kwargs) | ||
elif access_mode.create or access_mode.update: | ||
try: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a public function Dataset::exists(storage: &(dyn Storage + Send + Sync))
that could be helpful here.
raise ValueError("Storage configuration is required. Pass a Storage object to construct an IcechunkStore") | ||
|
||
if access_mode.overwrite: | ||
store = await cls.create(storage, mode, *args, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel we shouldn't support overwrite? Icechunk stores are "too precious" maybe? This will fail anyway, but maybe, we shouldn't even let you create in overwrite mode?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My thought is that in overwrite mode, it tries to create, and if it fails they get the error message. If it doesn't exist people may call with mode=w and be confused it doesn't work
class PyAsyncStringGenerator(AsyncGenerator[str | None]): | ||
def __aiter__(self) -> PyAsyncStringGenerator: ... | ||
async def __anext__(self) -> str | None: ... | ||
|
||
|
||
class Storage: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
love it
icechunk-python/src/lib.rs
Outdated
} else { | ||
icechunk::zarr::AccessMode::ReadWrite | ||
}; | ||
let cached_storage = storage.into_cached(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
StorageConfig::Cached { | ||
approx_max_memory_bytes: 1_000_000, | ||
backend: Box::new(self), | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's change to use the new Dataset::add_in_mem_asset_caching
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes!
|
||
pub fn with_unsafe_overwrite_refs(mut self, unsafe_overwrite_refs: bool) -> Self { | ||
self.unsafe_overwrite_refs = Some(unsafe_overwrite_refs); | ||
self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder ... shouldn't we just expose the DatasetBuilder to python instead? Is it worth having the extra config classes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Python really just needs the storage config, which is not part of the dataset builder currently so I'm not sure that makes sense right now
} | ||
} | ||
|
||
pub fn make_cached_storage(&self) -> Result<Arc<dyn Storage + Send + Sync>, String> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed cached as a storage config, and now you just either create cached storage or not. Lemme know what you think @paraseba
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
works for me
* main: Bump the rust-dependencies group with 2 updates Better python creation, open, and storage API [EAR-1316] (#91) Add test checking only 1 concurrent commit can succeed Make local filesystem Storage implementation work Cache assets independently and by count, not by mem Disallow empty commits Delete dbg! call Expand S3 config, some nicer python init methods [EAR-1314][EAR-1311] (#84) Wrap empty and get methods in future_into_py add README
Adds a new typesafe storage API to the python bindings. This allows users to use a rust enum from python to create their storage configuration.
This also makes use of the builtin
Zarr Store.open()
function to make the correct calls toIcechunkStore.create
andIcechunkStore.open
.In draft because it should be preceeded by #88