Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Better python creation, open, and storage API [EAR-1316] #91

Merged
merged 8 commits into from
Sep 23, 2024

Conversation

mpiannucci
Copy link
Contributor

@mpiannucci mpiannucci commented Sep 22, 2024

Adds a new typesafe storage API to the python bindings. This allows users to use a rust enum from python to create their storage configuration.

This also makes use of the builtin Zarr Store.open() function to make the correct calls to IcechunkStore.create and IcechunkStore.open.

In draft because it should be preceeded by #88

Copy link

linear bot commented Sep 22, 2024

EAR-1316 Create ergonomic configuration for icechunk python

Currently we send configuration via a dictionary. We should create a type safe rust workflow for configuring datasets and storage from python

use pyo3::{prelude::*, types::PyType};

#[pyclass(name = "Storage")]
pub enum PyStorage {
Copy link
Contributor Author

@mpiannucci mpiannucci Sep 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have to basically reimplement this from the rust Zarr config to expose it to python, but its typesafe so any updates to one will have to happen to the other


@pytest.mark.xfail(reason="Not implemented")
def test_store_repr(self, store: IcechunkStore) -> None:
super().test_store_repr(store)

async def test_not_writable_store_raises(self, store_kwargs: dict[str, Any]) -> None:
create_kwargs = {**store_kwargs, "mode": "r"}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having trouble creating then reopening in this fixture, will come back later, not very important

} else {
icechunk::zarr::AccessMode::ReadWrite
};
let cached_storage = storage.into_cached();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

python stores will always be cached for now

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

}

#[pyfunction]
fn pyicechunk_store_create<'py>(
Copy link
Contributor Author

@mpiannucci mpiannucci Sep 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these cant be class methods because of async for now, not sure why, but this works just fine

if access_mode.overwrite:
store = await cls.create(storage, mode, *args, **kwargs)
elif access_mode.create or access_mode.update:
try:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a public function Dataset::exists(storage: &(dyn Storage + Send + Sync)) that could be helpful here.

raise ValueError("Storage configuration is required. Pass a Storage object to construct an IcechunkStore")

if access_mode.overwrite:
store = await cls.create(storage, mode, *args, **kwargs)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel we shouldn't support overwrite? Icechunk stores are "too precious" maybe? This will fail anyway, but maybe, we shouldn't even let you create in overwrite mode?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought is that in overwrite mode, it tries to create, and if it fails they get the error message. If it doesn't exist people may call with mode=w and be confused it doesn't work

class PyAsyncStringGenerator(AsyncGenerator[str | None]):
def __aiter__(self) -> PyAsyncStringGenerator: ...
async def __anext__(self) -> str | None: ...


class Storage:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

love it

} else {
icechunk::zarr::AccessMode::ReadWrite
};
let cached_storage = storage.into_cached();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

StorageConfig::Cached {
approx_max_memory_bytes: 1_000_000,
backend: Box::new(self),
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's change to use the new Dataset::add_in_mem_asset_caching

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes!


pub fn with_unsafe_overwrite_refs(mut self, unsafe_overwrite_refs: bool) -> Self {
self.unsafe_overwrite_refs = Some(unsafe_overwrite_refs);
self
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder ... shouldn't we just expose the DatasetBuilder to python instead? Is it worth having the extra config classes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python really just needs the storage config, which is not part of the dataset builder currently so I'm not sure that makes sense right now

@mpiannucci mpiannucci marked this pull request as ready for review September 23, 2024 00:09
}
}

pub fn make_cached_storage(&self) -> Result<Arc<dyn Storage + Send + Sync>, String> {
Copy link
Contributor Author

@mpiannucci mpiannucci Sep 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed cached as a storage config, and now you just either create cached storage or not. Lemme know what you think @paraseba

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

works for me

@mpiannucci mpiannucci merged commit e02431e into main Sep 23, 2024
3 checks passed
@mpiannucci mpiannucci deleted the matt/better-python-config branch September 23, 2024 00:29
dcherian added a commit that referenced this pull request Sep 23, 2024
* main:
  Bump the rust-dependencies group with 2 updates
  Better python creation, open, and storage API [EAR-1316] (#91)
  Add test checking only 1 concurrent commit can succeed
  Make local filesystem Storage implementation work
  Cache assets independently and by count,  not by mem
  Disallow empty commits
  Delete dbg! call
  Expand S3 config, some nicer python init methods [EAR-1314][EAR-1311] (#84)
  Wrap empty and get methods in future_into_py
  add README
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants