
Add virtual ref support EAR-1183 #85

Merged: 41 commits into main on Sep 26, 2024
Conversation

dcherian (Contributor):

No description provided.

@dcherian changed the title from "[WIP] Add virtual ref support" to "[WIP] Add virtual ref support EAR-1183" on Sep 20, 2024

linear bot commented Sep 20, 2024

EAR-1183 Virtual chunks

assert_eq!(
    ds.get_chunk(&new_array_path, &ChunkIndices(vec![0, 0, 0]), &range)
        .await?,
    Some(range.slice(bytes1.clone()))
);
dcherian (Contributor, Author) commented:

Assuming Seba did the math right ;)

match has_key {
    true => self.get_chunk_from_cached_store(&cache_key, &path, options).await,
    false => {
        let builder = match scheme {
dcherian (Contributor, Author) commented Sep 20, 2024:

object_store's parse_from_url does not pull credentials from the env by default, so I do the matching myself here.

Assuming that env credentials are fine for now...
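For illustration, here is a minimal sketch of that kind of scheme matching, assuming the object_store crate's environment-based builders; the helper name and the set of handled schemes are hypothetical, not the actual icechunk code:

```rust
use std::sync::Arc;

use object_store::aws::AmazonS3Builder;
use object_store::ObjectStore;

// Hypothetical helper: pick an object_store builder by URL scheme and let it
// read credentials from the environment, which the parse-from-URL path does
// not do on its own.
fn store_for_scheme(scheme: &str, bucket: &str) -> object_store::Result<Arc<dyn ObjectStore>> {
    match scheme {
        "s3" => {
            // from_env() picks up AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY,
            // AWS_REGION, etc. from the environment.
            let store: Arc<dyn ObjectStore> = Arc::new(
                AmazonS3Builder::from_env()
                    .with_bucket_name(bucket)
                    .build()?,
            );
            Ok(store)
        }
        other => unimplemented!("scheme {other} is not handled in this sketch"),
    }
}
```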

* main:
  Bump the rust-dependencies group with 2 updates
  Better python creation, open, and storage API [EAR-1316] (#91)
  Add test checking only 1 concurrent commit can succeed
  Make local filesystem Storage implementation work
  Cache assets independently and by count,  not by mem
  Disallow empty commits
  Delete dbg! call
  Expand S3 config, some nicer python init methods [EAR-1314][EAR-1311] (#84)
  Wrap empty and get methods in future_into_py
  add README
@dcherian changed the title from "[WIP] Add virtual ref support EAR-1183" to "Add virtual ref support EAR-1183" on Sep 23, 2024
assert_eq!(
    ds.get_chunk(&new_array_path, &ChunkIndices(vec![0, 0, 1]), &ByteRange::ALL)
        .await?,
    Some(Bytes::copy_from_slice(&bytes2[1..6])),
);
Collaborator commented:

lovely test

* main:
  Support zarr v3a5 (#71)
  test(ci): add minio to new test action EAR-1190 (#95)
  Some README edits (#96)
  Support nan/inf in FillValues
  S3 Storage overwrites refs
@dcherian marked this pull request as ready for review on September 24, 2024 at 22:26
* origin/main:
  Zarr Store does internal mutation now
  Add perf example (#98)
  Make s3 credentials more robust [EAR-1315] (#101)
)
.await
.map(|bytes| Some(ready(Ok(bytes)).boxed()))?)
}
Collaborator commented:

I think this is good at the type level but wrong semantically (which impacts performance). The idea behind this get_chunk_reader design is that we want to hold the reference to Self for as little time as possible. We don't want to hold it the whole time the bytes are being downloaded from S3; if we did, nobody could write a new reference to the dataset until the download finished, which is a completely independent operation.

So, the way it works is: this function does the minimum it needs to do while holding the ref, and then returns a future that does the actual work of downloading bytes when awaited. If you look above at the materialized case, we only resolve the chunk_ref and clone a pointer; that's enough to set up the Future we return.

You should do something similar: clone your virtual resolver and move it into a Future that you return without awaiting.
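A minimal sketch of that borrow-then-return-a-future shape, using hypothetical stand-in types rather than the real icechunk ones:

```rust
use std::future::Future;
use std::sync::Arc;

// Hypothetical stand-ins for the real types.
type Bytes = Vec<u8>;
struct VirtualChunkLocation(String);
struct VirtualResolver;

impl VirtualResolver {
    async fn fetch_chunk(&self, _location: &VirtualChunkLocation) -> Bytes {
        Vec::new() // a real resolver would read the byte range from object storage
    }
}

struct Dataset {
    virtual_resolver: Arc<VirtualResolver>,
}

impl Dataset {
    fn get_chunk_reader(
        &self,
        location: VirtualChunkLocation,
    ) -> impl Future<Output = Bytes> + 'static {
        // Only this cheap clone happens while `self` is borrowed.
        let resolver = Arc::clone(&self.virtual_resolver);
        // The download itself runs when the caller awaits the returned future,
        // long after the borrow of `self` has ended.
        async move { resolver.fetch_chunk(&location).await }
    }
}
```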

dcherian (Contributor, Author) commented:

Yes I was being dumb, sorry. Thanks for taking the time to write it out.


match Key::parse(key)? {
    Key::Metadata { .. } => Err(StoreError::NotAllowed(format!(
        "use .set to modify metadata for key {}",
Collaborator commented:

👏

@@ -511,6 +514,36 @@ impl Store {
}
}

// alternate API would take array path, and a mapping from string coord to ChunkPayload
pub async fn set_virtual_ref(
Collaborator commented:

Actually, I wonder who would use this method, because Zarr won't, right? I'm confused about this: let's take @TomNicholas's use case of inserting virtual refs. Which of these two alternatives would be easier for him?

store.set_virtual_ref("group/array/c/0/1/2", "s3://......")

# or

store.set_virtual_ref("/group/array", (0,1,2), "s3://.....")

Will virtual ref clients speak the low-level language of Zarr keys, or the higher-level language of arrays and coordinates?

Of course, we can also offer both alternatives.

TomNicholas (Contributor) commented Sep 26, 2024:

> Will virtual ref clients speak in the low level language of zarr keys?

The virtualizarr ChunkManifest objects store refs internally in numpy arrays, with the ref for each chunk accessible via indexing with the zarr chunk key. (So internally the zarr key tuple (0,1,2) is used 3 times to index into 3 numpy arrays containing the path, offset and length.)
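For a concrete picture (with made-up shapes and values, not virtualizarr's actual internals), the layout looks roughly like this:

```python
import numpy as np

# Three parallel arrays, one entry per chunk; a zarr chunk key tuple
# indexes all three to recover that chunk's reference.
paths = np.full((2, 3, 4), "s3://bucket/file.nc", dtype=object)
offsets = np.zeros((2, 3, 4), dtype=np.uint64)
lengths = np.full((2, 3, 4), 1024, dtype=np.uint64)

chunk_key = (0, 1, 2)  # the zarr chunk key as a tuple
ref = (paths[chunk_key], offsets[chunk_key], lengths[chunk_key])  # (path, offset, length)
```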

> Which of these two alternatives would be easier for him?

The second one is marginally easier, but either would be fine. It's easier because virtualizarr ManifestArray/ChunkManifest objects are unaware of their own variable name or group. The array name information is stored in the xarray.Dataset object (as the key mapping to the xr.Variable), and non-root groups haven't really been properly tried yet, but in general they would become node names in an xr.DataTree (see zarr-developers/VirtualiZarr#84).

What I think I really want to be able to do is set an entire array's worth of virtual references with one call. Otherwise I'm just going to end up looping over the array elements anyway with something like:

import numpy as np

def manifestarray_to_icechunk(group: str, arr_name: str, ma: ManifestArray, store: IcechunkStore) -> None:

    # loop over every reference in the ChunkManifest for that array
    it = np.nditer(
        [ma.manifest._paths, ma.manifest._offsets, ma.manifest._lengths],
        flags=['multi_index'],
    )
    for entry in it:
        # set each reference individually
        store.set_virtual_ref(
            f"{group}/{arr_name}",
            it.multi_index,  # your (0,1,2) tuple
            entry[0],  # filepath for this element
            entry[1],  # offset for this element
            entry[2],  # length for this element
        )

Presumably you can do that more efficiently and concurrently at the Rust level than I can at the Python level.

The only exception to that desire to write whole arrays at once might be the case of appending, discussed in #104 (comment).

dcherian (Contributor, Author) commented:

@paraseba we chatted about this today and I thought we decided that iterating in python was fine for a first pass. So my mental sketch is:

loop in Python:
- pass the reference (simple types) to the Python IcechunkStore
  - IcechunkStore sends the reference to PyIcechunkStore
    - which processes it into the right types and sends it to Dataset

> Will virtual ref clients speak in the low level language of zarr keys? or in the higher level language of arrays and coordinates?

I didn't think too hard about this, since I assume this will get deleted in favor of crunching through the references in bulk by next week hehe. As Tom points out, we should just accept a NodePath and three arrays (location, offset, length) and iterate & parse them on the Rust side.
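A rough sketch of that bulk shape; the function, the stand-in type definitions, and the return value are all hypothetical, just to show three parallel arrays being zipped into per-chunk refs on the Rust side:

```rust
// Hypothetical stand-ins for the real icechunk types.
struct NodePath(String);
struct ChunkIndices(Vec<u32>);
struct VirtualChunkRef {
    location: String,
    offset: u64,
    length: u64,
}

// Zip one coordinate list and three parallel arrays into per-chunk virtual
// refs in a single pass, ready to be written under `array_path`.
fn bulk_virtual_refs(
    _array_path: &NodePath,
    coords: Vec<ChunkIndices>,
    locations: Vec<String>,
    offsets: Vec<u64>,
    lengths: Vec<u64>,
) -> Vec<(ChunkIndices, VirtualChunkRef)> {
    coords
        .into_iter()
        .zip(locations)
        .zip(offsets)
        .zip(lengths)
        .map(|(((idx, location), offset), length)| {
            (idx, VirtualChunkRef { location, offset, length })
        })
        .collect()
}
```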

Collaborator commented:

Yes, definitely we can iterate in Rust to transfer a whole manifest of refs into icechunk. This function is only the first step.

* main:
  linter
  List operations fully realize results in memory
  Push sebas static list approach
  Update concurrency test to the latest rust api
  Sync main
  sync, test no longer runs
  Cleanup
  Use builtin tokio runtime
  Add a functional test for using a dataset with high concurrency
  Better timing and asserts
  Create a test that exercises the python store concurrently
@dcherian merged commit d1dbfa2 into main on Sep 26, 2024
3 checks passed
@dcherian deleted the virtual-refs branch on September 26, 2024 at 13:23