zarr pointer to existing files #631

Closed
dschneiderch opened this issue Oct 9, 2020 · 18 comments

@dschneiderch

Problem description

I'm looking for a solution that uses existing image files in a more structured manner. I am exploring a couple options and would like to avoid duplicating data. In phenomics we generate time series of images and then process them using computer vision techniques. I end up with reasonably "large" datasets ~10GB - 30GB and we want to maintain the ability to view the images in the filesystem. The images are stored in a db during an experiment and then we pull them to the file system for viewing and processing.

please correct me if this is wrong, but I don't see a method to save a zarr without writing the binary chunks to the store.
zarr seems like it could naturally extend its format by including the path to the actual image (an array called image.png) in the .zarray file that lives in e.g. image.zarr. in my use case the chunk size is naturally just the image shape. If i understand zarr correctly, one of the benefits here would be the ability to use the hierarchy/grouping functionality since some images come in groups. e.g.
on day=1, I have a 300 sec timeseries with 2 images (frame) approximately every 20 secs (parameter). so it would be great to have a way to group images by parameter and then by frame and easily call them by relationship. I have this iteration of images for 10 consecutive days and for 40 different plant barcodes.

other options I'm considering are making an xarray dataset or using metadata text files to virtually group the images (inspired by GDAL VRT).

maybe this is something for spec v3 but i'm open to suggestions. tagging collaborator @nfahlgren
Thanks!

Version and installation information

  • zarr 2.4.0
  • numcodecs 0.7.2
  • python 3.7.8 from conda-forge
  • Windows
  • using conda
@manzt
Member

manzt commented Oct 9, 2020

Just chiming in as a Zarr user who's experimented a bit with adapting existing image formats to zarr.

I think this can best be accomplished by creating a custom store that maps Zarr keys to individual images (files). In this case, the store needs to provide some additional metadata (.zarray, .zgroup, etc) that describe the hierarchy for the images that you desire.

If they are jpeg/png, the store will need to take care of decoding/encoding the images. I have actually prototyped something quite similar to this to view Deep Zoom Image (DZI) pyramids as Zarr.

DZI is a format for image pyramids where each level of a pyramid is a directory of jpeg/png "tiles".

Here is the custom store implementation that maps the DZI format to the multiscale zarr specification:

https://github.com/manzt/napari-dzi-zarr/blob/master/napari_dzi_zarr/store.py
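
To make the custom-store idea above a bit more concrete, here is a minimal sketch of a read-only, dict-like store that serves one image file per chunk and synthesizes the missing `.zarray` metadata. It is not the napari-dzi-zarr implementation: `ImageFolderStore`, `image_shape`, and `decode` are hypothetical names introduced for illustration, the sketch assumes every image has the same shape, and it has not been run against zarr itself.

```python
import json
from collections.abc import MutableMapping


class ImageFolderStore(MutableMapping):
    """Read-only store sketch: one image file per Zarr v2 chunk (hypothetical)."""

    def __init__(self, paths, image_shape, decode, dtype="|u1"):
        self._paths = list(paths)
        self._decode = decode  # callable: encoded file bytes -> raw chunk bytes
        meta = {
            "zarr_format": 2,
            "shape": [len(self._paths), *image_shape],
            "chunks": [1, *image_shape],
            "dtype": dtype,
            "order": "C",
            "compressor": None,  # the store returns already-decoded bytes
            "filters": None,
            "fill_value": None,
        }
        self._meta = json.dumps(meta).encode()
        # every chunk key is "<image index>" followed by ".0" per image axis
        self._suffix = ".0" * len(image_shape)

    def __getitem__(self, key):
        if key == ".zarray":
            return self._meta
        index, dot, rest = key.partition(".")
        if not index.isdigit() or dot + rest != self._suffix:
            raise KeyError(key)
        if int(index) >= len(self._paths):
            raise KeyError(key)
        with open(self._paths[int(index)], "rb") as fh:
            return self._decode(fh.read())

    def __iter__(self):
        yield ".zarray"
        for i in range(len(self._paths)):
            yield f"{i}{self._suffix}"

    def __len__(self):
        return len(self._paths) + 1

    def __setitem__(self, key, value):
        raise PermissionError("store is read-only")

    def __delitem__(self, key):
        raise PermissionError("store is read-only")
```

For PNGs, `decode` could be something along the lines of `lambda b: imagecodecs.png_decode(b).tobytes()`, so that the store, not zarr, takes care of the image decoding.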

@rabernat
Contributor

rabernat commented Oct 9, 2020

This is closely related to #556.

@manzt
Member

manzt commented Oct 9, 2020

@rabernat my understanding here is that there are many separate files (jpegs) and the desire is to map those files as Zarr array chunks within a hierarchy. In #556, there is a single binary container and the mapping is to bytes-ranges within the container. Key difference as well is the need for the store to perform decoding/encoding.

@rabernat
Contributor

rabernat commented Oct 9, 2020

I see your point. In my mind, what they have in common is the desire to "wrap" an existing storage scheme with Zarr.

@jakirkham
Member

Might also find this rough spec extension idea ( zarr-developers/zarr-specs#82 ) of interest 😉

@dschneiderch
Author

thanks, @manzt i think your interpretation is correct (i was actually just looking at your napari project) and zarr-specs#82 indeed seems like it would address my use case.... if I understood it correctly-a very technical explanation. I will keep 👀 on that and explore the napari dzi zarr store.

@rabernat
Contributor

@dschneiderch: we have been working on a proposed specification to map collections of binary files to zarr chunks in https://github.com/intake/fsspec-reference-maker. It would be great to get your feedback on whether that would meet your use case.

@manzt
Member

manzt commented Mar 16, 2021

As it's recently been merged, the v1 spec should allow you to explicitly map your image files to the zarr data model.

{
  "version": 1,
  "templates": {
    "path": "file://data_dir" 
  },
  "gen": [], // see spec for more info, but can dynamically generate references
  "refs": {
    ".zarray":  "{\n    \"chunks\": [\n 512, 512, 3  \n],\n    \"compressor\": null,\n    \"dtype\": \"u1\",\n  ...",
    "0.0.0": ["{{ path }}/img0.jpg"],
    "1.0.0": ["{{ path }}/img1.jpg"],
  }
}

There is an open PR to add v1 to fsspec (which backs the zarr.FSStore). Unfortunately, this doesn't solve the problem of decoding JPEG-encoded chunks (something that isn't handled natively in zarr by numcodecs).

One option is to register a custom codec for zarr to use when accessing array chunks. This means you'll need to specify a "compressor" field in the .zarray metadata with a configuration that maps to your codec so that zarr knows how to decode each chunk. I haven't given this a try myself, but this repo might be of interest: https://github.com/d-v-b/zarr-jpeg
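
As a rough illustration of how a reader might interpret the reference description above, the sketch below resolves a Zarr key to either inline bytes or a file location. It is not fsspec's implementation: `expand` and `resolve` are hypothetical helpers, the templating is reduced to plain `{{ name }}` substitution, the `.zarray` contents are illustrative, and features such as `gen` or base64-encoded inline data are ignored.

```python
import json
import re


def expand(template, variables):
    """Substitute {{ name }} placeholders (a simple stand-in for real templating)."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: variables[m.group(1)], template)


def resolve(spec, key):
    """Return ("inline", bytes) or ("file", url, offset, length) for a Zarr key."""
    ref = spec["refs"][key]
    if isinstance(ref, str):
        # a string entry carries the data itself, e.g. the .zarray JSON
        return ("inline", ref.encode())
    url = expand(ref[0], spec.get("templates", {}))
    # [url] means the whole file; [url, offset, length] means a byte range
    offset, length = (ref[1], ref[2]) if len(ref) == 3 else (0, None)
    return ("file", url, offset, length)


spec = {
    "version": 1,
    "templates": {"path": "file://data_dir"},
    "refs": {
        # illustrative metadata for two 512 x 512 single-channel chunks
        ".zarray": json.dumps({"zarr_format": 2, "shape": [2, 512, 512],
                               "chunks": [1, 512, 512], "dtype": "|u1"}),
        "0.0.0": ["{{ path }}/img0.jpg"],
        "1.0.0": ["{{ path }}/img1.jpg"],
    },
}

print(resolve(spec, "1.0.0"))  # ('file', 'file://data_dir/img1.jpg', 0, None)
```

The point of the design is that zarr only ever asks for keys; where the bytes actually live is decided entirely by the reference description.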

@dschneiderch
Author

great to see this! sorry I didn't chime in earlier.
I'd give this a try but I have to admit i'm pretty much lost on how this gets used.

I tried:

import zarr
import glob

fns = glob.glob('data/psII/dataset-A1-20200531/*.png')
store1 = zarr.DirectoryStore('data/psII/dataset-A1-20200531.zarr')
root = zarr.group(store1, overwrite=True)
baz = root.create_dataset('data', data=fns, chunks=(len(fns),))

I also tried to use the json directly in a refgroup.json:

{
  "version": 1,
  "templates": {
    "path": "file://data/psII/dataset-A1-20200531" 
  },
  "gen": [], // see spec for more info, but can dynamically generate references
  "refs": {
    ".zarray":  "{\n    \"chunks\": [\n 480, 640, 1  \n],\n    \"compressor\": null,\n    \"dtype\": \"u1\",\n  ...",
    "0.0.0": ["{{ path }}/A1-doi-20200531T210155-PSII0-1.png"],
    "1.0.0": ["{{ path }}/A1-doi-20200531T210155-PSII0-2.png"],
  }
}

and then based on the test in the PR

import fsspec
fs = fsspec.filesystem("reference", references='./refgroup.json')

but that gives NotImplementedError: Only works with async targets

i have a basic folder structure:
(plantcv-dev) C:\Users\dominikschneider\Documents\phenomics\zarr_test>tree /F
Folder PATH listing for volume OS
Volume serial number is F0EE-E8D5
C:.
│   readfiles.py
│   refgroup.json
│
└───data
    └───psII
        └───dataset-A1-20200531
                A1-doi-20200531T210155-PSII0-1.png
                A1-doi-20200531T210155-PSII0-2.png

I still need to figure out the encoding/decoding part for png.
tagging @nfahlgren too

@manzt
Member

manzt commented Apr 7, 2021

I'd give this a try but I have to admit i'm pretty much lost on how this gets used.

Realizing my response likely confused more than it helped. The short answer is that the ReferenceFileSystem provides a formal specification to express the idea of "zarr pointer to existing files". The catch is that you need to write some code to generate this description (in JSON), effectively translating your custom directory structure to the Zarr data model. Let me elaborate.

Using your directory of PNGs as an example, we can think of each PNG as a compressed Zarr Array "chunk". It is up to you how you want to organize these "chunks" in a Zarr hierarchy. You could treat each chunk as an individual Zarr Array, or you could lay out the chunks as a single multi-dimensional Zarr Array. The latter is likely how you'd like to use Zarr, but this is only possible if each PNG "chunk" has the same shape.

With your two PNGs, we can think of a theoretical Zarr Array having the following attributes:

  • shape: [2, height, width, 3] - shape of your virtual array
  • chunks: [1, height, width, 3] - shape of each PNG "chunk"
  • order: "C" - decoded PNGs have a row-major byte layout
  • dtype: "|u1" - 8-bit PNGs decode to uint8 arrays
  • compressor: PNG compression

In Zarr, this "Array" is written to a store with the following keys: .zarray, 0.0.0.0, 1.0.0.0. Chunk keys carry one index per array dimension, so a grayscale stack of shape [2, height, width] would use 0.0.0 and 1.0.0 instead. The default store in Zarr is the file system, so this can be written to disk like:

.
└── data.zarr/
    ├── .zarray # array metadata (JSON)
    ├── 0.0.0.0 # A1-doi-20200531T210155-PSII0-1.png
    └── 1.0.0.0 # A1-doi-20200531T210155-PSII0-2.png

The issue is that you don't want to rename files, and .zarray doesn't exist. This is where the ReferenceFileSystem can help. It allows you to:

1.) Create the missing .zarray metadata
2.) Explicitly map Zarr keys to each PNG

  • 0.0.0.0 -> A1-doi-20200531T210155-PSII0-1.png
  • 1.0.0.0 -> A1-doi-20200531T210155-PSII0-2.png

Therefore, the reference description would look something like:

// reference.json
{
  "version": 1,
  "templates": {
    "path": "file://data/psII/dataset-A1-20200531"
  },
  "refs": {
    ".zarray": "{ \"chunks\": [1, 512, 512, 3], \"compressor\": null, \"dtype\": \"|u1\", \"fill_value\": null, \"filters\": null, \"order\": \"C\", \"shape\": [2, 512, 512, 3], \"zarr_format\": 2 }",
    "0.0.0.0": ["{{ path }}/A1-doi-20200531T210155-PSII0-1.png"],
    "1.0.0.0": ["{{ path }}/A1-doi-20200531T210155-PSII0-2.png"]
  }
}

See how the .zarray metadata is encoded as a string inline and the other entries point to the PNGs on disk.
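
For more than a couple of files, the chunk keys do not need to be typed by hand. The sketch below enumerates the keys of a Zarr v2 array with the default "." separator from its shape and chunk shape; `chunk_keys` is a hypothetical helper, and the 480 x 640 grayscale shape is an assumption based on the image size mentioned earlier in the thread.

```python
import itertools
import math


def chunk_keys(shape, chunks):
    """List Zarr v2 chunk keys (default '.' separator) in C order."""
    # number of chunks along each dimension
    counts = [math.ceil(s / c) for s, c in zip(shape, chunks)]
    return [
        ".".join(str(i) for i in index)
        for index in itertools.product(*(range(n) for n in counts))
    ]


# two grayscale images of 480 x 640 pixels, one image per chunk
print(chunk_keys((2, 480, 640), (1, 480, 640)))  # ['0.0.0', '1.0.0']
```

Each key has one index per array dimension, so a stack with a trailing channel axis gets one more index per key. Pairing the returned keys with a sorted list of file names gives the entries of "refs".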

This Zarr Array metadata can look tedious to write, but Zarr actually has some utilities to write it. I would personally create this reference in Python using the following:

# write_reference.py
import json

import imageio
from zarr.storage import init_array

# read one image to learn the chunk shape, e.g. (height, width) or (height, width, 3)
example_chunk = imageio.imread('A1-doi-20200531T210155-PSII0-1.png')

refs = dict()
# writes ".zarray" to refs
init_array(
  refs,
  shape=(2,) + example_chunk.shape,
  chunks=(1,) + example_chunk.shape,
  dtype="|u1",
  compressor=None, # ignoring compression for now
)
refs[".zarray"] = refs[".zarray"].decode() # decode bytes as a python string
# chunk keys need one index per dimension: "0.0.0" for grayscale PNGs, "0.0.0.0" for RGB
trailing = ".0" * example_chunk.ndim
refs["0" + trailing] = ["{{ path }}/A1-doi-20200531T210155-PSII0-1.png"]
refs["1" + trailing] = ["{{ path }}/A1-doi-20200531T210155-PSII0-2.png"]

spec = dict(
  version=1,
  templates=dict(path="file://data/psII/dataset-A1-20200531"),
  refs=refs
)

with open('reference.json', mode='w') as fh:
  fh.write(json.dumps(spec))

but that gives NotImplementedError: Only works with async targets

Unfortunately, I think this is likely a current limitation of the fsspec reference implementation. Use cases for ReferenceFileSystem have generally been targeted at reading large binary files (HDF5, NetCDF, TIFF) remotely via http/s3/gcs, so instead of file://data/psII/dataset-A1-20200531, targets are generally s3://data/psII/dataset-A1-20200531. It's certainly fixable; the reference implementation was only added to fsspec recently.

A note on compression

Finally, I haven't addressed the issue of the "chunks" being encoded as PNG. By default, Zarr uses various codecs from a library called numcodecs to decode each Zarr Array chunk. A PNG codec is not included in numcodecs, so Zarr cannot decode each chunk unless you add a special codec to a registry. Fortunately, I found today that imagecodecs now exposes some codec implementations intended to be used in Zarr!

I haven't tried this with PNG (but I have with JPEG). When writing .zarray metadata (above), you'll need to provide an actual codec to init_array:

# write_reference.py
from imagecodecs.numcodecs import Png
from zarr.storage import init_array

# example_chunk is the image read with imageio in the script above
refs = {}
# writes ".zarray" to refs
init_array(
  refs,
  shape=(2,) + example_chunk.shape,
  chunks=(1,) + example_chunk.shape,
  dtype="|u1",
  compressor=Png(), # writes { "id": "imagecodecs_png" },  tells zarr client "get the imagecodecs_png codec to decode each chunk!"
)

Then, when you use Zarr, you'll need to run a function from imagecodecs at the top of your script that sets up the image codecs in the registry:

# read_reference.py
import zarr
from imagecodecs.numcodecs import register_codecs
register_codecs() # adds all image codecs to the zarr registry
# use zarr

I hope this comment adds some clarity to how to use the ReferenceFileSystem in your situation. One potential issue I see is that the reference file system is read-only at the moment, so if you need the ability to write chunks with different key names (e.g. *.png), a different approach (a la custom Store) is likely needed.

@martindurant
Member

Note on NotImplementedError: Only works with async targets:

This is a current limitation of ReferenceFileSystem, so it only works with HTTP, S3, GCS and Azure. Making it also work with local files is totally doable, but (in my opinion) less useful. The bigger wins for fsspec-reference-maker are having to download only small parts of potentially massive remote data and getting parallel access to archives.

@dschneiderch
Author

Ok, Thanks for the thorough explanation! However, none of that will work for me without implementation of ReferenceFileSystem for local files, right?

Our use is primarily to package the stack of image files as a single object, so read-only would be ok. However, we need it to work locally. Most processing is still happening locally, but I was hoping this would allow us to scale up to remote stores too. plantcv only works with a single input: given a list of image files, it handles parallel processing for each file via dask. We have cases where groups of images need to be kept together though, so we thought we could use zarr to preprocess the stack and create new "files" containing groups of images without duplicating the data.

@cgohlke
Contributor

cgohlke commented Apr 7, 2021

The TiffSequence class from the tifffile package might be able to create a zarr array from existing PNG images, e.g.:

import zarr
import tifffile
import imagecodecs

with tifffile.TiffSequence('*.png', imread=imagecodecs.imread) as pngs:
    with pngs.aszarr(codec=imagecodecs.png_decode) as store:
        za = zarr.open(store, mode='r')
        print(za)
        print(za[:])  # loads the full dataset

The ZarrFileStore does not currently allow exporting an fsspec ReferenceFileSystem.

Supporting local files in the ReferenceFileSystem would be very useful also for development and testing.

@dschneiderch
Author

Using tiffsequence worked great to create a zarr object (once i upgraded tifffile with conda main instead of conda-forge).
correct me if i'm wrong:
the code reads all *.png files into a tiffsequence which is then converted to a zarr object. now the arrays from each png are stored in memory within the zarr store framework.

But I saved it without converting to zarr first:
zarr.save('data.zarr', pngs.asarray(codec=imagecodecs.png_decode))
and this will simply save a N-D array in a binary format which I can load with zarr.load() and access the slices

Is there a way to save za, for example where each png file is copied into the zarr store, or is that where I run into trouble with fsspec and potentially en/decoding?

fwiw i would be interested in loading this with xarray with labels for each png. alternatively i could load the pngs into xarray and then save to disk as netcdf (or zarr i guess) but i was trying to avoid duplicating the data.

@cgohlke
Contributor

cgohlke commented Apr 8, 2021

once i upgraded tifffile with conda main instead of conda-forge

conda-forge should work.

the code reads all *.png files into a tiffsequence which is then converted to a zarr object. now the arrays from each png are stored in memory within the zarr store framework.

The store object is a read-only zarr store where each file in the sequence contains one zarr chunk. Nothing is loaded into memory until the zarr array is indexed (actually, the first file is read on store init to determine the chunk size and dtype). E.g. za[10] reads the 11th PNG file in the sequence and returns the image as a numpy array. za[:10] reads the first 10 PNG files and returns the image data as a stacked numpy array. The decoding of the PNGs is not handled by zarr but by the store object (via imagecodecs.png_decode in this case).

zarr.save('data.zarr', pngs.asarray(codec=imagecodecs.png_decode))

This will read the images from the whole file sequence into memory and then save it as a separate, writable zarr array.

fwiw i would be interested in loading this with xarray with labels for each png

If you are interested in organizing your files into a higher dimensional zarr array, TiffSequence takes an optional regular expression pattern that matches axes and sequence indices in the file names. That can get quite complicated: https://github.com/cgohlke/tifffile/blob/581d7a5d4d7784154066b9f11a0167bc08570b7c/tests/test_tifffile.py#L12686-L12729
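
Independent of tifffile's own pattern handling, the underlying idea of turning file names into positions along several axes can be sketched with the standard library alone. Everything here is an assumption made for illustration: the reading of the names as `<barcode>-doi-<timestamp>-<parameter>-<frame>.png`, the helper `index_files`, and the second-day file names, which do not appear in the thread.

```python
import os
import re

# Hypothetical interpretation of names like A1-doi-20200531T210155-PSII0-1.png
PATTERN = re.compile(
    r"(?P<barcode>[^-]+)-doi-(?P<timestamp>\d{8}T\d{6})"
    r"-(?P<parameter>[^-]+)-(?P<frame>\d+)\.png"
)


def index_files(names, axes=("timestamp", "frame")):
    """Map each file to an index tuple along the requested axes."""
    parsed = []
    for name in names:
        match = PATTERN.fullmatch(os.path.basename(name))
        if match is None:
            continue  # skip files that do not follow the naming scheme
        fields = match.groupdict()
        fields["frame"] = int(fields["frame"])  # sort frames numerically
        parsed.append((fields, name))
    # sorted unique values become the coordinate labels of each axis
    labels = {ax: sorted({fields[ax] for fields, _ in parsed}) for ax in axes}
    grid = {
        tuple(labels[ax].index(fields[ax]) for ax in axes): name
        for fields, name in parsed
    }
    return labels, grid


names = [
    "A1-doi-20200531T210155-PSII0-1.png",
    "A1-doi-20200531T210155-PSII0-2.png",
    "A1-doi-20200601T210155-PSII0-1.png",  # made-up second day
    "A1-doi-20200601T210155-PSII0-2.png",
]
labels, grid = index_files(names)
print(grid[(1, 0)])  # A1-doi-20200601T210155-PSII0-1.png
```

The index tuples in `grid` correspond to the leading chunk indices of a higher dimensional array, and `labels` holds the values that could serve as coordinates when loading the result with xarray.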

@martindurant
Member

Fair enough, I'll look into non-async soon. It ought not to be too hard.

@martindurant
Member

fsspec/filesystem_spec#604

@jhamman
Member

jhamman commented Dec 7, 2023

Closing now that tifffile and fsspec support this use case!

@jhamman jhamman closed this as completed Dec 7, 2023