Commit: Move dataset loading methods out into a single-dispatch function (#274)
* Move dataset loading methods out into a single-dispatch function

It's a bit messy at the moment, will want some tidying up

* Update more loading commands that I somehow missed

* Forgot to commit the actual loader because I'm a fool

* Changelog

* Found another old call hiding in the docs

* Swap one no-op for another to appease the coverage gods

* Code is now constructed so that this will never run

* Add basic test for loading TiledDataset

* Use a test file that actually exists. Maybe that will help

* Update dkist/dataset/loader.py

Co-authored-by: Stuart Mumford <[email protected]>

* Update dkist/dataset/loader.py

Co-authored-by: Stuart Mumford <[email protected]>

* Revive Dataset.from_ loading methods and deprecate

Plus tests

* Raise error for unrecognised types in single-dispatch

* Pass load_dataset up to dkist

* Update changelog/274.feature.rst

Co-authored-by: Stuart Mumford <[email protected]>

* Use load_dataset in aia generation script

* Update docs to use load_dataset

* Add note for later work

* Add jsonschema as explicit dependency

Fixes #276

* asdf needs jsonschema <4 apparently

* Missed a from_asdf

* Helps if you use variables that exist

* Use pytest.warns in tests instead of pytest.raises

Because it wasn't marking code as run

* Add test to cover invalid input to loader

* Replace AstropyDeprecationWarning with new DKISTDeprecationWarning

* Imports are hard

* Forgot to add some important things

* Apply suggestions from code review

* Update dkist/dataset/loader.py

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Co-authored-by: Stuart Mumford <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
3 people authored Aug 1, 2023
1 parent 914f873 commit 1cdd106
Showing 15 changed files with 145 additions and 57 deletions.
1 change: 1 addition & 0 deletions changelog/274.feature.rst
@@ -0,0 +1 @@
Add a new `dkist.load_dataset` function to combine and replace ``Dataset.from_directory()`` and ``Dataset.from_asdf()``.
1 change: 1 addition & 0 deletions changelog/274.trivial.rst
@@ -0,0 +1 @@
Add jsonschema as an explicit dependency (previously it was provided by asdf).
4 changes: 2 additions & 2 deletions dkist/__init__.py
@@ -5,7 +5,7 @@

import astropy.config as _config

from .dataset import Dataset, TiledDataset # noqa
from .dataset import Dataset, TiledDataset, load_dataset # noqa
from .utils.sysinfo import system_info # noqa

try:
@@ -14,7 +14,7 @@
# package is not installed
__version__ = "unknown"

__all__ = ['TiledDataset', 'Dataset', 'system_info']
__all__ = ['TiledDataset', 'Dataset', 'load_dataset', 'system_info']


def write_default_config(overwrite=False):
3 changes: 2 additions & 1 deletion dkist/conftest.py
@@ -16,6 +16,7 @@
from astropy.time import Time
from sunpy.coordinates.frames import Helioprojective

from dkist import load_dataset
from dkist.data.test import rootdir
from dkist.dataset import Dataset
from dkist.dataset.tiled_dataset import TiledDataset
@@ -303,4 +304,4 @@ def large_visp_dataset(tmp_path_factory):
ds.generate_files(vispdir)
dataset_from_fits(vispdir, "test_visp.asdf")

return Dataset.from_asdf(vispdir / "test_visp.asdf")
return load_dataset(vispdir / "test_visp.asdf")
1 change: 1 addition & 0 deletions dkist/dataset/__init__.py
@@ -1,2 +1,3 @@
from .dataset import Dataset
from .loader import load_dataset
from .tiled_dataset import TiledDataset
43 changes: 7 additions & 36 deletions dkist/dataset/dataset.py
@@ -1,17 +1,15 @@
import importlib.resources as importlib_resources
from pathlib import Path
from textwrap import dedent

import numpy as np
from jsonschema.exceptions import ValidationError

import asdf
import gwcs
from astropy.wcs.wcsapi.wrappers import SlicedLowLevelWCS
from ndcube.ndcube import NDCube, NDCubeLinkedDescriptor

from dkist.io.file_manager import FileManager
from dkist.utils.decorators import deprecated

from .loader import load_dataset
from .utils import dataset_info_str

__all__ = ['Dataset']
@@ -202,50 +200,23 @@ def inventory(self):
"""

@classmethod
@deprecated(since="1.0.0", alternative="load_dataset")
def from_directory(cls, directory):
"""
Construct a `~dkist.dataset.Dataset` from a directory containing one
asdf file and a collection of FITS files.
"""
base_path = Path(directory).expanduser()
if not base_path.is_dir():
raise ValueError("directory argument must be a directory")
asdf_files = tuple(base_path.glob("*.asdf"))

if not asdf_files:
raise ValueError("No asdf file found in directory.")
elif len(asdf_files) > 1:
raise NotImplementedError("Multiple asdf files found in this "
"directory. Use from_asdf to specify which "
"one to use.") # pragma: no cover

asdf_file = asdf_files[0]

return cls.from_asdf(asdf_file)
return load_dataset(directory)

@classmethod
@deprecated(since="1.0.0", alternative="load_dataset")
def from_asdf(cls, filepath):
"""
Construct a dataset object from a filepath of a suitable asdf file.
"""
from dkist.dataset import TiledDataset
filepath = Path(filepath).expanduser()
base_path = filepath.parent
try:
with importlib_resources.as_file(importlib_resources.files("dkist.io") / "level_1_dataset_schema.yaml") as schema_path:
with asdf.open(filepath, custom_schema=schema_path.as_posix(),
lazy_load=False, copy_arrays=True) as ff:
ds = ff.tree['dataset']
if isinstance(ds, TiledDataset):
for sub in ds.flat:
sub.files.basepath = base_path
else:
ds.files.basepath = base_path
return ds

except ValidationError as e:
err = f"This file is not a valid DKIST Level 1 asdf file, it fails validation with: {e.message}."
raise TypeError(err) from e

return load_dataset(filepath)

"""
Private methods.
75 changes: 75 additions & 0 deletions dkist/dataset/loader.py
@@ -0,0 +1,75 @@
import importlib.resources as importlib_resources
from pathlib import Path
from functools import singledispatch

from jsonschema.exceptions import ValidationError

import asdf


@singledispatch
def load_dataset(target):
"""Function to load a Dataset from either an asdf file path or a directory."""
raise TypeError("Input type not recognised. Must be a string or pathlib.Path referencing a "
".asdf file or a directory containing one.")


@load_dataset.register
def _load_from_string(path: str):
# TODO Adjust this to accept URLs as well
return _load_from_path(Path(path))


@load_dataset.register
def _load_from_path(path: Path):
path = path.expanduser()
if not path.is_dir():
if not path.exists():
raise ValueError(f"File {path} does not exist.")
return _load_from_asdf(path)
else:
return _load_from_directory(path)


def _load_from_directory(directory):
"""
Construct a `~dkist.dataset.Dataset` from a directory containing one
asdf file and a collection of FITS files.
"""
base_path = Path(directory).expanduser()
asdf_files = tuple(base_path.glob("*.asdf"))

if not asdf_files:
raise ValueError("No asdf file found in directory.")
elif len(asdf_files) > 1:
raise NotImplementedError("Multiple asdf files found in this "
"directory. Use from_asdf to specify which "
"one to use.") # pragma: no cover

asdf_file = asdf_files[0]

return _load_from_asdf(asdf_file)


def _load_from_asdf(filepath):
"""
Construct a dataset object from a filepath of a suitable asdf file.
"""
from dkist.dataset import TiledDataset
filepath = Path(filepath).expanduser()
base_path = filepath.parent
try:
with importlib_resources.as_file(importlib_resources.files("dkist.io") / "level_1_dataset_schema.yaml") as schema_path:
with asdf.open(filepath, custom_schema=schema_path.as_posix(),
lazy_load=False, copy_arrays=True) as ff:
ds = ff.tree['dataset']
if isinstance(ds, TiledDataset):
for sub in ds.flat:
sub.files.basepath = base_path
else:
ds.files.basepath = base_path
return ds

except ValidationError as e:
err = f"This file is not a valid DKIST Level 1 asdf file, it fails validation with: {e.message}."
raise TypeError(err) from e
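The new loader hinges on `functools.singledispatch`: the undecorated base function is the fallback that raises `TypeError` for unrecognised input types, and the `str` and `Path` overloads are registered via their parameter annotations, with the string overload normalising to `Path` and re-dispatching. A minimal standalone sketch of the same pattern (hypothetical names, no dkist dependency — not the actual implementation):

```python
from functools import singledispatch
from pathlib import Path


@singledispatch
def load(target):
    # Base case: any type without a registered overload lands here.
    raise TypeError("Input type not recognised. Must be a string or pathlib.Path.")


@load.register
def _load_from_string(path: str):
    # Strings are normalised to Path and re-dispatched to the Path overload.
    return load(Path(path))


@load.register
def _load_from_path(path: Path):
    # Stand-in for the real directory/asdf handling.
    return f"would load {path.expanduser()}"
```

With this shape, `load("data.asdf")` routes through the string overload into the `Path` one, while `load(42)` falls through to the base function and raises, which is what the `test_load_with_invalid_input` test below exercises.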
53 changes: 41 additions & 12 deletions dkist/dataset/tests/test_dataset.py
@@ -12,13 +12,14 @@
from astropy.tests.helper import assert_quantity_allclose

from dkist.data.test import rootdir
from dkist.dataset import Dataset
from dkist.dataset import Dataset, TiledDataset, load_dataset
from dkist.io import FileManager
from dkist.utils.exceptions import DKISTDeprecationWarning


@pytest.fixture
def invalid_asdf(tmpdir):
filename = Path(tmpdir / "test.asdf")
def invalid_asdf(tmp_path):
filename = Path(tmp_path / "test.asdf")
tree = {'spam': 'eggs'}
with asdf.AsdfFile(tree=tree) as af:
af.write_to(filename)
@@ -27,7 +28,7 @@ def invalid_asdf(tmpdir):

def test_load_invalid_asdf(invalid_asdf):
with pytest.raises(TypeError):
Dataset.from_asdf(invalid_asdf)
load_dataset(invalid_asdf)


def test_missing_quality(dataset):
@@ -71,38 +72,66 @@ def test_dimensions(dataset, dataset_3d):


def test_load_from_directory():
ds = Dataset.from_directory(os.path.join(rootdir, 'EIT'))
ds = load_dataset(os.path.join(rootdir, 'EIT'))
assert isinstance(ds.data, da.Array)
assert isinstance(ds.wcs, gwcs.WCS)
assert_quantity_allclose(ds.dimensions, (11, 128, 128)*u.pix)
assert ds.files.basepath == Path(os.path.join(rootdir, 'EIT'))


def test_from_directory_no_asdf(tmpdir):
def test_from_directory_no_asdf(tmp_path):
with pytest.raises(ValueError) as e:
Dataset.from_directory(tmpdir)
load_dataset(tmp_path)
assert "No asdf file found" in str(e)


def test_from_not_directory():
with pytest.raises(ValueError) as e:
Dataset.from_directory(rootdir / "notadirectory")
load_dataset(rootdir / "notadirectory")
assert "directory argument" in str(e)


def test_load_tiled_dataset():
ds = load_dataset(os.path.join(rootdir, 'test_tiled_dataset-1.0.0_dataset-1.1.0.asdf'))
assert isinstance(ds, TiledDataset)
assert ds.shape == (3, 3)


def test_load_with_old_methods():
with pytest.warns(DKISTDeprecationWarning):
ds = Dataset.from_directory(os.path.join(rootdir, 'EIT'))
assert isinstance(ds.data, da.Array)
assert isinstance(ds.wcs, gwcs.WCS)
assert_quantity_allclose(ds.dimensions, (11, 128, 128)*u.pix)
assert ds.files.basepath == Path(os.path.join(rootdir, 'EIT'))

with pytest.warns(DKISTDeprecationWarning) as e:
ds = Dataset.from_asdf(os.path.join(rootdir, 'EIT', "eit_test_dataset.asdf"))
assert isinstance(ds.data, da.Array)
assert isinstance(ds.wcs, gwcs.WCS)
assert_quantity_allclose(ds.dimensions, (11, 128, 128)*u.pix)
assert ds.files.basepath == Path(os.path.join(rootdir, 'EIT'))


def test_from_directory_not_dir():
with pytest.raises(ValueError) as e:
Dataset.from_directory(rootdir / 'EIT' / 'eit_2004-03-01T00_00_10.515000.asdf')
load_dataset(rootdir / 'EIT' / 'eit_2004-03-01T00_00_10.515000.asdf')
assert "must be a directory" in str(e)


def test_load_with_invalid_input():
with pytest.raises(TypeError) as e:
load_dataset(42)
assert "Input type not recognised." in str(e)


def test_crop_few_slices(dataset_4d):
sds = dataset_4d[0, 0]
assert sds.wcs.world_n_dim == 2


def test_file_manager():
dataset = Dataset.from_directory(os.path.join(rootdir, 'EIT'))
dataset = load_dataset(os.path.join(rootdir, 'EIT'))
assert dataset.files is dataset._file_manager
with pytest.raises(AttributeError):
dataset.files = 10
@@ -120,12 +149,12 @@ def test_no_file_manager(dataset_3d):


def test_inventory_propery():
dataset = Dataset.from_directory(os.path.join(rootdir, 'EIT'))
dataset = load_dataset(os.path.join(rootdir, 'EIT'))
assert dataset.inventory == dataset.meta['inventory']


def test_header_slicing_single_index():
dataset = Dataset.from_directory(os.path.join(rootdir, 'EIT'))
dataset = load_dataset(os.path.join(rootdir, 'EIT'))
idx = 5
sliced = dataset[idx]

5 changes: 2 additions & 3 deletions dkist/tests/generate_aia_dataset.py
@@ -18,6 +18,7 @@
from sunpy.net.jsoc import JSOCClient
from sunpy.time import parse_time

from dkist import load_dataset
from dkist.asdf_maker.helpers import generate_lookup_table


@@ -204,9 +205,7 @@ def main():

# import sys; sys.exit(0)

from dkist.dataset import Dataset

ds = Dataset.from_directory(str(path))
ds = load_dataset(str(path))
print(repr(ds))
print(repr(ds.wcs))
print(ds.wcs(*[1*u.pix]*4, with_units=True))
7 changes: 7 additions & 0 deletions dkist/utils/decorators.py
@@ -0,0 +1,7 @@
from astropy.utils.decorators import deprecated as astropy_deprecated

from .exceptions import DKISTDeprecationWarning


def deprecated(*args, **kwargs):
return astropy_deprecated(*args, warning_type=DKISTDeprecationWarning, **kwargs)
2 changes: 2 additions & 0 deletions dkist/utils/exceptions.py
@@ -0,0 +1,2 @@
class DKISTDeprecationWarning(DeprecationWarning):
pass
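The two helper files above pair a project-specific warning class with a thin wrapper that forwards to astropy's `deprecated` decorator, passing `warning_type=DKISTDeprecationWarning`. The effect can be sketched without astropy using only the stdlib `warnings` module (an illustrative analogue with simplified arguments, not the real implementation):

```python
import functools
import warnings


class DKISTDeprecationWarning(DeprecationWarning):
    """Project-specific warning, so callers can filter dkist deprecations alone."""


def deprecated(since, alternative):
    # Simplified stand-in for wrapping astropy's deprecated decorator
    # with a custom warning_type.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated since {since}; "
                f"use {alternative} instead.",
                DKISTDeprecationWarning,
                stacklevel=2,
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


@deprecated(since="1.0.0", alternative="load_dataset")
def from_asdf(filepath):
    # Hypothetical deprecated method body.
    return f"loaded {filepath}"
```

Because the warning is a dedicated subclass, `pytest.warns(DKISTDeprecationWarning)` in the tests above matches exactly these deprecations and nothing else.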
2 changes: 1 addition & 1 deletion docs/guide/loading.rst
@@ -74,7 +74,7 @@ The only thing which is specific to the `~dkist.Dataset` class is the interaction
When you slice a dataset the new, smaller, dataset has a new `~dkist.Dataset.files` object which is unrelated to the one of the larger parent `~dkist.Dataset`.
This means that if you slice the dataset::

>>> ds = dkist.Dataset.from_asdf(myfilename)
>>> ds = dkist.load_dataset(myfilename)
>>> small_ds = ds[10:20, :, 5]

and then download the files corresponding to the smaller dataset::
2 changes: 1 addition & 1 deletion docs/guide/net.rst
@@ -92,7 +92,7 @@ How to do this is detailed in the next section, :ref:`loadinglevel1data`, but a

.. code-block:: python
ds = dkist.Dataset.from_asdf(files[0])
ds = dkist.load_dataset(files[0])
Once the dataset is loaded, we can use the `dkist.Dataset.files` property to manage where the dataset looks for the FITS files associated with the dataset.
By default the ``Dataset`` object will assume the FITS files are in the same directory as the ASDF file that was loaded.
2 changes: 1 addition & 1 deletion examples/create_dataset.py
@@ -12,7 +12,7 @@
###############################################################################
# Dataset objects are created from a directory containing one asdf file and
# many FITS files. Here we use a test dataset made from EIT images.
ds = dkist.dataset.Dataset.from_directory(EIT_DATASET)
ds = dkist.dataset.load_dataset(EIT_DATASET)

###############################################################################
# The dataset comprises of a `dask.Array` object and a `gwcs.WCS` object. The
1 change: 1 addition & 0 deletions setup.cfg
@@ -35,6 +35,7 @@ install_requires =
parfive[ftp]>=1.5
sunpy[net,asdf]>=4.0.7
setuptools>=59
jsonschema>=3.2
setup_requires = setuptools_scm

[options.extras_require]
