This repository has been archived by the owner on Aug 29, 2023. It is now read-only.

Making data local should be a workflow step #287

Closed
forman opened this issue Jul 7, 2017 · 9 comments

Comments

@forman
Member

forman commented Jul 7, 2017

Expected behavior

The operation open_dataset should be extended by a force_local: bool = False option and a local_ds_id: str = None option.

If force_local == True and ds_id refers to a remote data source, then

  • local_ds_id is either given or will be set to a new ID made up of the ds_id and any given constraints.
  • if a local data source with ID local_ds_id already exists and is valid, then it will be opened
  • if a local data source with ID local_ds_id does not yet exist, it will be created by downloading the remote data, then the local version will be opened

If force_local == False the behavior is as it is now.

The new desired behavior is that we want to make the data access part of a workflow so the enclosing workspace can be shared and can run anywhere, regardless of the current state of the user's local data store.
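The behaviour described above can be sketched as follows. This is a minimal, runnable illustration using toy in-memory stores; the helper names and store structures are assumptions for illustration only, not the actual Cate API:

```python
import hashlib

# Toy stand-ins for Cate's data stores (assumptions, not the real API)
REMOTE_STORE = {'esacci.CLOUD.mon': 'remote cloud data'}
LOCAL_STORE = {}
DOWNLOADS = []  # records which remote data sources were actually fetched

def _derive_local_id(ds_id, constraints):
    # Same remote ID plus same constraints must yield the same local ID
    digest = hashlib.md5(f'{ds_id}|{constraints}'.encode('utf-8')).hexdigest()
    return f'local.{digest}'

def open_dataset(ds_id, constraints=None, force_local=False, local_ds_id=None):
    if not force_local:
        # Current behaviour: open the local or remote data source directly
        return LOCAL_STORE.get(ds_id, REMOTE_STORE.get(ds_id))
    local_ds_id = local_ds_id or _derive_local_id(ds_id, constraints)
    if local_ds_id not in LOCAL_STORE:
        # Not yet local: download the remote data once, then reuse it
        DOWNLOADS.append(ds_id)
        LOCAL_STORE[local_ds_id] = REMOTE_STORE[ds_id]
    return LOCAL_STORE[local_ds_id]
```

Because the derived ID is deterministic, a second call with the same ds_id and constraints opens the existing local copy instead of downloading again, which is what makes the step reproducible inside a shared workflow.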

Actual behavior

A local data source must be created before any workflow is created. If a workflow step's open_dataset operation then refers to such a local data source, the workflow cannot be shared without also documenting how the local data source was named, which remote ID was used, and which constraints were used to create it.

Specifications

Cate 0.9.0dev3

@barsten

barsten commented Jul 7, 2017

I fully agree.
Question: what will be the behaviour if the ds is not local and the remote store is not available?

@forman
Member Author

forman commented Jul 7, 2017

Poff!

@forman
Member Author

forman commented Jul 18, 2017

I'd like to see this in version 1.0, as this is how Cate should make datasets local in the future. It will also further simplify and clarify the data access API, CLI and GUI.

@forman
Member Author

forman commented Jul 19, 2017

Local data source:

  • id: generate a 32-digit hash from remote-ID, spatio-temporal constraints, variable names
  • title: title field of remote DS (if any, otherwise use remote ID) plus any spatio-temporal constraints, variable names in clear text
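One plausible reading of "a 32-digit hash" is an MD5 hex digest, which is exactly 32 hexadecimal characters. The inputs below (remote ID, spatio-temporal constraints, variable names) follow the comment above, but the separator and encoding are assumptions:

```python
import hashlib

def local_ds_hash(remote_id, time_range, region, var_names):
    # Canonicalize the inputs so that equal constraints always hash equally,
    # regardless of the order in which variable names were given
    key = '|'.join([remote_id, str(time_range), str(region),
                    ','.join(sorted(var_names))])
    return hashlib.md5(key.encode('utf-8')).hexdigest()  # 32 hex digits

h = local_ds_hash('esacci.OZONE.mon',
                  ('2007-01-01', '2007-12-31'),
                  (-10.0, 40.0, 10.0, 60.0),
                  ['O3_du', 'surface_pressure'])
```

Sorting the variable names before hashing is what makes the ID insensitive to argument order, so the same selection always maps to the same local data source.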

@forman
Member Author

forman commented Jul 19, 2017

Remember to keep (or at least recognize) the local. ID prefix, as current code makes use of it.

@forman
Member Author

forman commented Sep 6, 2017

Remember the requirements for the data source ID:

  1. it must be unique within the directory of the local file system
  2. the same remote ID plus the same constraints must produce the same local ID

@forman
Member Author

forman commented Sep 6, 2017

To be discussed: allow local datasets to grow in time.

@kbernat
Collaborator

kbernat commented Sep 6, 2017

For the ID, my suggestion is to combine the local prefix, the remote name, and a constraints hash:

local.esacci.GHG.satellite-orbit-frequency.L2.CH4.multi-sensor.multi-platform.VARIOUS.ch4_v1-0.r1.78685637-a631-3eaa-993e-969fa6134da2

And keep any human-readable description in the Title.
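The suffix in the example ID above has the shape of a name-based (version 3, MD5) UUID. A sketch of assembling such an ID with Python's standard uuid module; the namespace choice and the way constraints are serialized are assumptions:

```python
import uuid

def make_local_id(remote_id, constraints):
    # uuid3 is deterministic: identical inputs always yield the identical suffix,
    # satisfying the requirement that same remote ID + same constraints
    # produce the same local ID
    suffix = uuid.uuid3(uuid.NAMESPACE_URL, f'{remote_id}?{constraints}')
    return f'local.{remote_id}.{suffix}'

ds_id = make_local_id(
    'esacci.GHG.satellite-orbit-frequency.L2.CH4.multi-sensor.multi-platform.'
    'VARIOUS.ch4_v1-0.r1',
    'time=2008,vars=ch4')
```

Keeping the remote name inside the ID keeps it recognizable to users, while the UUID suffix guarantees uniqueness across different constraint sets.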

kbernat pushed a commit that referenced this issue Sep 7, 2017
@forman forman added the blocker label Sep 13, 2017
@forman
Member Author

forman commented Sep 13, 2017

Made this a blocker because #314 depends on its resolution and is cu_mandatory.
