Harvested resources

Ricardo Garcia Silva edited this page Apr 13, 2021 · 5 revisions

This page describes the purpose of harvested resources and their "acceptance criteria" (i.e. what we promise).

A resource can be harvested in one of two ways:

  1. The resource's metadata is copied over to GeoNode and the actual data remains at the original location.

  2. Both the metadata and the data are copied over to GeoNode. This is effectively a copy operation. The user shall be able to choose whether to keep the original link to the remote server or to unlink it, turning the harvested resource into a new, independent resource.

    From the GeoNode point of view, harvested resources shall mostly be read-only. The source of truth remains the original server, and modifications to the resource shall either be made on the remote server or be proxied to it by the local GeoNode. The exception is when the user chooses to unlink a harvested data resource, as described in point 2 above.

    Implications for the GeoNode data model

    A harvested resource that keeps the link to the remote server shall be treated differently by GeoNode:

    • Resource ownership and publishing information shall be set according to the harvester settings (owner, group, permissions). These properties can also be changed after the resource has been created on the local GeoNode.

    • Most of the original resource's remaining metadata is simply copied and stays read-only. For most metadata fields this does not seem to be problematic. However, some modifications need to be made:

      • Currently a resource has only one date, which may be of type creation, publishing or modification. However, for a harvested resource it is relevant to know both the original date set by the remote server and the date of harvesting, so GeoNode shall introduce an additional date property for these resources.
      • An additional "harvested resource" keyword shall be added to the resource in order to make it easier for end users to find harvested resources.
      • The responsible parties are not registered users on the local GeoNode. As such, it will be necessary to modify these GeoNode resource properties so that their details can be filled in from data that does not come from known users. This applies to the Point of Contact and Metadata author properties. The Owner shall be set to a preconfigured local user.
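
The data model implications listed above can be sketched with plain Python dataclasses. This is only an illustration and not the actual GeoNode model: the class names, field names and the `"harvested"` keyword value are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional

# Hypothetical keyword value; the page only says that a keyword shall be added.
HARVESTED_KEYWORD = "harvested"


@dataclass
class ResponsibleParty:
    """Contact details that do not need to map to a local GeoNode user."""
    name: str
    email: Optional[str] = None
    organisation: Optional[str] = None


@dataclass
class HarvestedResource:
    title: str
    owner: str  # preconfigured local user, taken from the harvester settings
    remote_date: datetime  # original date set by the remote server
    harvested_date: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )  # additional date: when the resource was harvested
    keywords: List[str] = field(default_factory=list)
    point_of_contact: Optional[ResponsibleParty] = None
    metadata_author: Optional[ResponsibleParty] = None
    linked_to_remote: bool = True

    def __post_init__(self) -> None:
        # Tag the resource so end users can find harvested resources.
        if HARVESTED_KEYWORD not in self.keywords:
            self.keywords.append(HARVESTED_KEYWORD)

    @property
    def metadata_read_only(self) -> bool:
        # Copied metadata stays read-only while the remote link is kept.
        return self.linked_to_remote
```

In this sketch, unlinking a resource (setting `linked_to_remote` to `False`) is what makes its copied metadata editable locally.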

Draft workflow

  1. The harvester is configured by the user. The configuration includes a periodic schedule.

    Harvesting configuration shall be stored in a Django model. This shall allow a privileged user to modify the configuration at runtime. Settings that we want to be able to configure dynamically:

    • Harvester type: GeoNode, CKAN, GEM, Generic CSW, etc.
    • URL of the remote service
    • Harvester paused or active
    • Harvesting periodicity
    • Ownership of harvested resources
    • Visibility of harvested resources
    • Harvester type-specific settings
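
The configurable settings listed above could be grouped roughly as follows. The page calls for a Django model; this sketch uses a plain dataclass so that it runs without a Django project. All names, defaults and the `is_due` helper are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import timedelta
from enum import Enum
from typing import Any, Dict


class HarvesterType(Enum):
    GEONODE = "geonode"
    CKAN = "ckan"
    GEM = "gem"
    GENERIC_CSW = "generic_csw"


@dataclass
class HarvesterConfig:
    harvester_type: HarvesterType
    remote_url: str
    active: bool = True  # False means the harvester is paused
    periodicity: timedelta = timedelta(days=1)
    default_owner: str = "admin"  # ownership of harvested resources
    publish_harvested: bool = False  # visibility of harvested resources
    type_specific_settings: Dict[str, Any] = field(default_factory=dict)

    def is_due(self, elapsed_since_last_run: timedelta) -> bool:
        """Hypothetical helper a scheduler could call to decide whether to run."""
        return self.active and elapsed_since_last_run >= self.periodicity
```

For example, `HarvesterConfig(HarvesterType.CKAN, "https://example.org/ckan")` describes an active CKAN harvester that becomes due once a day.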

    The scheduler design is still TBD.

  2. When the time comes, the harvester is triggered by a scheduler (it can also be triggered manually by the user).

    1. The harvester shall execute asynchronously, in a separate process.
    2. The harvester shall communicate with GeoNode by storing state in the main GeoNode DB.
    3. We also want to store state related to the harvesting process itself, not just to the harvested resources. This means we need something like a HarvestSession DB model.
    4. Harvesting sessions shall be able to report their state. They shall also be actionable: the user should be able to interrupt an ongoing harvesting operation.
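
A session that reports its state and can be interrupted might look like the sketch below. It runs in a single process for simplicity; in the design above the state would live in the GeoNode DB and the worker would re-read it between resources. The status values, field names and the `refresh` callback (which stands in for that DB re-read) are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class SessionStatus(Enum):
    PENDING = "pending"
    RUNNING = "running"
    ABORTING = "aborting"
    ABORTED = "aborted"
    FINISHED = "finished"


@dataclass
class HarvestSession:
    harvester_id: int
    status: SessionStatus = SessionStatus.PENDING
    total_found: int = 0
    total_processed: int = 0

    def request_abort(self) -> None:
        # Called on behalf of the user; the worker notices it on its next check.
        if self.status in (SessionStatus.PENDING, SessionStatus.RUNNING):
            self.status = SessionStatus.ABORTING

    def report(self) -> str:
        return f"{self.status.value}: {self.total_processed}/{self.total_found} resources"


def run_session(
    session: HarvestSession,
    resources: List[str],
    refresh: Callable[[HarvestSession], None] = lambda s: None,
) -> None:
    session.status = SessionStatus.RUNNING
    session.total_found = len(resources)
    for _ in resources:
        refresh(session)  # stand-in for re-reading the session row from the DB
        if session.status is SessionStatus.ABORTING:
            session.status = SessionStatus.ABORTED
            return
        session.total_processed += 1  # the actual harvesting work would go here
    session.status = SessionStatus.FINISHED
```

Checking for an abort request once per resource keeps the interruption cooperative: the worker stops at a well-defined point rather than being killed mid-update.
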
  3. The harvester then proceeds to check the remote service for existing resources. The way this check is done shall vary depending on the type of harvester:

    • If the harvester is of the GeoNode type, it shall use GeoNode's CSW API to get the existing resources
    • A harvester of the CKAN type shall use the CKAN REST API
    • etc.
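
One way to vary the remote check by harvester type is a common base class with one subclass per type. The sketch below performs no network access: the `fetch` callable, the protocol labels and the registry are hypothetical stand-ins for the real CSW and CKAN REST clients.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict, List, Type

# Hypothetical signature: fetch(remote_url, protocol) -> list of remote resource ids
Fetch = Callable[[str, str], List[str]]


class BaseHarvester(ABC):
    def __init__(self, remote_url: str, fetch: Fetch) -> None:
        self.remote_url = remote_url
        self._fetch = fetch

    @abstractmethod
    def list_remote_resources(self) -> List[str]:
        """Return identifiers of the resources found on the remote service."""


class GeoNodeHarvester(BaseHarvester):
    def list_remote_resources(self) -> List[str]:
        # A real implementation would query the remote GeoNode's CSW API.
        return self._fetch(self.remote_url, "csw")


class CkanHarvester(BaseHarvester):
    def list_remote_resources(self) -> List[str]:
        # A real implementation would call the CKAN REST API.
        return self._fetch(self.remote_url, "ckan-rest")


HARVESTER_REGISTRY: Dict[str, Type[BaseHarvester]] = {
    "geonode": GeoNodeHarvester,
    "ckan": CkanHarvester,
}


def build_harvester(harvester_type: str, remote_url: str, fetch: Fetch) -> BaseHarvester:
    try:
        harvester_cls = HARVESTER_REGISTRY[harvester_type]
    except KeyError:
        raise ValueError(f"Unsupported harvester type: {harvester_type!r}") from None
    return harvester_cls(remote_url, fetch)
```

Adding a new harvester type then means writing one subclass and registering it, without touching the rest of the workflow.
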
  4. For each resource found, the harvester reports the existing metadata to the resource manager. The resource manager is the entity responsible for actually generating new GeoNode resources. Creating a new GeoNode resource is a potentially multi-step activity that may involve communication with GeoServer (or another third-party application). The resource manager may even ask the harvester to download the actual data during this step.

  5. The resource manager creates a resource descriptor from the metadata reported by the harvester, adding information from the internal GeoNode DB (for example, whether the resource already exists in the DB). This resource descriptor is sent back to the harvester as an acknowledgment of the reported resource.
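
The acknowledgment in step 5 can be illustrated with a small sketch. The `ResourceDescriptor` fields, the `acknowledge` method and the in-memory dictionary (standing in for a lookup in the GeoNode DB) are assumptions, not an existing GeoNode API.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass
class ResourceDescriptor:
    remote_id: str
    metadata: Dict[str, Any]  # as reported by the harvester
    local_id: Optional[int]   # filled in from the local DB, None if unknown

    @property
    def exists_locally(self) -> bool:
        return self.local_id is not None


class ResourceManager:
    def __init__(self, known_resources: Dict[str, int]) -> None:
        # Maps remote ids to local ids; stands in for the GeoNode DB.
        self._known = known_resources

    def acknowledge(self, remote_id: str, metadata: Dict[str, Any]) -> ResourceDescriptor:
        """Combine harvester-reported metadata with local DB information."""
        return ResourceDescriptor(
            remote_id=remote_id,
            metadata=dict(metadata),
            local_id=self._known.get(remote_id),
        )
```

The harvester receives the descriptor back and can use `exists_locally` to tell a new resource from one that only needs updating.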

  6. The resource manager then proceeds to update or generate GeoNode resources for each resource found by the harvester. This may entail:

    1. Creating a new GeoNode resource, if the remote resource is new
    2. Updating a GeoNode resource's metadata properties
    3. Asking the Harvester to download the resource (if needed)
    4. Updating a GeoNode resource's data
    5. Updating a GeoNode resource's style
    6. Updating a GeoNode resource's thumbnail
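
The sub-steps above can be read as a plan that the resource manager derives by comparing the local and remote states of a resource. The following is a minimal sketch under that assumption; the dictionary keys, the checksum comparison and the action names are invented for illustration.

```python
from typing import Any, Dict, List, Optional


def plan_actions(
    local: Optional[Dict[str, Any]],
    remote: Dict[str, Any],
    copy_data: bool,
) -> List[str]:
    """Return the ordered list of actions needed to sync one resource."""
    actions: List[str] = []
    if local is None:
        # The remote resource is new to this GeoNode.
        actions.append("create_resource")
        local = {}
    if local.get("metadata") != remote.get("metadata"):
        actions.append("update_metadata")
    # Data is only downloaded when the harvester copies data, not just metadata.
    if copy_data and local.get("data_checksum") != remote.get("data_checksum"):
        actions.extend(["download_data", "update_data"])
    if local.get("style") != remote.get("style"):
        actions.append("update_style")
    if local.get("thumbnail") != remote.get("thumbnail"):
        actions.append("update_thumbnail")
    return actions
```

A resource that is already up to date yields an empty plan, so repeated harvesting runs do no unnecessary work.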