VertNet Modularization process (WIP)

John Wieczorek edited this page Jul 26, 2016 · 3 revisions

VertNet Modularization How-To

  1. Rationale
  2. Proposed layout overview
    1. Advantages
    2. Drawbacks
  3. Shared components
    1. Data storage
    2. yaml files
  4. Module-specific components
  5. Module api (default)
    1. Method search
    2. Method download
    3. Version prod (default)
    4. Version dev
  6. Module dwc-indexer
  7. Module portal-web
    1. Version vertnet (default)
    2. Version tuco / dev / testing...
    3. Version amazonia
    4. Version amazonia-dev
  8. Module tools-*
  9. GitHub repository organization
    1. Core modules: public repos
    2. Associated tools: public / private repos
    3. Shared and sensitive components: private repo
  10. Development cycle tips
    1. Deploying a new module
    2. Updating shared yaml files

Rationale

Proposed layout overview

I propose the following star schema, with a central module that serves data to accessory modules and packages.

  • A central api module as the core of all data-related operations
  • A set of UI-oriented modules linked to the api module to extract, format and serve the data to the users
  • A set of tool-oriented modules linked to the api module to operate on the data

Advantages

The main advantage of modularizing the project in this way (and, in fact, the very reason to do it) is that each module will be isolated from the others, to a certain extent, both in software and (virtual) hardware. This means that heavier-duty modules (such as the api or the dwc-indexer) can be configured to run on higher-class instances, while more modest instance classes suffice for lighter modules (such as the emailer). Also, not all modules need the same software dependencies, or make use of the same libraries and components.

There are a number of shared components, though, and that is a good thing. Sharing these components means less code duplication and better maintainability. See the Shared components section for a list and a description of each one.

Drawbacks

Of course, there are a few drawbacks that we should take into account.

  • There is a hard limit of 20 modules per (paid) application.
  • There is a separate hard limit of 60 versions per (paid) application. That means the sum of all deployed versions from all modules must be equal to or less than 60.
  • We need to strongly encourage (maybe even enforce) a set of contribution best practices, to ensure some basic guidelines are followed.

Shared components

Data storage

All versions of all modules share the datastore and Cloud Storage buckets. That means that potentially every open endpoint could be used to access the underlying data, but it also means that there is no need to duplicate datastores to serve different sub-projects such as Dimensions of Amazonia or DIPnet.

There is a way of isolating pieces (entity Kinds) of the datastore by declaring Namespaces. This can restrict certain modules to certain sections of the datastore, thus avoiding data access conflicts.

yaml files

For a specific App Engine application, there can be only one of each of the following .yaml configuration files:

  • app.yaml: shared application configuration
  • queue.yaml: definition of task queues (such as download, usage stats, ...)
  • index.yaml: definition of non-standard indexes for datastore queries (such as descending orders, multi-property filters, ...)
  • cron.yaml: definition of cron tasks (such as monthly usage stats)
  • dispatch.yaml: definition of URL routing "aliases" (not currently used)
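As an illustration, a minimal queue.yaml covering the download and usage stats queues mentioned above might look like the following sketch. The rates, bucket sizes and retry limits here are made-up values, not the application's actual settings:

```yaml
# queue.yaml -- shared task queue definitions (illustrative values only)
queue:
- name: download        # builds user-requested download files
  rate: 5/s
  bucket_size: 5
  retry_parameters:
    task_retry_limit: 3
- name: usagestats      # aggregates monthly usage statistics
  rate: 1/s
```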

Module-specific components

Apart from the code itself, each module must provide a yaml file with the module definition and instance information (such as instance class, scaling type and so on). It must be named after the module itself. For example, the api module will have an associated api.yaml file that indicates the type of memory and CPU its instances will have.
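A minimal sketch of what such a module file might contain, assuming a Python 2.7 runtime; the instance class and scaling values below are illustrative, not confirmed settings:

```yaml
# api.yaml -- module definition for the api module (illustrative values only)
application: vertnet-portal
module: api
version: prod
runtime: python27
api_version: 1
threadsafe: true
instance_class: F4        # heavier instance class for data-intensive work
automatic_scaling:
  min_idle_instances: 1
  max_idle_instances: 2

handlers:
- url: /.*
  script: main.app
```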

Module api (default)

A single api module will be the core element of the project in the broader sense (encompassing amazonia, dipnet and any other network). This module will be the central endpoint for all data-related operations, such as providing basic CRUD capabilities, launching download-building processes and any other data-intensive task. All other modules will refer to this one for such operations, meaning no other module will access the datastore directly. The main benefit is more maintainable code, since all functions that extract data will be found in this module.

Every project will access the underlying datastore via this module, and data fragmentation (ensuring each project gets only its own data) will be performed at this level, by filtering based on the value of the networks field.
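As a sketch of that fragmentation logic, filtering could look like the function below. The record layout and the comma-separated `networks` field shown here are assumptions for illustration, not the actual datastore schema:

```python
def filter_by_network(records, network):
    """Return only the records that belong to the given network.

    Each record is assumed to carry a 'networks' field holding a
    comma-separated list of network names (e.g. "vertnet,amazonia").
    """
    matched = []
    for record in records:
        networks = [n.strip() for n in record.get("networks", "").split(",")]
        if network in networks:
            matched.append(record)
    return matched

# Example: a request from the amazonia project only sees amazonia records.
records = [
    {"id": "r1", "networks": "vertnet"},
    {"id": "r2", "networks": "vertnet,amazonia"},
    {"id": "r3", "networks": "amazonia"},
]
print([r["id"] for r in filter_by_network(records, "amazonia")])  # ['r2', 'r3']
```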

This also allows external packages (such as rvertnet) to access the data directly by querying the api module, while always relying on the latest available version of the search and download methods.

Method search

Basic method to retrieve data from the datastore based on the submitted parameters. The URL could be something like http://api.vertnet-portal.appspot.com/search?<parameters>

Method download

Basic method to build download files with data from the datastore based on the submitted parameters. The URL could be something like http://api.vertnet-portal.appspot.com/download?<parameters>
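From a client's perspective, calling either method would amount to building a URL against the module endpoint. The helper below sketches that; the parameter names used (q, limit, email) are placeholders, not a confirmed API contract:

```python
from urllib.parse import urlencode

# Module endpoint from the proposal above.
API_BASE = "http://api.vertnet-portal.appspot.com"

def build_url(method, **params):
    """Build a request URL for the api module.

    'method' is 'search' or 'download'; parameters are sorted so the
    resulting URL is deterministic.
    """
    return "%s/%s?%s" % (API_BASE, method, urlencode(sorted(params.items())))

print(build_url("search", q="genus:Puma", limit=100))
print(build_url("download", q="genus:Puma", email="user@example.org"))
```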

Version prod (default)

The main version of the module, in charge of supplying data to requests coming from other modules.

Version dev

Development version, useful for testing new methods without hampering the production workflow. Intended only for within-module development. Other modules (even in their development versions) will not access this version, but rather the prod version.

Module dwc-indexer

A core part of the data workflow, this module will be in charge of indexing the DarwinCore text files into App Engine's Search API documents.

Even though it could be included among the tools-* modules, the preeminent importance of the indexer and the fact that it might need special configuration make this tool worthy of its own module.

UPDATE: 2016-07-24 The indexer component has been changed (ba170b49bf28813a5d23c62d7064f8470c0cf0e1) to be a module (dwc-indexer) and tested on more than half of a complete index reload with traits.
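To give a feel for the indexing work, the snippet below sketches only the first step, pairing a tab-delimited Darwin Core header with a data row; converting the resulting dict into an App Engine Search API document is omitted, and the column names shown are just illustrative Darwin Core terms:

```python
def parse_dwc_row(header_line, row_line):
    """Pair a tab-delimited Darwin Core header with one data row.

    Returns a dict mapping each column name to its value, ready to be
    turned into search document fields in a later step.
    """
    headers = header_line.rstrip("\n").split("\t")
    values = row_line.rstrip("\n").split("\t")
    return dict(zip(headers, values))

header = "scientificname\tcountry\tyear"
row = "Puma concolor\tArgentina\t1998"
print(parse_dwc_row(header, row))
# {'scientificname': 'Puma concolor', 'country': 'Argentina', 'year': '1998'}
```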

Module portal-web

The current webapp module, once freed from the data management parts, will become a more UI-oriented module. From the perspective of the users, its current functionality will remain untouched. But under the hood, it will retrieve the data via calls to the api module rather than extracting the records directly from the datastore.

Version vertnet (default)

The default version of the web portal, pointing to data from the vertnet network.

Version tuco / dev / testing...

Development versions to try new features and/or building new project-based portals.

Version amazonia

A different version of the vertnet portal for the Dimensions of Amazonia project, pointing to data from the amazonia network. The UI layout can be different, but core functionality will remain the same.

Version amazonia-dev

There is no impediment to having development sub-versions for each of the project-specific versions.

This same schema (amazonia + amazonia-dev) can be applied to any other number of projects, as long as we don't reach the limit of versions (see Drawbacks section above).

Module tools-*

Every other tool will have its associated module. Some tools will be open to the public, some will be restricted to internal usage.

Currently developed modules include:

  • API usage tracker (apitracker), private
  • Usage stat report generator (usagestats), private (the generator) and public (the viewer)
  • Batch emailer (emailer), private
  • Geospatial Quality API (api-geospatial), public

But there is potential to build many more, such as:

  • Migrators
  • Traits service
  • Gazetteer/locality service
  • Deduplication service
  • ...

There is a limitation here imposed by Google App Engine: the maximum number of modules for the whole application is 20. This layout implies "using" 7 module "slots" (3 for the core modules and one for each of the 4 currently deployed tools). There are very few modules implemented right now, but this might become an issue in the future.

As a potential solution, we could merge all tools into a single tools module. The drawback is a total lack of isolation: each module has to be deployed as a whole, and shares instance resources among all its components. This hampers parallel development and can cause many code consistency issues if committers don't exercise strong responsibility and foresight.

GitHub repository organization

An important part of the new schema is how to organize the code and assets and how to distribute them on GitHub repositories. I propose the following repository structure.

Core modules: public repos

Each "core" module (i.e., api and dwc-indexer for now) will have its own public repository. Apart from the code that implements the functionality, each will have its corresponding yaml module file, but no other yaml file.

Associated tools: public / private repos

Each tool module will, in principle, have its own repository (as long as we don't hit the limits shown in the Drawbacks section). The nature of the repository (public/private) will depend on how sensitive the module's task is, although I would advocate for more public than private repos. As with the core modules, each tool module will have its corresponding yaml module file and no other yaml file.

Shared and sensitive components: private repo

Since there must be only one of each shared component, I suggest creating a single private repository to hold them. Having a single place for them ensures there is an "authoritative" source for these files, which helps keep coherence among all modules. And keeping it private means that, in order to make modifications, one must be part of the VertNet organization.

Having this private repo would also be beneficial for storing some API keys (such as GitHub or CartoDB keys), for the same reason as above: it ensures an "authoritative" source for these files, and in the unlikely case that the keys get compromised and need to be changed, the new keys can be uploaded there and everyone will have instant access.

Development cycle tips

Deploying a new module

Updating shared yaml files