Commit

maintenance: more architecture description
lnielsen committed Nov 10, 2021
1 parent 539d7dc commit 94ad0b4
Showing 3 changed files with 160 additions and 16 deletions.
1 change: 1 addition & 0 deletions architecture.drawio
@@ -0,0 +1 @@
<mxfile host="Electron" modified="2021-11-10T16:24:45.754Z" agent="5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) draw.io/15.4.0 Chrome/91.0.4472.164 Electron/13.5.0 Safari/537.36" etag="hyAAV2-XiX_YF9Sr3LEq" version="15.4.0" type="device"><diagram name="Page-1" id="5f0bae14-7c28-e335-631c-24af17079c00">3Vrfc6M2EP5rPHP3UA8gfsSPvSTttZPrZeLptG8dGTagiUBUyLGdv77CCNtIxCbu2WA/Ga9gQd/37WoXNEK36fJXjvPkG4uAjhwrWo7Q3chxbNvy5U9pWVWWySSoDDEnkTppa5iSN1BGS1nnJIKicaJgjAqSN40hyzIIRcOGOWeL5mnPjDbvmuMYDMM0xNS0/kUikWzmZW0HvgKJE3XrG08NzHD4EnM2z9T9MpZBNZLi2o06tUhwxBY7JnQ/QrecMVEdpctboCWsNWLVdb+8M7p5ZA6Z6HLBG3/7Ov1n+S9eLfyX7M8X93nh/aS8vGI6V1A8ciikRywIy+QIxSvgagJiVeO1njGUjq0R+rJIiIBpjsNydCEVIm2JSKn8Z8tDdQvgApbvPru9QUSKDFgKgq/kKeoCVIOo9OW46v9iy5Zb25IdomobVgKJN663SMkDBdYHgHMM4KbAX4mEYKiYIb9vzJCB2R0WuLxTGEJRDBY596Zv5LwWtclcGOH1YxeC8TLBOT6VD/JlJhH04/Lo0z3FhSBhAZiHyWcDWJmS8vIwXFEiEeboMLyziouH2cawyYDf50K6AWUvqiRvez+GE89rcuJNTE4mLZRMTkWJfyQl4/H4oonw/YEREbQsYSQ9REMkM88MF3DRXCA0MC5ss54w8IUs+rks2UqIZXoqSFgCIzAXtXldRDUAbq4AEMniTTlkXCQsZhmm91urwrY8bz+y8sHYnIewb0oKK/mAMYhDFYFJ1Q4VXgsVtY0DlSXXa/N52/hRd3hkRM5kmx5tbcnyNYqreaqrdqtE3ZEW3q6tOapwMByt5bKZ9v9QkGUo6EnmVR6ZOqJUtgdwlorAs7wmKo4ZaEELu8g7VaCZ9ee+QFMR1YiybfANI9BQxzizrXaqzhRo7ju14YcDTXOEgjMHmnt9qbqzhHpVkN6XoWMVFDQdOboUT60gs597gmJOxaBSNbJQz6nabN4uPVU7XeMMDSlVo6NrIs2Rc+5UbfaaR6TqoYko6CiiXjXkaHW1f+xy72h1ta9n/VNryGyTr0BD3iVoyNe69MA7UkMb8dWOrDNr6OYaNeRfgoYCbQkK9FfNXTUUTLQ8pL8LOrWGJoaG/v72YMrojBUjmjQrRrte3vuqGOsGYQej36ff/+gVJHdwIJmvGnsHSe89+gfJbNGuIGPXOwQOth+TPlO2oy37xtfBzqWjtuzbZ07ZjvmmaDwe9xpo/uAC7SpbtK6BVimkt/r6UHx0ro20iLX1iD11oF1lj9ZZRL2+ldW5d/TeqrOIdDXq3d6pRWQW2NNVISCVtmcCNCrMr+F38zQH3jJQVlTTMIEUm2O/SUEtR+VGtzwnWVzsWRA2n9Qpm0dnWSBsV/tg53nGAmH7LXKyrVMtEajtPbAO6i3LnkncZk9zlsnJt5D0HkGPwFMiU8F6H1/OKAlX5kkPJHspoxPSXMYPtLlfbx4aGLtasE5a2G1b/239ze2PY9csAD7dwecpcIIpeWuNrieoksqeodF6z6tESrRoAiisURG4eBl2+NktO+jaw89xx+jmwxzJv9uNtFUm3W5URvf/AQ==</diagram></mxfile>
171 changes: 155 additions & 16 deletions docs/maintenance/architecture.md
@@ -20,13 +20,37 @@ InvenioRDM has a layered architecture that consistent of three layers:
- Service layer
- Data access layer

There is a strict data flow between the layers, and each layer has very specific responsibilities. It's highly important that you as a developer know the basic principles for the data flow and each layer's responsibilities. Failure to understand the basic data flow leads to using the wrong objects for the wrong things, which eventually turns into messy code.
There is a strict data flow between the layers, and each layer has very specific responsibilities. It's highly important that you as a developer know the basic principles for the data flow and each layer's responsibilities. Failure to understand the basic data flow leads to using the wrong objects for the wrong things, which eventually turns into messy, unmaintainable code.

**Data flow basics**

The diagram below shows a simplified view of the data flow in the architecture.

![Architecture layers](img/architecture.svg)

*The presentation layer* parses incoming requests and routes them to the service layer. This involves sending and receiving data in multiple different formats and translating these into an internal representation, as well as e.g. parsing arguments from an HTTP request (such as the query string parameters).

*The service layer* is completely independent from the presentation layer and can be used by many different presentation interfaces such as REST APIs, CLIs, and Celery tasks. The service layer contains the overall control flow and is responsible for e.g. checking permissions and performing semantic data validation.

*The data access layer* is responsible for ensuring data integrity, harmonizing data access to different storages as well as fetching and storing the data in the underlying systems.

The data flow between the layers is strictly limited to a few well-defined objects to ensure a clean separation of concerns. The presentation layer communicates with the service layer via e.g. a record projection (i.e. a view of a record localised to a specific identity). The service layer communicates with the data access layer via e.g. a record entity that provides data abstraction, syntactic data validation, and a strong programmatic API.
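
The data flow described above can be sketched in plain Python. This is a minimal illustration, not InvenioRDM code: every class and function name below (`Identity`, `RecordEntity`, `RecordProjection`, `DataAccessLayer`, `Service`, `rest_view`) is hypothetical, and an in-memory dictionary stands in for the database.

```python
from dataclasses import dataclass, field


@dataclass
class Identity:
    """Hypothetical stand-in for the user performing the request."""
    user_id: int
    can_read: bool = True


@dataclass
class RecordEntity:
    """Data access layer object: the only thing the service layer stores/fetches."""
    id: str
    data: dict = field(default_factory=dict)


@dataclass
class RecordProjection:
    """View of a record localised to one identity; handed to the presentation layer."""
    id: str
    title: str
    viewer: int


class DataAccessLayer:
    """Fetches and stores record entities (a dict stands in for the database)."""

    def __init__(self):
        self._storage = {}

    def store(self, record):
        self._storage[record.id] = record

    def get(self, record_id):
        return self._storage[record_id]


class Service:
    """Service layer: control flow and permission checks, no HTTP knowledge."""

    def __init__(self, data_layer):
        self._data_layer = data_layer

    def read(self, identity, record_id):
        if not identity.can_read:
            raise PermissionError("identity may not read this record")
        record = self._data_layer.get(record_id)
        return RecordProjection(
            id=record.id, title=record.data["title"], viewer=identity.user_id
        )


def rest_view(service, identity, path):
    """Presentation layer: parse the request, call the service, serialize the result."""
    record_id = path.rstrip("/").split("/")[-1]
    projection = service.read(identity, record_id)
    return {"id": projection.id, "metadata": {"title": projection.title}}
```

Because `Service.read` only receives an identity and an identifier, the same method could be called from a CLI command or a background task without change; only `rest_view` knows about request paths and response shapes.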

!!! tip "Tip: Where do you belong?"

A key question you should always ask yourself when designing or writing code is where your code belongs in the architecture:

- Is it a presentation, service, or data access layer object?
- Is the object crossing boundaries between layers?

Answering where your code belongs helps identify and disentangle responsibilities.


### Data access layer

The data access layer is responsible for:

- Fetching and storing data on primary (the database) and secondary storage (Elasticsearch, cache, ...).
- Fetching and storing data on primary (the database) and secondary storage (Elasticsearch, cache, files, ...).
- Harmonizing data access to the same object on primary and secondary storages (e.g. a record in the database vs in the Elasticsearch index).
- Ensuring data integrity and managing relations among data objects.

@@ -39,21 +63,65 @@ The data access layer usually lives inside an Invenio module in a package named
- System fields (``/records/systemfields/``).
- Dumpers (``/records/dumpers/``).

**Principles**
**Purpose**

TODO
The data access layer serves two purposes:

- Provide a strong programmatic API that produces a clean, simple and reliable
control flow in the service layer.
- Persist our business objects on data storage in a reliable and performant
way.

!!! tip "Tip: Messy service layer?"

- Data representation
If your service layer code looks messy, you likely need to work on your data
access layer.

- One primary storage, many secondary storages
A typical example is the service layer doing data-wrangling with
dictionaries. For instance a conditional get on a dictionary key (e.g.
``data.get('...')``), or having to e.g. convert back and forth between
data types (e.g. UUIDs to/from strings).
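
To make the tip concrete, the sketch below contrasts service code that wrangles a raw dictionary with service code that relies on a data access layer object. The `Record` class, its `parent_id` property and the `parent` key are hypothetical names chosen for illustration only.

```python
import uuid


def messy_service_read(data):
    """Service code doing data-wrangling on a raw dictionary."""
    parent = data.get("parent")  # conditional get on a dictionary key
    # Type conversion (string -> UUID) leaking into the service layer.
    return uuid.UUID(parent["id"]) if parent else None


class Record:
    """Hypothetical data access layer object that hides the dictionary."""

    def __init__(self, data):
        self._data = data

    @property
    def parent_id(self):
        """Return the parent identifier as a UUID, or None when absent."""
        parent = self._data.get("parent")
        return uuid.UUID(parent["id"]) if parent else None


def clean_service_read(record):
    """Service code working with typed attributes only."""
    return record.parent_id
```

Both functions return the same value; the difference is that the conditional lookup and the type conversion live in one place in the data access layer instead of being repeated wherever the service needs the parent identifier.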

- Data versioning
**Guiding principles**

- Denormalize full objects
The data layer is built around the following guiding principles:

- One data representation: The service layer should work with one and only one
data representation of an entity, regardless of whether the entity was
retrieved from primary or secondary storage.

- One primary storage, many secondary storages: The primary version of a record
exists in one and only one copy on the primary storage (the database);
however, multiple secondary copies may exist in the search index.

- Idempotence of dumping/loading: Dumping and loading to/from secondary storage
(such as the search index) must produce the same record.

- Denormalization over normalization: If we have to choose, we usually prefer
fast read speed over fast write speed.

- Data versioning: We version data and rely heavily on optimistic
concurrency control for detecting conflicts and determining stale secondary
copies.
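
Two of the principles above, idempotence of dumping/loading and data versioning with optimistic concurrency control, can be sketched as follows. This is an illustration under stated assumptions, not the InvenioRDM implementation: `PrimaryStorage`, `dump`, `load`, `is_stale`, `StaleVersionError` and the `version_id` key are all hypothetical names, and the sketch assumes the record data itself never uses a `version_id` key.

```python
import copy


class StaleVersionError(Exception):
    """Raised when a write is based on an outdated version of the record."""


class PrimaryStorage:
    """In-memory stand-in for the database: one copy per record plus a version."""

    def __init__(self):
        self._rows = {}

    def create(self, record_id, data):
        self._rows[record_id] = (1, copy.deepcopy(data))
        return 1

    def read(self, record_id):
        version, data = self._rows[record_id]
        return version, copy.deepcopy(data)

    def update(self, record_id, data, expected_version):
        current_version, _ = self._rows[record_id]
        # Optimistic concurrency control: reject writes based on an old version.
        if current_version != expected_version:
            raise StaleVersionError(
                f"expected version {expected_version}, found {current_version}"
            )
        self._rows[record_id] = (current_version + 1, copy.deepcopy(data))
        return current_version + 1


def dump(version, data):
    """Produce the document stored in secondary storage (e.g. a search index)."""
    return {"version_id": version, **copy.deepcopy(data)}


def load(document):
    """Rebuild the version and the data from a secondary storage document."""
    data = copy.deepcopy(document)
    version = data.pop("version_id")
    return version, data


def is_stale(document, primary_version):
    """A secondary copy is stale when the primary copy has a newer version."""
    return document["version_id"] < primary_version
```

Here `load(dump(version, data))` returns the same version and data that were dumped, and the version number serves both to reject conflicting writes and to detect secondary copies that lag behind the primary storage.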

**Record API**

TODO
The record API is the primary programmatic API that the service layer uses to
work with the data access layer. The record API ensures data integrity and manages
the life-cycle of the record itself and related objects such as persistent
identifiers and files.

The record is in charge of:

- defining the structural schema that data is validated against (using
JSONSchemas).
- defining search index routing and indexing behaviour.
- managing the life-cycle of an associated persistent identifier.
- data versioning.
- state management.

A record is usually defined using a declarative API named system fields based
on Python data descriptors.

**JSONSchemas**

@@ -90,11 +158,11 @@ System fields are responsible for:
- manages relations with other objects
- hooking into the record life-cycle

System fields basically provides a programmatic API that makes it easier to work with records and related objects. Under the hood, system fields are Python data descriptors.
System fields basically provide a declarative programmatic API that makes it easier to work with records and related objects. Under the hood, system fields are Python data descriptors.

A key design principle for system fields is that an *instance* of a system field manages a single namespace of a record so that system fields do not conflict. For instance, an access system field manages the top-level ``access`` key in a record ``{'access': ...}``.
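
The namespace principle can be illustrated with a plain Python data descriptor. This is a simplified sketch and not the actual system field base class from InvenioRDM: `DictField` and the `Record` class below are hypothetical, and real system fields additionally hook into the record life-cycle.

```python
class DictField:
    """Hypothetical system field: a data descriptor owning one top-level key."""

    def __init__(self, key=None):
        self._key = key

    def __set_name__(self, owner, name):
        # Default the managed key to the attribute name on the record class.
        if self._key is None:
            self._key = name

    def __get__(self, instance, owner=None):
        if instance is None:
            return self  # class-level access returns the field itself
        # Only ever touch the single key this field instance manages.
        return instance.setdefault(self._key, {})

    def __set__(self, instance, value):
        if not isinstance(value, dict):
            raise TypeError(f"{self._key} must be a dict")
        instance[self._key] = value


class Record(dict):
    """A record is a dictionary; fields are declared on the class."""

    access = DictField()
    files = DictField()
```

Assigning `record.access = {"record": "public"}` writes only to the `access` key, while `record.files` reads and writes only the `files` key, so the two field instances cannot overwrite each other's data.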

System fields participate in the dumping/loading of records from secondary storage via being able to hook into the record life-cycle. The difference between system fields and dumpers, is that a dumpers produce a dump fo a specific secondary storage system, while system fields produce the same dump for all secondary storage systems.
System fields participate in the dumping/loading of records from secondary storage by hooking into the record life-cycle. The difference between system fields and dumpers is that a dumper produces a dump for a specific secondary storage system, while system fields produce the same dump for all secondary storage systems.

System fields may be used to manage relations to other objects, and can work similarly to a foreign key.

@@ -108,7 +176,7 @@ System fields to a large degree avoids building inheritance among record APIs an

**SQLAlchemy models**

SQLAlchemy record models are responsible for storing the master version of a record (i.e. the primary storage). All record models share some few common properties:
SQLAlchemy record models are responsible for storing the master version of a record (i.e. the primary storage) and provide database independence. All record models share a few common properties:

- A JSON column for storing the JSON-encoded document of a record.
- An internal UUID identifier.
@@ -121,7 +189,7 @@ It's important to understand that there's two distinct representations of a reco
- Python dictionary
- JSON document

These two distinct representations of a record may often be very similar, but it's important to understand that the JSON document is constrained to the JSON object model, while the Python dictionary can hold more rich types as long as they are JSON-serializable.
These two distinct representations of a record may often be very similar, but it's important to understand that the JSON document is constrained to the JSON object model, while the Python dictionary can hold richer data types as long as they are JSON-serializable (e.g. a datetime object).
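
The difference between the two representations can be shown with the standard library alone. The `encode` helper and the `created` key below are hypothetical and only illustrate the idea; they do not reflect how InvenioRDM serializes records.

```python
import json
from datetime import datetime, timezone

# Python dictionary representation: may hold rich types such as datetime.
record = {"title": "Test", "created": datetime(2021, 11, 10, tzinfo=timezone.utc)}


def encode(value):
    """Convert rich Python types to something the JSON object model allows."""
    if isinstance(value, datetime):
        return value.isoformat()
    raise TypeError(f"cannot serialize {type(value).__name__}")


# JSON document representation: constrained to strings, numbers, booleans,
# null, arrays and objects, so the datetime becomes an ISO 8601 string.
document = json.dumps(record, default=encode)
loaded = json.loads(document)

# Loading the JSON document does not restore the rich type by itself.
restored = datetime.fromisoformat(loaded["created"])
```

After the round trip, `loaded["created"]` is a plain string; turning it back into a `datetime` is an explicit step, which is the kind of conversion the data access layer is meant to handle so that the service layer does not have to.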

### Service Layer

@@ -131,15 +199,19 @@ The service layer contains the business logic of the application and is responsi
- Business-level validation
- Control flow - e.g. transaction management,

**Principles**
**Guiding principles**

TODO

- Mimic the end-user interface

- Clean control flow

- Interface independent

- Independent of the Flask request context

-Data flow
- Data flow

- Components responsible for setting data on a record.

@@ -189,4 +261,71 @@ Responsible for providing a specific feature in the service, and make the servic

The presentation layer

## Customizations
**Guiding principles**

TODO

**Celery tasks**

TODO

**Views**

TODO

**Resources**

TODO

- RESTful routing
- Dependency injection

**Resources request context**

TODO

- Contains only validated parsed data.
- HTTP request parsing: body, headers, query string, path
- Content negotiation

**Resource configs**

TODO

- Dependency injection

## How did we arrive here?

The overarching goal of the architecture is similar to any other software
system. We want a software system that's easily maintainable, scalable,
extendable, adaptable, resilient, and *...insert your favorite buzz words...*.

There are many methodologies and patterns for building and architecting
software systems. However, in practice, while methodologies are useful,
it's often more about tradeoffs and finding the right balance than
strict application of a specific methodology. Most of the time you have to deal
with deadlines, requirements, design patterns, costs, legacy code, people,
projects, prior history and practices.

InvenioRDM is no different. The architecture is largely a by-product of our past
experiences and the challenges we've faced. The architecture as described here is
not meant to be the final answer, but rather an evolving architecture that adapts
and improves over time.

TODO

Some of the challenges we faced:

- **High developer turn-over and many juniors**: Onboarding, documentation,
boundaries, spaghetti code.
- **Spaghetti code**: data massaging all over the place, type conversions.
- **Bad design choices**: moving big files,
- **Recovering from failures**: massive database crashes, file loss on big
distributed storage clusters, and eating our own dog food.

## Why not?

TODO

- Microservices: They are not a substitute for an architecture. They are just another way of tying a system together. Running the system becomes harder, especially for small institutions.
- NoSQL: SQL databases have been around for the past 40 years, and are highly reliable systems. Most NoSQL systems have not been around for so long.