maintenance: more architecture description

inveniosoftware · Nov 10, 2021 · 94ad0b4
1 parent 539d7dc
commit 94ad0b4

Showing 3 changed files with 160 additions and 16 deletions.
diff --git a/architecture.drawio b/architecture.drawio
@@ -0,0 +1 @@
+<mxfile host="Electron" modified="2021-11-10T16:24:45.754Z" agent="5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) draw.io/15.4.0 Chrome/91.0.4472.164 Electron/13.5.0 Safari/537.36" etag="hyAAV2-XiX_YF9Sr3LEq" version="15.4.0" type="device"><diagram name="Page-1" id="5f0bae14-7c28-e335-631c-24af17079c00">3Vrfc6M2EP5rPHP3UA8gfsSPvSTttZPrZeLptG8dGTagiUBUyLGdv77CCNtIxCbu2WA/Ga9gQd/37WoXNEK36fJXjvPkG4uAjhwrWo7Q3chxbNvy5U9pWVWWySSoDDEnkTppa5iSN1BGS1nnJIKicaJgjAqSN40hyzIIRcOGOWeL5mnPjDbvmuMYDMM0xNS0/kUikWzmZW0HvgKJE3XrG08NzHD4EnM2z9T9MpZBNZLi2o06tUhwxBY7JnQ/QrecMVEdpctboCWsNWLVdb+8M7p5ZA6Z6HLBG3/7Ov1n+S9eLfyX7M8X93nh/aS8vGI6V1A8ciikRywIy+QIxSvgagJiVeO1njGUjq0R+rJIiIBpjsNydCEVIm2JSKn8Z8tDdQvgApbvPru9QUSKDFgKgq/kKeoCVIOo9OW46v9iy5Zb25IdomobVgKJN663SMkDBdYHgHMM4KbAX4mEYKiYIb9vzJCB2R0WuLxTGEJRDBY596Zv5LwWtclcGOH1YxeC8TLBOT6VD/JlJhH04/Lo0z3FhSBhAZiHyWcDWJmS8vIwXFEiEeboMLyziouH2cawyYDf50K6AWUvqiRvez+GE89rcuJNTE4mLZRMTkWJfyQl4/H4oonw/YEREbQsYSQ9REMkM88MF3DRXCA0MC5ss54w8IUs+rks2UqIZXoqSFgCIzAXtXldRDUAbq4AEMniTTlkXCQsZhmm91urwrY8bz+y8sHYnIewb0oKK/mAMYhDFYFJ1Q4VXgsVtY0DlSXXa/N52/hRd3hkRM5kmx5tbcnyNYqreaqrdqtE3ZEW3q6tOapwMByt5bKZ9v9QkGUo6EnmVR6ZOqJUtgdwlorAs7wmKo4ZaEELu8g7VaCZ9ee+QFMR1YiybfANI9BQxzizrXaqzhRo7ju14YcDTXOEgjMHmnt9qbqzhHpVkN6XoWMVFDQdOboUT60gs597gmJOxaBSNbJQz6nabN4uPVU7XeMMDSlVo6NrIs2Rc+5UbfaaR6TqoYko6CiiXjXkaHW1f+xy72h1ta9n/VNryGyTr0BD3iVoyNe69MA7UkMb8dWOrDNr6OYaNeRfgoYCbQkK9FfNXTUUTLQ8pL8LOrWGJoaG/v72YMrojBUjmjQrRrte3vuqGOsGYQej36ff/+gVJHdwIJmvGnsHSe89+gfJbNGuIGPXOwQOth+TPlO2oy37xtfBzqWjtuzbZ07ZjvmmaDwe9xpo/uAC7SpbtK6BVimkt/r6UHx0ro20iLX1iD11oF1lj9ZZRL2+ldW5d/TeqrOIdDXq3d6pRWQW2NNVISCVtmcCNCrMr+F38zQH3jJQVlTTMIEUm2O/SUEtR+VGtzwnWVzsWRA2n9Qpm0dnWSBsV/tg53nGAmH7LXKyrVMtEajtPbAO6i3LnkncZk9zlsnJt5D0HkGPwFMiU8F6H1/OKAlX5kkPJHspoxPSXMYPtLlfbx4aGLtasE5a2G1b/239ze2PY9csAD7dwecpcIIpeWuNrieoksqeodF6z6tESrRoAiisURG4eBl2+NktO+jaw89xx+jmwxzJv9uNtFUm3W5URvf/AQ==</diagram></mxfile>
diff --git a/docs/maintenance/architecture.md b/docs/maintenance/architecture.md
@@ -20,13 +20,37 @@ InvenioRDM has a layered architecture that consistent of three layers:
 - Service layer
 - Data access layer
 
-There is a strict data flow between the layers, and each layer has very specific responsibilities. It's highly important that you as a developer know the basic principles for the  data flow and  each layer's responsibilities. Failure to understand the basic data flow, leads to using the wrong objects for the wrong things, which eventually turns into messy code.
+There is a strict data flow between the layers, and each layer has very specific responsibilities. As a developer, you need to know the basic principles of the data flow and each layer's responsibilities. Failing to understand the basic data flow leads to using the wrong objects for the wrong things, which eventually produces messy, unmaintainable code.
+
+**Data flow basics**
+
+The diagram below shows a simplified view of the data flow in the architecture.
+
+![Architecture layers](img/architecture.svg)
+
+*The presentation layer* parses incoming requests and routes them to the service layer. This involves sending and receiving data in multiple different formats and translating these into an internal representation, as well as e.g. parsing arguments from an HTTP request (e.g. parsing the query string parameters).
+
+*The service layer* is completely independent of the presentation layer and can be used by many different presentation interfaces such as REST APIs, CLIs, and Celery tasks. The service layer contains the overall control flow and is responsible for e.g. checking permissions and performing semantic data validation.
+
+*The data access layer* is responsible for ensuring data integrity, harmonizing data access to different storages as well as fetching and storing the data in the underlying systems.
+
+The data flow between the layers is strictly limited to a few well-defined objects to ensure a clean separation of concerns. The presentation layer communicates with the service layer via e.g. a record projection (i.e. a view of a record localised to a specific identity). The service layer communicates with the data access layer via e.g. a record entity that provides data abstraction, syntactic data validation, and a strong programmatic API.
+
+!!! tip "Tip: Where do you belong?"
+
+    A key question you should always ask yourself when designing or writing code is where your code belongs in the architecture:
+
+    - Is it a presentation, service, or data access layer object?
+    - Is the object crossing boundaries between layers?
+
+    Answering where your code belongs helps identify and disentangle responsibilities.
+
 
 ### Data access layer
 
 The data access layer is responsible for:
 
-- Fetching and storing data on primary (the database) and secondary storage (Elasticsearch, cache, ...).
+- Fetching and storing data on primary (the database) and secondary storage (Elasticsearch, cache, files, ...).
 - Harmonizing data access to the same object on primary and secondary storages (e.g. a record in the database vs in the Elasticsearch index).
 - Ensuring data integrity and managing relations among data objects.
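
The data flow between the three layers described above can be illustrated with a minimal sketch. This is not InvenioRDM code: every name here (`RecordEntity`, `DataAccessLayer`, `RecordService`, `http_get_record`) is hypothetical, and the in-memory dictionary merely stands in for a real database.

```python
import json
from dataclasses import dataclass


@dataclass
class RecordEntity:
    """Data access layer object wrapping the stored data (hypothetical)."""

    id: str
    data: dict
    owner: str


class DataAccessLayer:
    """Fetches and stores record entities (in-memory stand-in for a database)."""

    def __init__(self):
        self._storage = {}

    def save(self, record):
        self._storage[record.id] = record

    def get(self, record_id):
        return self._storage[record_id]


class RecordService:
    """Service layer: control flow and permission checks, no HTTP knowledge."""

    def __init__(self, dal):
        self._dal = dal

    def read(self, identity, record_id):
        record = self._dal.get(record_id)
        if identity != record.owner:
            raise PermissionError(f"{identity} may not read {record_id}")
        # Hand back a projection: a view of the record for this identity.
        return {"id": record.id, "metadata": dict(record.data)}


def http_get_record(service, identity, path):
    """Presentation layer: parse the request, call the service, serialize."""
    record_id = path.rsplit("/", 1)[-1]
    projection = service.read(identity, record_id)
    return json.dumps(projection, sort_keys=True)
```

Because `RecordService` knows nothing about HTTP, the same `read` call could equally be made from a CLI command or a background task.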
 
@@ -39,21 +63,65 @@ The data access layer usually lives inside an Invenio module in a package named
 - System fields (``/records/systemfields/``).
 - Dumpers (``/records/dumpers/``).
 
-**Principles**
+**Purpose**
 
-TODO
+The data access layer serves two purposes:
+
+- Provide a strong programmatic API that produces a clean, simple and reliable
+  control flow in the service layer.
+- Persist our business objects on data storage in a reliable and performant
+  way.
+
+!!! tip "Tip: Messy service layer?"
 
-- Data representation
+    If your service layer code looks messy, you likely need to work on your data
+    access layer.
 
-- One primary storage, many secondary storages
+    A typical example is the service layer doing data-wrangling with
+    dictionaries. For instance a conditional get on a dictionary key (e.g.
+    ``data.get('...')``), or having to e.g. convert back and forth between
+    data types (e.g. UUIDs to/from strings).
 
-- Data versioning
+**Guiding principles**
 
-- Denormalize full objects
+The data layer is built around the following guiding principles:
+
+- One data representation: The service layer should work with one and only one
+  data representation of an entity, regardless of whether the entity was
+  retrieved from primary or secondary storage.
+
+- One primary storage, many secondary storages: The primary version of a record
+  exists in one and only one copy on the primary storage (the database),
+  however multiple secondary copies may exist in the search index.
+
+- Idempotence of dumping/loading: Dumping and loading to/from secondary storage
+  (such as the search index) must produce the same record.
+
+- Denormalization over normalization: If we have to choose, we usually prefer
+  fast read speed over fast write speed.
+
+- Data versioning: We version data and rely heavily on optimistic
+  concurrency control for detecting conflicts and determining stale secondary
+  copies.
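
The data versioning principle above relies on optimistic concurrency control. A minimal sketch of that technique follows, assuming a simple integer version counter per record. `VersionedStore` and `StaleVersionError` are hypothetical names used for illustration only and are not part of the InvenioRDM API.

```python
class StaleVersionError(Exception):
    """Raised when an update is based on an outdated version of a record."""


class VersionedStore:
    """In-memory store that rejects writes based on a stale version."""

    def __init__(self):
        self._rows = {}  # record_id -> (version, data)

    def create(self, record_id, data):
        self._rows[record_id] = (1, dict(data))
        return 1

    def read(self, record_id):
        version, data = self._rows[record_id]
        return version, dict(data)

    def update(self, record_id, data, expected_version):
        current_version, _ = self._rows[record_id]
        # No lock is taken up front: the conflict is detected at write time.
        if current_version != expected_version:
            raise StaleVersionError(
                f"expected version {expected_version}, found {current_version}"
            )
        new_version = current_version + 1
        self._rows[record_id] = (new_version, dict(data))
        return new_version
```

The same counter can be compared against the version stored with a secondary copy (for example in a search index) to decide whether that copy is stale.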
 
 **Record API**
 
-TODO
+The record API is the primary programmatic API that the service layer uses to
+work with the data access layer. The record API ensures data integrity and manages
+the life-cycle of the record itself and related objects such as persistent
+identifiers and files.
+
+The record is in charge of:
+
+- defining the structural schema that data is validated against (using
+  JSONSchemas).
+- defining search index routing and indexing behaviour.
+- managing the life-cycle of an associated persistent identifier.
+- data versioning.
+- state management.
+
+A record is usually defined using a declarative API named system fields based
+on Python data descriptors.
 
 **JSONSchemas**
 
@@ -90,11 +158,11 @@ System fields are responsible for:
 - manages relations with other objects
 - hooking into the record life-cycle
 
-System fields basically provides a programmatic API that makes it easier to work with records and related objects. Under the hood, system fields are Python data descriptors.
+System fields basically provide a declarative programmatic API that makes it easier to work with records and related objects. Under the hood, system fields are Python data descriptors.
 
 A key design principle for system fields, is that an *instance* of a system field manages a single namespace of a record so that system fields do not conflict. For instance an access system field manages the top-level ``access`` key in a record ``{'access': ...}``.
 
-System fields participate in the dumping/loading of records from secondary storage via being able to hook into the record life-cycle. The difference between system fields and dumpers, is that a dumpers produce a dump fo a specific secondary storage system, while system fields produce the same dump for all secondary storage systems.
+System fields participate in the dumping/loading of records to and from secondary storage by hooking into the record life-cycle. The difference between system fields and dumpers is that a dumper produces a dump for a specific secondary storage system, while system fields produce the same dump for all secondary storage systems.
 
 System fields may be used to manage relations to other objects, and can work similar to a foreign key.
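
As a rough illustration of the descriptor idea described above, the sketch below shows a Python data descriptor that manages a single top-level key of a dictionary-based record. It is not the actual system fields implementation: `KeyField` and `Record` are hypothetical names, and life-cycle hooks are left out entirely.

```python
class KeyField:
    """Data descriptor managing one top-level key of a record (hypothetical)."""

    def __init__(self, key=None):
        self._key = key

    def __set_name__(self, owner, name):
        # Default to the attribute name so each field owns its own namespace.
        if self._key is None:
            self._key = name

    def __get__(self, record, owner=None):
        if record is None:
            # Accessed on the class rather than on an instance.
            return self
        return record.get(self._key)

    def __set__(self, record, value):
        record[self._key] = value


class Record(dict):
    """A record is a dictionary plus declaratively defined fields."""

    access = KeyField()


record = Record({"metadata": {"title": "Example"}})
record.access = {"record": "public"}
```

After the assignment, the value lives under the top-level ``access`` key of the dictionary, so the attribute and the underlying data cannot drift apart.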
 
@@ -108,7 +176,7 @@ System fields to a large degree avoids building inheritance among record APIs an
 
 **SQLAlchemy models**
 
-SQLAlchemy record models are responsible for storing the master version of a record (i.e. the primary storage). All record models share some few common properties:
+SQLAlchemy record models are responsible for storing the master version of a record (i.e. the primary storage) and provide database independence. All record models share a few common properties:
 
 - A JSON column for storing the JSON-encoded document of a record.
 - An internal UUID identifier.
@@ -121,7 +189,7 @@ It's important to understand that there's two distinct representations of a reco
 - Python dictionary
 - JSON document
 
-These two distinct representations of a record may often be very similar, but it's important to understand that the JSON document is constrained to the JSON object model, while the Python dictionary can hold more rich types as long as they are JSON-serializable.
+These two distinct representations of a record may often be very similar, but it's important to understand that the JSON document is constrained to the JSON object model, while the Python dictionary can hold richer data types as long as they are JSON-serializable (e.g. a datetime object).
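
The difference between the two representations can be shown with the standard library alone. Note that Python's `json` module does not serialize `datetime` objects by default, so this sketch assumes an explicit encoder that writes ISO 8601 strings. The function names and the `created` key are hypothetical.

```python
import json
from datetime import datetime, timezone


def to_json_document(record):
    """Dump a Python dictionary with rich types to a JSON document."""

    def encode(value):
        if isinstance(value, datetime):
            return value.isoformat()
        raise TypeError(f"Cannot serialize {type(value).__name__}")

    return json.dumps(record, default=encode, sort_keys=True)


def from_json_document(document):
    """Load a JSON document and restore the rich Python types."""
    data = json.loads(document)
    data["created"] = datetime.fromisoformat(data["created"])
    return data


record = {
    "title": "Example",
    "created": datetime(2021, 11, 10, 16, 24, 45, tzinfo=timezone.utc),
}
document = to_json_document(record)
restored = from_json_document(document)
```

In the JSON document the timestamp is a plain string, while in the Python dictionary it is a `datetime` object again after loading.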
 
 ### Service Layer
 
@@ -131,15 +199,19 @@ The service layer contains the business logic of the application and is responsi
 - Business-level validation
 - Control flow - e.g. transaction management,
 
-**Principles**
+**Guiding principles**
 
 TODO
 
 - Mimick the end-user interface
 
+- Clean control flow
+
+- Interface independent
+
 - Independent of the Flask request context
 
- -Data flow
+- Data flow
 
 - Components responsible for setting data on a record.
 
@@ -189,4 +261,71 @@ Responsible for providing a specific feature in the service, and make the servic
 
 The presentation layer
 
-## Customizations
+**Guiding principles**
+
+TODO
+
+**Celery tasks**
+
+TODO
+
+**Views**
+
+TODO
+
+**Resources**
+
+TODO
+
+- RESTful routing
+- Dependency injection
+
+**Resources request context**
+
+TODO
+
+- Contains only validated parsed data.
+- HTTP request parsing: body, headers, query string, path
+- Content negotiation
+
+**Resource configs**
+
+TODO
+
+- Dependency injection
+
+## How did we arrive here?
+
+The overarching goal of the architecture is similar to any other software
+system. We want a software system that's easily maintainable, scalable,
+extendable, adaptable, resilient, and *...insert your favorite buzz words...*.
+
+There are many methodologies and patterns for building and architecting
+software systems. In practice, however, while methodologies are useful,
+it's often more about tradeoffs and finding the right balance than about
+strict application of a specific methodology. Most of the time you have to deal
+with deadlines, requirements, design patterns, costs, legacy code, people,
+projects, prior history and practices.
+
+InvenioRDM is no different. The architecture is largely a by-product of our past
+experiences and the challenges we've faced. The architecture as described here is
+not meant to be the final answer, but rather an evolving architecture that adapts
+and improves over time.
+
+TODO
+
+Some of the challenges we faced:
+
+- **High developer turn-over and many juniors**: Onboarding, documentation,
+  boundaries, spaghetti code.
+- **Spaghetti code**: data massaging all over the place, type conversions.
+- **Bad design choices**: moving big files,
+- **Recovering from failures**: massive database crashes, file loss on big
+  distributed storage clusters, and eating our own dog food.
+
+## Why not?
+
+TODO
+
+- Microservices: They are not a substitute for an architecture. They are just another way of tying a system together. Running the system becomes harder, especially for small institutions.
+- NoSQL: SQL databases have been around for the past 40 years, and are highly reliable systems. Most NoSQL systems have not been around for as long.