Update to the format spec based on discussions part 1 (#29)

trinitylake-io · Jan 25, 2025 · 4dc5d2d
1 parent 2cd73a8
Showing 18 changed files with 314 additions and 228 deletions.
diff --git a/core/src/main/java/io/trinitylake/storage/local/LocalOutputStream.java b/core/src/main/java/io/trinitylake/storage/local/LocalOutputStream.java
@@ -34,7 +34,7 @@ public LocalOutputStream(
       LocalStorageOpsProperties localProperties) {
     this.file = file;
     this.tempFile =
-        FileUtil.createTempFile("lcoal-output", commonProperties.writeStagingDirectory());
+        FileUtil.createTempFile("local-", commonProperties.writeStagingDirectory());
     try {
       this.stream = new FileOutputStream(tempFile);
     } catch (FileNotFoundException e) {

diff --git a/docs/format/.pages b/docs/format/.pages
@@ -2,10 +2,11 @@ nav:
     - overview.md
     - versioning.md
     - storage-layout.md
-    - storage-path.md
+    - storage-location.md
+    - storage-transaction.md
     - transaction.md
-    - object-definition-file.md
     - key-encoding.md
+    - object-definition-file.md
     - lakehouse.md
     - namespace.md
     - table

diff --git a/docs/format/key-encoding.md b/docs/format/key-encoding.md
@@ -4,13 +4,6 @@
 
     we use the literal `[space]` to represent the space character (hex value 20) in this document for clarity
 
-## First Byte
-
-The first byte of a key in a TrinityLake tree is used to differentiate user-facing object definitions in Lakehouse 
-vs any other system-internal object definitions such as [Lakehouse](#lakehouse-key).
-User-facing object keys must start with a `[space]`,
-and system-internal object keys must not start with a `[space]`.
-
 ## System Internal Keys
 
 In general, system internal keys do not participate in the TrinityLake tree key sorting algorithm and always stay in 
@@ -26,7 +19,7 @@ The pointer to the previous root node is stored with key `previous_root` in the
 
 ### Rollback Root Node Key
 
-The pointer to the root node that was rolled back from, if the root node is created during a [Rollback](./transaction.md#rollback-committed-version)
+The pointer to the root node that was rolled back from, if the root node was created during a [Rollback](./storage-transaction#rollback-committed-version).
 It is stored with key `rollback_from_root` in the root node.
 
 ### Version Key
@@ -76,12 +69,12 @@ For example, schema ID `4` is encoded to `D===`.
 
 ### Object Key Format
 
-The object key format combines all the [First Byte](#first-byte), [Encoded Object Name](#object-name), 
+The object key format combines the [Encoded Object Name](#object-name), 
 [Encoded Object Definition Schema ID](#encoded-object-definition-schema-id) rules above to form a unique key 
-for each type of object. In more details, it is defined as the following: (contents in `<>` should be substituted)
+for each type of object. See the table below for the format for each type of object: (contents in `<>` should be substituted)
 
 | Object Type | Schema ID | Object ID Format                                               | Example                                               |
 |-------------|-----------|----------------------------------------------------------------|-------------------------------------------------------|
 | Lakehouse   | 0         | N/A, use [Lakehouse Definition Key](#lakehouse-definition-key) |                                                       |
-| Namespace   | 1         | `[space]B===<encoded namespace name>`                          | `[space]B===default[space]`                           |
-| Table       | 2         | `[space]C===<encoded namespace name><encoded table name>`      | `[space]C===default[space]table[space][space][space]` |
+| Namespace   | 1         | `B===<encoded namespace name>`                                 | `B===default[space]`                                  |
+| Table       | 2         | `C===<encoded namespace name><encoded table name>`             | `C===default[space]table[space][space][space]`        |
diff --git a/docs/format/overview.md b/docs/format/overview.md
@@ -13,22 +13,22 @@ Please see [Versioning](./versioning.md) about the versioning semantics of this
 
 ## Introduction
 
-The TrinityLake format defines a Lakehouse-specific [key-value map](tree/search-tree-map.md) 
-implemented using a [B-epsilon tree](tree/b-epsilon-tree.md).
+The TrinityLake format defines a storage-only lakehouse 
+implemented using a modified version of the [B-epsilon tree](tree/b-epsilon-tree.md) [key-value map](tree/search-tree-map.md).
 
-- The keys of this map are IDs of objects in a Lakehouse
+- The keys of this map are IDs of objects in a lakehouse
 - The values of this map are location pointers to the **Object Definitions** 
 
-We denote such tree as the **TrinityLake Tree**, 
-and denote a Lakehouse implemented using the TrinityLake format as a **Trinity Lakehouse**.
+We call such a tree the **TrinityLake Tree**, 
+and call a lakehouse implemented using the TrinityLake format a **Trinity Lakehouse**.
 
 The TrinityLake format contains the following specifications:
 
-- The TrinityLake tree is persisted in storage and follows [Storage Specification](./storage-layout).
-- The TrinityLake tree is assessed and updated following the [Transaction Specification](./transaction.md).
-- The object definitions are persisted in storage and follows the [Object Definition File Specification](./object-definition-file.md).
-- The key names in a TrinityLake tree follow the [Key Encoding Specification](./key-encoding.md).
-- The locations used in a TrinityLake tree follow the [Location Specification](./storage-path).
+- The TrinityLake tree is persisted in storage following the [Storage Layout Specification](./storage-layout).
+- The files in a TrinityLake tree are stored according to the [Storage Location Specification](./storage-location).
+- The TrinityLake tree is accessed and updated following the [Transaction Specification](./storage-transaction).
+- The keys in a TrinityLake tree follow the [Key Encoding Specification](./key-encoding.md).
+- The object definitions in a TrinityLake tree follow the [Object Definition File Specification](./object-definition-file.md).
 
 ## Example
 

diff --git a/docs/format/storage-layout.md b/docs/format/storage-layout.md
@@ -1,48 +1,79 @@
 # Storage Layout
 
-The Lakehouse tree in general follows the storage layout of [N-way search tree map](tree/search-tree-map.md#storage-layout).
-Each node file is in the [Apache Arrow IPC format](https://arrow.apache.org/docs/format/Columnar.html#format-ipc).
+The TrinityLake tree in general follows [the storage layout of an N-way search tree map](./tree/search-tree-map.md).
+In this document, we describe the details of the tree's layout in storage.
+
+## Node File Format
+
+Similar to an [N-way search tree map](./tree/search-tree-map.md), 
+each node of the TrinityLake tree is a [node file](./tree/search-tree-map.md#node-file) in storage. 
+Each file fully describes tabular data using the [Apache Arrow IPC format](https://arrow.apache.org/docs/format/Columnar.html#format-ipc).
 
 ## Node File Schema
 
-| ID | Name  | Arrow Type | Description                                          | Required? | Default |
-|----|-------|------------|------------------------------------------------------|-----------|---------|
-| 1  | key   | String     | Name of the key                                      | no        |         |
-| 2  | value | String     | The value of the key                                 | no        |         |
-| 3  | pnode | String     | File location pointer to the value of the child node | no        |         |
-| 4  | txn   | String     | Transaction ID for [write buffer](./#write-buffer)   | no        |         |
+The node file has the following schema:
+
+| ID | Name  | Arrow Type | Description                                        | Required? | Default |
+|----|-------|------------|----------------------------------------------------|-----------|---------|
+| 1  | key   | String     | Name of the key                                    | no        |         |
+| 2  | value | String     | The value of the key                               | no        |         |
+| 3  | pnode | String     | Pointer to the path to the child node              | no        |         |
+| 4  | txn   | String     | Transaction ID for [write buffer](./#write-buffer) | no        |         |
+
+## Node File Content
+
+Each node file contains 3 sections from top to bottom:
+
+- System internal rows
+- [Node key table](./tree/search-tree-map.md#node-key-table)
+- Write buffer
 
-## System-Internal Rows for Root Node
+They all share the same [node file schema](#node-file-schema) above, but use it in different ways.
 
-[System-internal keys](./key-encoding.md#system-internal-keys) will appear as the top rows in the file.
-There is no specific ordering required for the system-internal rows.
+## System-Internal Rows
 
-## Node Pointers
+Only the `key` and `value` columns in the [node file schema](#node-file-schema) are meaningful to system internal rows,
+and they are required to be non-null.
 
-To read the node pointers, the reader for the tree should skip the keys until it reaches the ones starting with `[space]`.
-There are exactly `N` rows of the file are reserved for the `N` pointers of each node for a tree of order `N`.
+These rows are used for recording system internal information such as node creation time, version, etc.
+See [system-internal keys](./key-encoding.md#system-internal-keys) for more details.
 
-The fist row in these `N` rows must have `key` and `value` as `NULL` because the first pointer points to all
-keys that are smaller than the key of the second row.
+There is no specific ordering expected for the system-internal rows, and more system-internal rows might be added over time.
+Because the first row of the [node key table](./tree/search-tree-map.md#node-key-table) must have a `NULL` key and a `NULL` value,
+readers of a node file are expected to treat all rows before this row as system-internal rows.
 
-A node might not have all `N` child nodes yet. If there are `k <= N` child nodes,
-There will be `N-k` rows with all column values as `NULL`s.
+## Node Key Table
+
+To read the node key table, the reader should skip all system internal rows, 
+which means to skip all rows until it reaches the first row that has a `NULL` key and `NULL` value.
+
+Then, based on the rules of the [node key table](./tree/search-tree-map.md#node-key-table),
+there are exactly `N` rows in the node key table section of the node file, where `N` is the order of the tree.
 
 ## Write Buffer
 
-The write buffer rows start after the node pointer rows.
-These rows must have a `key` that is not `NULL`, and the `pnode` is always `NULL`.
+The [B-epsilon tree](./tree/b-epsilon-tree.md)-like write buffer of the TrinityLake tree starts after
+the [system internal rows](#system-internal-rows) and the [node key table](#node-key-table) rows.
+
+Each row in the write buffer represents a message to be applied to the TrinityLake tree.
+New messages are appended at the bottom of the write buffer.
+These rows have the following requirements:
 
-- When the `value` is `NULL`, it is a message to delete the `key`.
-- When the `value` is not `NULL`, it is a message to set the current `pvalue` of the key in the tree to the new one in the write buffer.
+1. `key` must not be `NULL`
+2. `txn` must not be `NULL`
+3. `pnode` must be `NULL`
+4. If `value` is `NULL`, it is a message to delete the key. If `value` is not `NULL`, it is a message to set the key to the specific value.
 
-New changes are appended to the bottom of the write buffer.
+Note that, unlike a standard [B-epsilon tree](./tree/b-epsilon-tree.md),
+when flushing the write buffer against the TrinityLake tree during a [write](./tree/b-epsilon-tree.md#write) or 
+[compaction](./tree/b-epsilon-tree.md#compaction), the messages in the latest committed transaction will not be flushed,
+because they are used to enforce different levels of isolation guarantees during the Trinity Lakehouse commit phase.
+See [Transaction and ACID Enforcement](./storage-transaction) for more details.
 
 ## Node File Size
 
 Each node is targeted for the same specific size, which is configurable in the [Lakehouse definition](./lakehouse.md).
-
-The estimated size of the `N` rows should be:
+Based on those configurations, users can roughly estimate the size of the node key table as:
 
 ```
 N * (
@@ -58,3 +89,4 @@ This remaining size is used as the write buffer for each node.
 
 For users that would like to fine-tune the performance characteristics of a TrinityLake tree,
 this formula can be used to readjust the node file size to achieve the desired epsilon value.
+
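
The node file rules added in `storage-layout.md` above can be sketched from the reader's side as follows. This is a minimal illustration and not the project's implementation: the `Row` type, the function names, and the in-memory list of rows are all assumptions made for the example (a real reader would load the rows from an Arrow IPC file), and `order` stands for the tree order `N`.

```python
from typing import List, NamedTuple, Optional, Tuple


class Row(NamedTuple):
    """One row of a node file, using the four columns of the node file schema."""
    key: Optional[str]
    value: Optional[str]
    pnode: Optional[str]
    txn: Optional[str]


def split_node_file(rows: List[Row], order: int) -> Tuple[List[Row], List[Row], List[Row]]:
    """Split rows into (system-internal rows, node key table, write buffer)."""
    # System-internal rows always have non-NULL key and value, so the first row
    # with a NULL key and a NULL value marks the start of the node key table.
    start = next(
        (i for i, r in enumerate(rows) if r.key is None and r.value is None),
        None,
    )
    if start is None:
        raise ValueError("no row with NULL key and NULL value found")
    end = start + order  # the node key table has exactly N rows
    if end > len(rows):
        raise ValueError("node key table is shorter than the tree order")
    return rows[:start], rows[start:end], rows[end:]


def decode_message(row: Row) -> Tuple[str, str, Optional[str]]:
    """Interpret one write buffer row as a (operation, key, value) message."""
    if row.key is None or row.txn is None or row.pnode is not None:
        raise ValueError("invalid write buffer row")
    if row.value is None:
        return ("delete", row.key, None)  # NULL value: delete the key
    return ("set", row.key, row.value)    # non-NULL value: set the key
```

Because later rows in the write buffer are newer, a caller would apply the decoded messages in file order.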
diff --git a/docs/format/storage-path.md → docs/format/storage-location.md b/docs/format/storage-path.md → docs/format/storage-location.md
@@ -1,4 +1,15 @@
-# Storage Path
+# Storage Location
+
+In the [Storage Layout](./storage-layout.md) document, we have described how a TrinityLake tree is persisted in storage.
+This document describes the specification for the location of persisted files.
+
+## Terminology
+
+We generally use 3 terms:
+
+- Path: a path 
+- URI: a URI is a fully qualified address with scheme, authority, and path components
+- Location: a location can be either a relative path, or a fully qualified URI
 
 ## Root in Lakehouse Storage
 
@@ -36,22 +47,9 @@ then the location value stored in the TrinityLake format should be `my-table-def
     - [External table location](./table/table-type.md#external)
 
 
-## Non-Root Node File Path
-
-Non-root node file name will be in the form of prefix `node-` plus a version 4 UUID with suffix `.ipc`.
-For example, if a UUID `6fcb514b-b878-4c9d-95b7-8dc3a7ce6fd8` is generated for the node file,
-the original file name of the node file will be `node-6fcb514b-b878-4c9d-95b7-8dc3a7ce6fd8.ipc`,
-and that further goes through the [file name optimization](./storage-path#optimized-file-name)
-to produce the final node file name.
-
-For root node file, please refer to [Transaction Specification](./transaction.md#root-node-file-name)
-for more details since the name is involved as a part of the transaction process.
-
-## Lakehouse Definition File Path
-
-should be `_lakehouse_def_` plus UUID plus `.binpb`
+## Standard File Paths
 
-## Optimized File Path
+### File Path Optimization
 
 A file path in the TrinityLake format is designed for optimized performance in storage.
 Given an **Original File Name**, the **Optimized File Name** in storage can be calculated as the following:
@@ -66,16 +64,44 @@ Given an **Original File Name**, the **Optimized File Name** in storage can be c
 For example, an original file name `my-table-definition.binpb` will be transformed to 
 `0101/0101/0101/10101100-my-table-definition.binpb`.
 
-!!!Note
-
-    Not all the file names will be optimized in this way. Here are the exceptions:
-
-    - [Root node file name](./transaction.md#root-node-file-name)
-    - [Version hint file name](./transaction.md#root-node-latest-version-hint-file)
-    - [Lakehouse definition file name](./transaction.md#
-
-
 !!!Warning
 
     File name optimization is a write side feature, and should not be used by readers to reverse-engineer
     the original file name.
+
+
+### Non-Root Node File Path
+
+Non-root node file name will be in the form of prefix `node-` plus a version 4 UUID with suffix `.ipc`.
+For example, if a UUID `6fcb514b-b878-4c9d-95b7-8dc3a7ce6fd8` is generated for the node file,
+the original file name of the node file will be `node-6fcb514b-b878-4c9d-95b7-8dc3a7ce6fd8.ipc`,
+and that further goes through the [file name optimization](./storage-location#file-path-optimization)
+to produce the final node file name.
+
+### Object Definition File Path
+
+TODO
+
+
+## Non-Standard File Paths
+
+### Root Node File Path
+
+With copy-on-write (CoW), the root node file name is important because every change to the tree creates a new root node file,
+and the root node file name can be used essentially as the version of the tree.
+
+TrinityLake defines that each root node has a numeric version number,
+and the root node is stored in a file named `_<version_number_binary_reversed>.ipc`.
+The file name is persisted in storage as is without [optimization](./storage-location#file-path-optimization).
+For example, the 100th version of the root node file would be stored with name `_00100110000000000000000000000000.ipc`.
+
+### Root Node Latest Version Hint File Path
+
+A file named `_latest_hint.txt` stores a hint to the latest version of the TrinityLake tree root node file.
+The file name is persisted in storage as is without [optimization](./storage-location#file-path-optimization).
+The file contains a number that indicates the presumed latest version of the tree root node, such as `100`.
+
+
+### Lakehouse Definition File Path
+
+The file name should be `_lakehouse_def_` plus a UUID plus `.binpb`.
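
The root node file naming rule added above can be illustrated with a short sketch. The function name is hypothetical, and the fixed 32-bit width is an assumption inferred from the `_00100110000000000000000000000000.ipc` example for version 100; the spec text shown here does not state the width explicitly.

```python
def root_node_file_name(version: int, bits: int = 32) -> str:
    """Build `_<version_number_binary_reversed>.ipc` for a root node version.

    The 32-bit default width is an assumption based on the example in the spec.
    """
    if not 0 <= version < (1 << bits):
        raise ValueError(f"version must fit in {bits} bits")
    # Zero-pad the binary form to a fixed width, then reverse the bit string
    reversed_bits = format(version, f"0{bits}b")[::-1]
    return f"_{reversed_bits}.ipc"
```

Reversing the bits puts the fastest-changing bits first, so consecutive versions differ near the start of the file name rather than at the end.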