Update to the format spec based on discussions part 1 (#29)
jackye1995 authored Jan 25, 2025
1 parent 2cd73a8 commit 4dc5d2d
Showing 18 changed files with 314 additions and 228 deletions.
Original file line number Diff line number Diff line change
@@ -34,7 +34,7 @@ public LocalOutputStream(
LocalStorageOpsProperties localProperties) {
this.file = file;
this.tempFile =
FileUtil.createTempFile("lcoal-output", commonProperties.writeStagingDirectory());
FileUtil.createTempFile("local-", commonProperties.writeStagingDirectory());
try {
this.stream = new FileOutputStream(tempFile);
} catch (FileNotFoundException e) {
5 changes: 3 additions & 2 deletions docs/format/.pages
@@ -2,10 +2,11 @@ nav:
- overview.md
- versioning.md
- storage-layout.md
- storage-path.md
- storage-location.md
- storage-transaction.md
- transaction.md
- object-definition-file.md
- key-encoding.md
- object-definition-file.md
- lakehouse.md
- namespace.md
- table
17 changes: 5 additions & 12 deletions docs/format/key-encoding.md
@@ -4,13 +4,6 @@

we use the literal `[space]` to represent the space character (hex value 20) in this document for clarity

## First Byte

The first byte of a key in a TrinityLake tree is used to differentiate user-facing object definitions in Lakehouse
vs any other system-internal object definitions such as [Lakehouse](#lakehouse-key).
User-facing object keys must start with a `[space]`,
and system-internal object keys must not start with a `[space]`.

## System Internal Keys

In general, system internal keys do not participate in the TrinityLake tree key sorting algorithm and always stay in
@@ -26,7 +19,7 @@ The pointer to the previous root node is stored with key `previous_root` in the

### Rollback Root Node Key

The pointer to the root node that was rolled back from, if the root node is created during a [Rollback](./transaction.md#rollback-committed-version)
The pointer to the root node that was rolled back from, if the root node is created during a [Rollback](./storage-transaction#rollback-committed-version)
It is stored with key `rollback_from_root` in the root node.

### Version Key
@@ -76,12 +69,12 @@ For example, schema ID `4` is encoded to `D===`.

### Object Key Format

The object key format combines all the [First Byte](#first-byte), [Encoded Object Name](#object-name),
The object key format combines the [Encoded Object Name](#object-name),
[Encoded Object Definition Schema ID](#encoded-object-definition-schema-id) rules above to form a unique key
for each type of object. In more details, it is defined as the following: (contents in `<>` should be substituted)
for each type of object. See the table below for the format for each type of object: (contents in `<>` should be substituted)

| Object Type | Schema ID | Object ID Format | Example |
|-------------|-----------|----------------------------------------------------------------|-------------------------------------------------------|
| Lakehouse | 0 | N/A, use [Lakehouse Definition Key](#lakehouse-definition-key) | |
| Namespace | 1 | `[space]B===<encoded namespace name>` | `[space]B===default[space]` |
| Table | 2 | `[space]C===<encoded namespace name><encoded table name>` | `[space]C===default[space]table[space][space][space]` |
| Namespace | 1 | `[space]B===<encoded namespace name>` | `B===default[space]` |
| Table | 2 | `[space]C===<encoded namespace name><encoded table name>` | `C===default[space]table[space][space][space]` |
20 changes: 10 additions & 10 deletions docs/format/overview.md
@@ -13,22 +13,22 @@ Please see [Versioning](./versioning.md) about the versioning semantics of this

## Introduction

The TrinityLake format defines a Lakehouse-specific [key-value map](tree/search-tree-map.md)
implemented using a [B-epsilon tree](tree/b-epsilon-tree.md).
The TrinityLake format defines a storage-only lakehouse
implemented using a modified version of the [B-epsilon tree](tree/b-epsilon-tree.md) [key-value map](tree/search-tree-map.md).

- The keys of this map are IDs of objects in a Lakehouse
- The keys of this map are IDs of objects in a lakehouse
- The values of this map are location pointers to the **Object Definitions**

We denote such tree as the **TrinityLake Tree**,
and denote a Lakehouse implemented using the TrinityLake format as a **Trinity Lakehouse**.
We call such a tree the **TrinityLake Tree**,
and call a lakehouse implemented using the TrinityLake format a **Trinity Lakehouse**.

The TrinityLake format contains the following specifications:

- The TrinityLake tree is persisted in storage and follows [Storage Specification](./storage-layout).
- The TrinityLake tree is assessed and updated following the [Transaction Specification](./transaction.md).
- The object definitions are persisted in storage and follows the [Object Definition File Specification](./object-definition-file.md).
- The key names in a TrinityLake tree follow the [Key Encoding Specification](./key-encoding.md).
- The locations used in a TrinityLake tree follow the [Location Specification](./storage-path).
- The TrinityLake tree is persisted in storage following [Storage Layout Specification](./storage-layout).
- The files in a TrinityLake tree are stored according to the [Storage Location Specification](./storage-location).
- The TrinityLake tree is accessed and updated following the [Transaction Specification](./storage-transaction).
- The keys in a TrinityLake tree follow the [Key Encoding Specification](./key-encoding.md).
- The object definitions in a TrinityLake tree follow the [Object Definition File Specification](./object-definition-file.md).

## Example

82 changes: 57 additions & 25 deletions docs/format/storage-layout.md
@@ -1,48 +1,79 @@
# Storage Layout

The Lakehouse tree in general follows the storage layout of [N-way search tree map](tree/search-tree-map.md#storage-layout).
Each node file is in the [Apache Arrow IPC format](https://arrow.apache.org/docs/format/Columnar.html#format-ipc).
The TrinityLake tree in general follows [the storage layout of N-way search tree map](./tree/search-tree-map.md).
In this document, we describe the details of the tree's layout in storage.

## Node File Format

Similar to a [N-way search tree map](./tree/search-tree-map.md),
each node of the TrinityLake tree is a [node file](./tree/search-tree-map.md#node-file) in storage.
Each file fully describes tabular data using the [Apache Arrow IPC format](https://arrow.apache.org/docs/format/Columnar.html#format-ipc).

## Node File Schema

| ID | Name | Arrow Type | Description | Required? | Default |
|----|-------|------------|------------------------------------------------------|-----------|---------|
| 1 | key | String | Name of the key | no | |
| 2 | value | String | The value of the key | no | |
| 3 | pnode | String | File location pointer to the value of the child node | no | |
| 4 | txn | String | Transaction ID for [write buffer](./#write-buffer) | no | |
The node file has the following schema:

| ID | Name | Arrow Type | Description | Required? | Default |
|----|-------|------------|----------------------------------------------------|-----------|---------|
| 1 | key | String | Name of the key | no | |
| 2 | value | String | The value of the key | no | |
| 3 | pnode | String | Pointer to the path to the child node | no | |
| 4 | txn | String | Transaction ID for [write buffer](./#write-buffer) | no | |

## Node File Content

Each node file contains 3 sections from top to bottom:

- System internal rows
- [Node key table](./tree/search-tree-map.md#node-key-table)
- Write buffer

## System-Internal Rows for Root Node
They all share the same [node file schema](#node-file-schema) above, but use it in different ways.

[System-internal keys](./key-encoding.md#system-internal-keys) will appear as the top rows in the file.
There is no specific ordering required for the system-internal rows.
## System-Internal Rows

## Node Pointers
Only the `key` and `value` columns in the [node file schema](#node-file-schema) are meaningful to system internal rows,
and they are required to be non-null.

To read the node pointers, the reader for the tree should skip the keys until it reaches the ones starting with `[space]`.
There are exactly `N` rows of the file are reserved for the `N` pointers of each node for a tree of order `N`.
These rows are used for recording system internal information such as node creation time, version, etc.
See [system-internal keys](./key-encoding.md#system-internal-keys) for more details.

The fist row in these `N` rows must have `key` and `value` as `NULL` because the first pointer points to all
keys that are smaller than the key of the second row.
There is no specific ordering expected for the system-internal rows, and there might be more system internal rows added over time.
Because the first row of the [node key table](./tree/search-tree-map.md#node-key-table) must have a `NULL` key and a `NULL` value,
readers of a node file are expected to treat all rows before this row as system internal rows.

A node might not have all `N` child nodes yet. If there are `k <= N` child nodes,
There will be `N-k` rows with all column values as `NULL`s.
## Node Key Table

To read the node key table, the reader should skip all system internal rows,
which means to skip all rows until it reaches the first row that has a `NULL` key and `NULL` value.

Then, based on the rules of the [node key table](./tree/search-tree-map.md#node-key-table),
there are exactly `N` rows in the node key table section of the node file.
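
The reading rules above can be sketched in Java as follows. This is a minimal illustration, not the project's reader: `NodeRow`, `NodeFileSections`, and `split` are hypothetical names, the rows are assumed to be already decoded from the Arrow IPC file into memory, and `order` stands for the `N` of the tree.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory view of one node file row; any column may be null. */
final class NodeRow {
  final String key;
  final String value;
  final String pnode;
  final String txn;

  NodeRow(String key, String value, String pnode, String txn) {
    this.key = key;
    this.value = value;
    this.pnode = pnode;
    this.txn = txn;
  }
}

/** Hypothetical holder for the three sections of a node file. */
final class NodeFileSections {
  final List<NodeRow> systemRows = new ArrayList<>();
  final List<NodeRow> nodeKeyTable = new ArrayList<>();
  final List<NodeRow> writeBuffer = new ArrayList<>();

  /** Splits rows (in file order) into the three sections for a tree of the given order. */
  static NodeFileSections split(List<NodeRow> rows, int order) {
    NodeFileSections sections = new NodeFileSections();
    int i = 0;
    // System internal rows: every row before the first row with NULL key and NULL value.
    while (i < rows.size() && !isNullKeyAndValue(rows.get(i))) {
      sections.systemRows.add(rows.get(i));
      i++;
    }
    // Node key table: the next `order` rows, starting with the NULL-key row.
    int tableEnd = Math.min(i + order, rows.size());
    while (i < tableEnd) {
      sections.nodeKeyTable.add(rows.get(i));
      i++;
    }
    // Write buffer: all remaining rows.
    while (i < rows.size()) {
      sections.writeBuffer.add(rows.get(i));
      i++;
    }
    return sections;
  }

  private static boolean isNullKeyAndValue(NodeRow row) {
    return row.key == null && row.value == null;
  }
}
```

The sketch relies only on the `NULL` key and `NULL` value marker to find the start of the node key table, so it keeps working if more system internal rows are added over time.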

## Write Buffer

The write buffer rows start after the node pointer rows.
These rows must have a `key` that is not `NULL`, and the `pnode` is always `NULL`.
The [B-epsilon tree](./tree/b-epsilon-tree.md)-like write buffer of the TrinityLake tree starts after
the [system internal rows](#system-internal-rows) and the [node key table](#node-key-table) rows.

Each row in the write buffer represents a message to be applied to the TrinityLake tree.
New messages are appended at the bottom of the write buffer.
These rows have the following requirements:

- When the `value` is `NULL`, it is a message to delete the `key`.
- When the `value` is not `NULL`, it is a message to set the current `pvalue` of the key in the tree to the new one in the write buffer.
1. `key` must not be `NULL`
2. `txn` must not be `NULL`
3. `pnode` must be `NULL`
4. If `value` is `NULL`, it is a message to delete the key. If `value` is not `NULL`, it is a message to set the key to the specific value.

New changes are appended to the bottom of the write buffer.
Note that, unlike a standard [B-epsilon tree](./tree/b-epsilon-tree.md),
when flushing the write buffer of the TrinityLake tree during a [write](./tree/b-epsilon-tree.md#write) or
[compaction](./tree/b-epsilon-tree.md#compaction), the messages in the latest committed transaction are not flushed,
because they are used to enforce different levels of isolation guarantees during the Trinity Lakehouse commit phase.
See [Transaction and ACID Enforcement](./storage-transaction) for more details.
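
To illustrate the message semantics, here is a minimal Java sketch that replays write buffer rows against a plain key-value map. `WriteBufferMessage` and `WriteBufferReplay` are hypothetical names, and the sketch assumes messages are replayed top to bottom so that a later message for the same key wins. It does not model flushing, transactions, or the rule about retaining the latest committed transaction.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical write buffer row; pnode is always NULL there, so it is not modeled. */
final class WriteBufferMessage {
  final String key;
  final String value;
  final String txn;

  WriteBufferMessage(String key, String value, String txn) {
    this.key = key;
    this.value = value;
    this.txn = txn;
  }
}

final class WriteBufferReplay {
  /** Returns a copy of current with all messages applied in order. */
  static Map<String, String> apply(Map<String, String> current, List<WriteBufferMessage> messages) {
    Map<String, String> result = new TreeMap<>(current);
    for (WriteBufferMessage message : messages) {
      // The key and the transaction ID are required for every write buffer row.
      if (message.key == null || message.txn == null) {
        throw new IllegalArgumentException("write buffer row needs a key and a transaction ID");
      }
      if (message.value == null) {
        result.remove(message.key); // NULL value: delete the key
      } else {
        result.put(message.key, message.value); // non-NULL value: set the key
      }
    }
    return result;
  }
}
```

For example, replaying a delete of `a` followed by two sets of `c` leaves `a` absent and `c` at the value of the last message.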

## Node File Size

Each node is targeted for the same specific size, which is configurable in the [Lakehouse definition](./lakehouse.md).

The estimated size of the `N` rows should be:
Based on those configurations, users can roughly estimate the size of the node key table as:

```
N * (
@@ -58,3 +89,4 @@ This remaining size is used as the write buffer for each node.

For users that would like to fine-tune the performance characteristics of a TrinityLake tree,
this formula can be used to readjust the node file size to achieve the desired epsilon value.

76 changes: 51 additions & 25 deletions docs/format/storage-path.md → docs/format/storage-location.md
@@ -1,4 +1,15 @@
# Storage Path
# Storage Location

In the [Storage Layout](./storage-layout.md) document, we have described how a TrinityLake tree is persisted in storage.
This document describes the specification for the location of persisted files.

## Terminology

We generally use three terms:

- Path: a path
- URI: a URI is a fully qualified address with scheme, authority, and path components
- Location: a location can be either a relative path, or a fully qualified URI
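
As a rough illustration of the distinction, the Java sketch below turns a location into a fully qualified address. `LocationResolver` and the example root are hypothetical, and the sketch assumes that a relative location is resolved against the lakehouse storage root, which should be checked against the rest of this specification.

```java
import java.net.URI;

final class LocationResolver {
  /** Returns the location unchanged if it is a URI, else joins it to the root. */
  static String resolve(String root, String location) {
    // A location that carries a scheme is treated as a fully qualified URI.
    if (URI.create(location).getScheme() != null) {
      return location;
    }
    // Otherwise it is a relative path; join it to the root with exactly one slash.
    String base = root.endsWith("/") ? root : root + "/";
    String relative = location.startsWith("/") ? location.substring(1) : location;
    return base + relative;
  }
}
```

With the assumed root `s3://my-bucket/my-lakehouse`, the relative location `my-table-definition.binpb` resolves under that root, while a location such as `s3://other-bucket/data` is returned as is.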

## Root in Lakehouse Storage

@@ -36,22 +47,9 @@ then the location value stored in the TrinityLake format should be `my-table-def
- [External table location](./table/table-type.md#external)


## Non-Root Node File Path

Non-root node file name will be in the form of prefix `node-` plus a version 4 UUID with suffix `.ipc`.
For example, if a UUID `6fcb514b-b878-4c9d-95b7-8dc3a7ce6fd8` is generated for the node file,
the original file name of the node file will be `node-6fcb514b-b878-4c9d-95b7-8dc3a7ce6fd8.ipc`,
and that further goes through the [file name optimization](./storage-path#optimized-file-name)
to produce the final node file name.

For root node file, please refer to [Transaction Specification](./transaction.md#root-node-file-name)
for more details since the name is involved as a part of the transaction process.

## Lakehouse Definition File Path

should be `_lakehouse_def_` plus UUID plus `.binpb`
## Standard File Paths

## Optimized File Path
### File Path Optimization

A file path in the TrinityLake format is designed for optimized performance in storage.
Given an **Original File Name**, the **Optimized File Name** in storage can be calculated as the following:
@@ -66,16 +64,44 @@ Given an **Original File Name**, the **Optimized File Name** in storage can be c
For example, an original file name `my-table-definition.binpb` will be transformed to
`0101/0101/0101/10101100-my-table-definition.binpb`.

!!!Note

Not all the file names will be optimized in this way. Here are the exceptions:

- [Root node file name](./transaction.md#root-node-file-name)
- [Version hint file name](./transaction.md#root-node-latest-version-hint-file)
- [Lakehouse definition file name](./transaction.md#


!!!Warning

File name optimization is a write side feature, and should not be used by readers to reverse-engineer
the original file name.


### Non-Root Node File Path

Non-root node file name will be in the form of prefix `node-` plus a version 4 UUID with suffix `.ipc`.
For example, if a UUID `6fcb514b-b878-4c9d-95b7-8dc3a7ce6fd8` is generated for the node file,
the original file name of the node file will be `node-6fcb514b-b878-4c9d-95b7-8dc3a7ce6fd8.ipc`,
and that further goes through the [file name optimization](./storage-location#optimized-file-name)
to produce the final node file name.
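
A minimal Java sketch of the original (pre-optimization) node file name is shown below. `NodeFileNames` is a hypothetical helper. `UUID.randomUUID()` from the JDK produces a version 4 UUID, and the file name optimization step is deliberately left out.

```java
import java.util.UUID;

final class NodeFileNames {
  /** Builds the original node file name for a given UUID. */
  static String originalNodeFileName(UUID uuid) {
    return "node-" + uuid + ".ipc";
  }

  /** Builds the original node file name with a freshly generated version 4 UUID. */
  static String newOriginalNodeFileName() {
    return originalNodeFileName(UUID.randomUUID());
  }
}
```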

### Object Definition File Path

TODO


## Non-Standard File Paths

### Root Node File Path

With CoW, the root node file name is important because every change to the tree would create a new root node file,
and the root node file name can be used essentially as the version of the tree.

TrinityLake defines that each root node has a numeric version number,
and the root node is stored in a file named `_<version_number_binary_reversed>.ipc`.
The file name is persisted in storage as is without [optimization](./storage-location#optimized-file-name).
For example, the 100th version of the root node file would be stored with name `_00100110000000000000000000000000.ipc`.
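
The naming rule can be sketched in Java as follows. `RootNodeFileNames` is a hypothetical helper, and the 32-bit width is an assumption inferred from the length of the example above rather than something stated explicitly here.

```java
final class RootNodeFileNames {
  /** Assumed width of the binary version number, inferred from the example. */
  private static final int BITS = 32;

  /** Builds the root node file name for a non-negative version number. */
  static String rootNodeFileName(int version) {
    if (version < 0) {
      throw new IllegalArgumentException("version must be non-negative");
    }
    // Left-pad the binary form with zeros to the assumed width, then reverse it.
    String binary = Integer.toBinaryString(version);
    String padded = "0".repeat(BITS - binary.length()) + binary;
    String reversed = new StringBuilder(padded).reverse().toString();
    return "_" + reversed + ".ipc";
  }
}
```

Under these assumptions, `rootNodeFileName(100)` pads `1100100` to 32 bits and reverses it, which reproduces the file name in the example.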

### Root Node Latest Version Hint File Path

A file with name `_latest_hint.txt` is stored and marks the hint to the latest version of the TrinityLake tree root node file.
The file name is persisted in storage as is without [optimization](./storage-location#optimized-file-name).
The file contains a number that marks the presumably latest version of the tree root node, such as `100`.


### Lakehouse Definition File Path

The Lakehouse definition file name should be `_lakehouse_def_` plus a UUID plus the suffix `.binpb`.