From 92ecd5a9cebd38b1d04b8fd262470f384b59c059 Mon Sep 17 00:00:00 2001 From: draco Date: Sat, 28 Sep 2024 21:04:45 +0800 Subject: [PATCH 1/4] docs: add documentation about the local disk based WAL --- content/cn/docs/design/wal_on_disk.md | 142 +++++++++++++++++++++++++ content/en/docs/design/wal_on_disk.md | 144 ++++++++++++++++++++++++++ 2 files changed, 286 insertions(+) create mode 100644 content/cn/docs/design/wal_on_disk.md create mode 100644 content/en/docs/design/wal_on_disk.md diff --git a/content/cn/docs/design/wal_on_disk.md b/content/cn/docs/design/wal_on_disk.md new file mode 100644 index 00000000..ac15e5ab --- /dev/null +++ b/content/cn/docs/design/wal_on_disk.md @@ -0,0 +1,142 @@ +--- +title: "基于本地磁盘的 WAL" +--- + +## 架构 + +本节将介绍基于本地磁盘的单机版 WAL(Write-Ahead Log,以下简称日志)的实现。在此实现中,日志按 region 级别进行管理。 + +``` + ┌────────────────────────────┐ + │ HoraeDB │ + │ │ + │ ┌────────────────────────┐ │ + │ │ WAL │ │ ┌────────────────────────┐ + │ │ │ │ │ │ + │ │ ...... │ │ │ File System │ + │ │ │ │ │ │ + │ │ ┌────────────────────┐ │ │ manage │ ┌────────────────────┐ │ + Write ─────┼─┼─► Region ├─┼─┼─────────┼─► Region Dir │ │ + │ │ │ │ │ │ │ │ │ │ + Read ─────┼─┼─► ┌────────────┐ │ │ │ mmap │ │ ┌────────────────┐ │ │ + │ │ │ │ Segment 0 ├───┼─┼─┼─────────┼─┼─► Segment File 0 │ │ │ + │ │ │ └────────────┘ │ │ │ │ │ └────────────────┘ │ │ +Delete ─────┼─┼─► ┌────────────┐ │ │ │ mmap │ │ ┌────────────────┐ │ │ + │ │ │ │ Segment 1 ├───┼─┼─┼─────────┼─┼─► SegmenteFile 1 │ │ │ + │ │ │ └────────────┘ │ │ │ │ │ └────────────────┘ │ │ + │ │ │ ┌────────────┐ │ │ │ mmap │ │ ┌────────────────┐ │ │ + │ │ │ │ Segment 2 ├───┼─┼─┼─────────┼─┼─► SegmenteFile 2 │ │ │ + │ │ │ └────────────┘ │ │ │ │ │ └────────────────┘ │ │ + │ │ │ ...... │ │ │ │ │ ...... │ │ + │ │ └────────────────────┘ │ │ │ └────────────────────┘ │ + │ │ ...... │ │ │ ...... 
│ + │ └────────────────────────┘ │ └────────────────────────┘ + └────────────────────────────┘ +``` + +## 数据模型 + +### 文件路径 + +每个 region 都拥有一个目录,用于管理该 region 的所有 segment。目录名为 region 的 ID。每个 segment 的命名方式为 `segment_.wal`,ID 从 0 开始递增。 + +### Segment 的格式 + +一个 region 中所有表的日志都存储在 segments 中,并按照 sequence number 从小到大排列。segment 文件的结构如下: + +``` + Segment0 Segment1 +┌────────────┐ ┌────────────┐ +│ Magic Num │ │ Magic Num │ +├────────────┤ ├────────────┤ +│ Record │ │ Record │ +├────────────┤ ├────────────┤ +│ Record │ │ Record │ +├────────────┤ ├────────────┤ .... +│ Record │ │ Record │ +├────────────┤ ├────────────┤ +│ ... │ │ ... │ +│ │ │ │ +└────────────┘ └────────────┘ + segment_0.wal segment_1.wal +``` + +在内存中,每个 segment 还会存储一些额外的信息以供读写和删除操作使用: + +``` +pub struct Segment { + /// A hashmap storing both min and max sequence numbers of records within + /// this segment for each `TableId`. + table_ranges: HashMap, + + /// An optional vector of positions within the segment. + record_position: Vec, + + ... +} +``` + +### 日志格式 + +segment 中的日志格式如下: + +``` ++---------+--------+--------+------------+--------------+--------------+-------+ +| version | crc | length | table id | sequence num | value length | value | +| (u8) | (u32) | (u32) | (u64) | (u64) | (u32) | | ++---------+--------+--------+------------+--------------+--------------+-------+ +``` + +字段说明: + +1. `version`:日志版本号。 + +2. `crc`:用于确保数据一致性。计算从 table id 到该记录结束的 CRC 校验值。 + +3. `length`:从 table id 到该记录结束的字节数。 + +4. `table id`:表的唯一标识符。 + +5. `sequence num`:记录的序列号。 + +6. `value length`:value 的字节长度。 + +7. `value`:通用日志格式中的值。 + +日志中不存储 region ID,因为可以通过文件路径获取该信息。 + +## 主要流程 + +### 打开 Wal + +1. 识别 Wal 目录下的所有 region 目录。 + +2. 在每个 region 目录下,识别所有 segment 文件。 + +3. 打开每个 segment 文件,遍历其中的所有日志,记录其中每个日志开始和结束的偏移量和每个 `TableId` 在该 segment 中的最小和最大序列号,然后关闭文件。 + +4. 如果不存在 region 目录或目录下没有任何 segment 文件,则自动创建相应的目录和文件。 + +### 读日志 + +1. 根据 segment 的元数据,确定本次读取操作涉及的所有 segment。 +2. 按照 id 从小到大的顺序,依次打开这些 segment,将原始字节解码为日志。 + +### 写日志 + +1. 
将待写入的日志序列化为字节数据,追加到 id 最大的 segment 文件中。 +2. 每个 segment 创建时预分配固定大小的 64MB,不会动态改变。当预分配的空间用完后,创建一个新的 segment,并切换到新的 segment 继续追加。 + +3. 每次追加后不会立即调用 flush;默认情况下,每写入十次或在 segment 文件关闭时才执行 flush。 + +4. 在内存中更新 segment 的元数据 `table_ranges`。 + +### 删除日志 + +假设需要将 id 为 `table_id` 的表中,序列号小于 seq_num 的日志标记为删除: + +1. 在内存中更新相关 segment 的 `table_ranges` 字段,将该表的最小序列号更新为 seq_num + 1。 + +2. 如果修改后,该表在此 segment 中的最小序列号大于最大序列号,则从 `table_ranges` 中删除该表。 + +3. 如果一个 segment 的 `table_ranges` 为空,且不是 id 最大的 segment,则删除该 segment 文件。 \ No newline at end of file diff --git a/content/en/docs/design/wal_on_disk.md b/content/en/docs/design/wal_on_disk.md new file mode 100644 index 00000000..319bad86 --- /dev/null +++ b/content/en/docs/design/wal_on_disk.md @@ -0,0 +1,144 @@ +--- +title: "WAL on Disk" +--- + +## Architecture + +This section introduces the implementation of a standalone Write-Ahead Log (WAL, hereinafter referred to as "the log") based on a local disk. In this implementation, the log is managed at the region level. + +``` + ┌────────────────────────────┐ + │ HoraeDB │ + │ │ + │ ┌────────────────────────┐ │ + │ │ WAL │ │ ┌────────────────────────┐ + │ │ │ │ │ │ + │ │ ...... │ │ │ File System │ + │ │ │ │ │ │ + │ │ ┌────────────────────┐ │ │ manage │ ┌────────────────────┐ │ + Write ─────┼─┼─► Region ├─┼─┼─────────┼─► Region Dir │ │ + │ │ │ │ │ │ │ │ │ │ + Read ─────┼─┼─► ┌────────────┐ │ │ │ mmap │ │ ┌────────────────┐ │ │ + │ │ │ │ Segment 0 ├───┼─┼─┼─────────┼─┼─► Segment File 0 │ │ │ + │ │ │ └────────────┘ │ │ │ │ │ └────────────────┘ │ │ +Delete ─────┼─┼─► ┌────────────┐ │ │ │ mmap │ │ ┌────────────────┐ │ │ + │ │ │ │ Segment 1 ├───┼─┼─┼─────────┼─┼─► Segment File 1 │ │ │ + │ │ │ └────────────┘ │ │ │ │ │ └────────────────┘ │ │ + │ │ │ ┌────────────┐ │ │ │ mmap │ │ ┌────────────────┐ │ │ + │ │ │ │ Segment 2 ├───┼─┼─┼─────────┼─┼─► Segment File 2 │ │ │ + │ │ │ └────────────┘ │ │ │ │ │ └────────────────┘ │ │ + │ │ │ ...... │ │ │ │ │ ...... 
│ │ + │ │ └────────────────────┘ │ │ │ └────────────────────┘ │ + │ │ ...... │ │ │ ...... │ + │ └────────────────────────┘ │ └────────────────────────┘ + └────────────────────────────┘ +``` + +## Data Model + +### File Paths + +Each region has its own directory to manage all segments for that region. The directory is named after the region's ID. Each segment is named using the format `segment_.wal`, with IDs starting from 0 and incrementing. + +### Segment Format + +Logs for all tables within a region are stored in segments, arranged in ascending order of sequence numbers. The structure of the segment files is as follows: + +``` + Segment0 Segment1 +┌────────────┐ ┌────────────┐ +│ Magic Num │ │ Magic Num │ +├────────────┤ ├────────────┤ +│ Record │ │ Record │ +├────────────┤ ├────────────┤ +│ Record │ │ Record │ +├────────────┤ ├────────────┤ .... +│ Record │ │ Record │ +├────────────┤ ├────────────┤ +│ ... │ │ ... │ +│ │ │ │ +└────────────┘ └────────────┘ + segment_0.wal segment_1.wal +``` + +In memory, each segment stores additional information used for read, write, and delete operations: + +```rust +pub struct Segment { + /// A hashmap storing both min and max sequence numbers of records within + /// this segment for each `TableId`. + table_ranges: HashMap, + + /// An optional vector of positions within the segment. + record_position: Vec, + + ... +} +``` + +### Log Format + +The log format within a segment is as follows: + +``` ++---------+--------+--------+------------+--------------+--------------+-------+ +| version | crc | length | table id | sequence num | value length | value | +| (u8) | (u32) | (u32) | (u64) | (u64) | (u32) | | ++---------+--------+--------+------------+--------------+--------------+-------+ +``` + +Field Descriptions: + +1. `version`: Log version number. + +2. `crc`: Used to ensure data consistency. Computes the CRC checksum from the table id to the end of the record. + +3. 
`length`: The number of bytes from the table id to the end of the record. + +4. `table id`: The unique identifier of the table. + +5. `sequence num`: The sequence number of the record. + +6. `value length`: The byte length of the value. + +7. `value`: The value in the general log format. + +The region ID is not stored in the log because it can be obtained from the file path. + +## Main Processes + +### Opening the WAL + +1. Identify all region directories under the WAL directory. + +2. In each region directory, identify all segment files. + +3. Open each segment file, traverse all logs within it, record the start and end offsets of each log, and record the minimum and maximum sequence numbers of each `TableId` in the segment, then close the file. + +4. If there is no region directory or there are no segment files under the directory, automatically create the corresponding directory and files. + +### Reading Logs + +1. Based on the metadata of the segments, determine all segments involved in the current read operation. + +2. Open these segments in order of their IDs from smallest to largest, and decode the raw bytes into logs. + +### Writing Logs + +1. Serialize the logs to be written into byte data and append them to the segment file with the largest ID. + +2. When a segment is created, it pre-allocates a fixed size of 64MB and will not change dynamically. When the pre-allocated space is used up, a new segment is created, and appending continues in the new segment. + +3. After each append, `flush` is not called immediately; by default, `flush` is performed every ten writes or when the segment file is closed. + +4. Update the segment's metadata `table_ranges` in memory. + +### Deleting Logs + +Suppose logs in the table with ID `table_id` and sequence numbers less than `seq_num` need to be marked as deleted: + +1. Update the `table_ranges` field of the relevant segments in memory, updating the minimum sequence number of the table to `seq_num + 1`. + +2. 
If after modification, the minimum sequence number of the table in this segment is greater than the maximum sequence number, remove the table from `table_ranges`. + +3. If a segment's `table_ranges` is empty and it is not the segment with the largest ID, delete the segment file. \ No newline at end of file From fe589b7cd48bd5d5fa23ec8b4481b2bdc312fc8c Mon Sep 17 00:00:00 2001 From: draco Date: Fri, 18 Oct 2024 14:44:57 +0800 Subject: [PATCH 2/4] update --- content/cn/docs/design/wal_on_disk.md | 48 +++++++++++++-------------- content/en/docs/design/wal_on_disk.md | 48 +++++++++++++-------------- 2 files changed, 48 insertions(+), 48 deletions(-) diff --git a/content/cn/docs/design/wal_on_disk.md b/content/cn/docs/design/wal_on_disk.md index ac15e5ab..2e439cf0 100644 --- a/content/cn/docs/design/wal_on_disk.md +++ b/content/cn/docs/design/wal_on_disk.md @@ -7,10 +7,10 @@ title: "基于本地磁盘的 WAL" 本节将介绍基于本地磁盘的单机版 WAL(Write-Ahead Log,以下简称日志)的实现。在此实现中,日志按 region 级别进行管理。 ``` - ┌────────────────────────────┐ - │ HoraeDB │ - │ │ - │ ┌────────────────────────┐ │ + ┌────────────────────────────┐ + │ HoraeDB │ + │ │ + │ ┌────────────────────────┐ │ │ │ WAL │ │ ┌────────────────────────┐ │ │ │ │ │ │ │ │ ...... │ │ │ File System │ @@ -31,7 +31,7 @@ Delete ─────┼─┼─► ┌──────────── │ │ └────────────────────┘ │ │ │ └────────────────────┘ │ │ │ ...... │ │ │ ...... 
│ │ └────────────────────────┘ │ └────────────────────────┘ - └────────────────────────────┘ + └────────────────────────────┘ ``` ## 数据模型 @@ -45,20 +45,20 @@ Delete ─────┼─┼─► ┌──────────── 一个 region 中所有表的日志都存储在 segments 中,并按照 sequence number 从小到大排列。segment 文件的结构如下: ``` - Segment0 Segment1 -┌────────────┐ ┌────────────┐ -│ Magic Num │ │ Magic Num │ -├────────────┤ ├────────────┤ -│ Record │ │ Record │ -├────────────┤ ├────────────┤ -│ Record │ │ Record │ + Segment0 Segment1 +┌────────────┐ ┌────────────┐ +│ Magic Num │ │ Magic Num │ +├────────────┤ ├────────────┤ +│ Record │ │ Record │ +├────────────┤ ├────────────┤ +│ Record │ │ Record │ ├────────────┤ ├────────────┤ .... -│ Record │ │ Record │ -├────────────┤ ├────────────┤ -│ ... │ │ ... │ -│ │ │ │ -└────────────┘ └────────────┘ - segment_0.wal segment_1.wal +│ Record │ │ Record │ +├────────────┤ ├────────────┤ +│ ... │ │ ... │ +│ │ │ │ +└────────────┘ └────────────┘ + seg_0 seg_1 ``` 在内存中,每个 segment 还会存储一些额外的信息以供读写和删除操作使用: @@ -71,7 +71,7 @@ pub struct Segment { /// An optional vector of positions within the segment. record_position: Vec, - + ... } ``` @@ -81,10 +81,10 @@ pub struct Segment { segment 中的日志格式如下: ``` -+---------+--------+--------+------------+--------------+--------------+-------+ -| version | crc | length | table id | sequence num | value length | value | -| (u8) | (u32) | (u32) | (u64) | (u64) | (u32) | | -+---------+--------+--------+------------+--------------+--------------+-------+ ++---------+--------+------------+--------------+--------------+-------+ +| version | crc | table id | sequence num | value length | value | +| (u8) | (u32) | (u64) | (u64) | (u32) |(bytes)| ++---------+--------+------------+--------------+--------------+-------+ ``` 字段说明: @@ -139,4 +139,4 @@ segment 中的日志格式如下: 2. 如果修改后,该表在此 segment 中的最小序列号大于最大序列号,则从 `table_ranges` 中删除该表。 -3. 如果一个 segment 的 `table_ranges` 为空,且不是 id 最大的 segment,则删除该 segment 文件。 \ No newline at end of file +3. 
如果一个 segment 的 `table_ranges` 为空,且不是 id 最大的 segment,则删除该 segment 文件。 diff --git a/content/en/docs/design/wal_on_disk.md b/content/en/docs/design/wal_on_disk.md index 319bad86..02b09bea 100644 --- a/content/en/docs/design/wal_on_disk.md +++ b/content/en/docs/design/wal_on_disk.md @@ -7,10 +7,10 @@ title: "WAL on Disk" This section introduces the implementation of a standalone Write-Ahead Log (WAL, hereinafter referred to as "the log") based on a local disk. In this implementation, the log is managed at the region level. ``` - ┌────────────────────────────┐ - │ HoraeDB │ - │ │ - │ ┌────────────────────────┐ │ + ┌────────────────────────────┐ + │ HoraeDB │ + │ │ + │ ┌────────────────────────┐ │ │ │ WAL │ │ ┌────────────────────────┐ │ │ │ │ │ │ │ │ ...... │ │ │ File System │ @@ -31,7 +31,7 @@ Delete ─────┼─┼─► ┌──────────── │ │ └────────────────────┘ │ │ │ └────────────────────┘ │ │ │ ...... │ │ │ ...... │ │ └────────────────────────┘ │ └────────────────────────┘ - └────────────────────────────┘ + └────────────────────────────┘ ``` ## Data Model @@ -45,20 +45,20 @@ Each region has its own directory to manage all segments for that region. The di Logs for all tables within a region are stored in segments, arranged in ascending order of sequence numbers. The structure of the segment files is as follows: ``` - Segment0 Segment1 -┌────────────┐ ┌────────────┐ -│ Magic Num │ │ Magic Num │ -├────────────┤ ├────────────┤ -│ Record │ │ Record │ -├────────────┤ ├────────────┤ -│ Record │ │ Record │ + Segment0 Segment1 +┌────────────┐ ┌────────────┐ +│ Magic Num │ │ Magic Num │ +├────────────┤ ├────────────┤ +│ Record │ │ Record │ +├────────────┤ ├────────────┤ +│ Record │ │ Record │ ├────────────┤ ├────────────┤ .... -│ Record │ │ Record │ -├────────────┤ ├────────────┤ -│ ... │ │ ... │ -│ │ │ │ -└────────────┘ └────────────┘ - segment_0.wal segment_1.wal +│ Record │ │ Record │ +├────────────┤ ├────────────┤ +│ ... │ │ ... 
│ +│ │ │ │ +└────────────┘ └────────────┘ + seg_0 seg_1 ``` In memory, each segment stores additional information used for read, write, and delete operations: @@ -71,7 +71,7 @@ pub struct Segment { /// An optional vector of positions within the segment. record_position: Vec, - + ... } ``` @@ -81,10 +81,10 @@ pub struct Segment { The log format within a segment is as follows: ``` -+---------+--------+--------+------------+--------------+--------------+-------+ -| version | crc | length | table id | sequence num | value length | value | -| (u8) | (u32) | (u32) | (u64) | (u64) | (u32) | | -+---------+--------+--------+------------+--------------+--------------+-------+ ++---------+--------+------------+--------------+--------------+-------+ +| version | crc | table id | sequence num | value length | value | +| (u8) | (u32) | (u64) | (u64) | (u32) |(bytes)| ++---------+--------+------------+--------------+--------------+-------+ ``` Field Descriptions: @@ -141,4 +141,4 @@ Suppose logs in the table with ID `table_id` and sequence numbers less than `seq 2. If after modification, the minimum sequence number of the table in this segment is greater than the maximum sequence number, remove the table from `table_ranges`. -3. If a segment's `table_ranges` is empty and it is not the segment with the largest ID, delete the segment file. \ No newline at end of file +3. If a segment's `table_ranges` is empty and it is not the segment with the largest ID, delete the segment file. 
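The three deletion steps described above can be sketched in Rust roughly as follows. This is a minimal sketch under stated assumptions, not HoraeDB's implementation: `TableId` and `SequenceNumber` are taken to be `u64`, `Segment` is reduced to a hypothetical `id` field plus `table_ranges`, the function name `mark_deleted` is invented for illustration, and the minimum is only ever raised, which the document does not state explicitly.

```rust
use std::collections::HashMap;

// Assumption: both identifiers are plain u64 values.
type TableId = u64;
type SequenceNumber = u64;

/// Reduced, hypothetical view of the in-memory segment metadata.
struct Segment {
    id: u64,
    /// (min, max) sequence number per table within this segment.
    table_ranges: HashMap<TableId, (SequenceNumber, SequenceNumber)>,
}

/// Applies the documented steps and returns the ids of segments whose
/// files could now be removed. No file is touched here.
fn mark_deleted(
    segments: &mut [Segment],
    table_id: TableId,
    seq_num: SequenceNumber,
) -> Vec<u64> {
    let largest_id = segments.iter().map(|s| s.id).max();
    let mut removable = Vec::new();

    for segment in segments.iter_mut() {
        let mut drop_table = false;
        if let Some((min, max)) = segment.table_ranges.get_mut(&table_id) {
            // Step 1: move the table's minimum to seq_num + 1.
            // Assumption: never lower an already larger minimum.
            *min = (*min).max(seq_num.saturating_add(1));
            // Step 2: an inverted range means nothing is left for this table.
            drop_table = *min > *max;
        }
        if drop_table {
            segment.table_ranges.remove(&table_id);
        }
        // Step 3: an empty segment is removable unless it has the largest id.
        if segment.table_ranges.is_empty() && Some(segment.id) != largest_id {
            removable.push(segment.id);
        }
    }
    removable
}
```

Returning the removable ids instead of deleting files inside the loop keeps the metadata update separate from file system work, which makes the logic easy to test in isolation.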
From 8b7f1c41fab1c59902ca905bdf95b67165fd5d29 Mon Sep 17 00:00:00 2001
From: draco
Date: Fri, 18 Oct 2024 14:46:54 +0800
Subject: [PATCH 3/4] remove length

---
 content/cn/docs/design/wal_on_disk.md | 10 ++++------
 content/en/docs/design/wal_on_disk.md | 10 ++++------
 2 files changed, 8 insertions(+), 12 deletions(-)

diff --git a/content/cn/docs/design/wal_on_disk.md b/content/cn/docs/design/wal_on_disk.md
index 2e439cf0..0d15af47 100644
--- a/content/cn/docs/design/wal_on_disk.md
+++ b/content/cn/docs/design/wal_on_disk.md
@@ -93,15 +93,13 @@ segment 中的日志格式如下:
 
 2. `crc`:用于确保数据一致性。计算从 table id 到该记录结束的 CRC 校验值。
 
-3. `length`:从 table id 到该记录结束的字节数。
+3. `table id`:表的唯一标识符。
 
-4. `table id`:表的唯一标识符。
+4. `sequence num`:记录的序列号。
 
-5. `sequence num`:记录的序列号。
+5. `value length`:value 的字节长度。
 
-6. `value length`:value 的字节长度。
-
-7. `value`:通用日志格式中的值。
+6. `value`:通用日志格式中的值。
 
 日志中不存储 region ID,因为可以通过文件路径获取该信息。
 
diff --git a/content/en/docs/design/wal_on_disk.md b/content/en/docs/design/wal_on_disk.md
index 02b09bea..bc3918c6 100644
--- a/content/en/docs/design/wal_on_disk.md
+++ b/content/en/docs/design/wal_on_disk.md
@@ -93,15 +93,13 @@ Field Descriptions:
 
 2. `crc`: Used to ensure data consistency. Computes the CRC checksum from the table id to the end of the record.
 
-3. `length`: The number of bytes from the table id to the end of the record.
+3. `table id`: The unique identifier of the table.
 
-4. `table id`: The unique identifier of the table.
+4. `sequence num`: The sequence number of the record.
 
-5. `sequence num`: The sequence number of the record.
+5. `value length`: The byte length of the value.
 
-6. `value length`: The byte length of the value.
-
-7. `value`: The value in the general log format.
+6. `value`: The value in the general log format.
 
 The region ID is not stored in the log because it can be obtained from the file path.
 
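The record layout after this change (without the `length` field) might be encoded and verified along these lines. This is a minimal sketch under stated assumptions: little-endian integers, a version value of 1, and the IEEE CRC-32 variant are all guesses, since the document specifies none of them, and the names `crc32`, `encode_record` and `decode_record` are hypothetical.

```rust
use std::convert::TryInto;

// Assumption: the document does not state the actual version value.
const VERSION: u8 = 1;
// version (u8) + crc (u32) + table id (u64) + sequence num (u64) + value length (u32)
const HEADER_LEN: usize = 1 + 4 + 8 + 8 + 4;

/// Bitwise CRC-32 (IEEE, reflected polynomial 0xEDB88320).
/// Assumption: the WAL's real CRC variant is not named in the document.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &byte in data {
        crc ^= byte as u32;
        for _ in 0..8 {
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}

/// Serializes one record; the CRC covers table id through the end of value.
fn encode_record(table_id: u64, sequence_num: u64, value: &[u8]) -> Vec<u8> {
    let mut body = Vec::with_capacity(HEADER_LEN - 5 + value.len());
    body.extend_from_slice(&table_id.to_le_bytes());
    body.extend_from_slice(&sequence_num.to_le_bytes());
    body.extend_from_slice(&(value.len() as u32).to_le_bytes());
    body.extend_from_slice(value);

    let mut record = Vec::with_capacity(5 + body.len());
    record.push(VERSION);
    record.extend_from_slice(&crc32(&body).to_le_bytes());
    record.extend_from_slice(&body);
    record
}

/// Parses one record from the start of `buf`; returns None if the buffer is
/// too short, the version differs, or the CRC does not match.
fn decode_record(buf: &[u8]) -> Option<(u64, u64, Vec<u8>)> {
    if buf.len() < HEADER_LEN || buf[0] != VERSION {
        return None;
    }
    let crc = u32::from_le_bytes(buf[1..5].try_into().ok()?);
    let table_id = u64::from_le_bytes(buf[5..13].try_into().ok()?);
    let sequence_num = u64::from_le_bytes(buf[13..21].try_into().ok()?);
    let value_len = u32::from_le_bytes(buf[21..25].try_into().ok()?) as usize;
    let end = HEADER_LEN.checked_add(value_len)?;
    if buf.len() < end || crc32(&buf[5..end]) != crc {
        return None;
    }
    Some((table_id, sequence_num, buf[HEADER_LEN..end].to_vec()))
}
```

Because `value length` already determines where the record ends, a reader can find the next record at offset `HEADER_LEN + value_len` without a separate `length` field.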
From d4b819e2ec8822c1f83775b413390357deda11cf Mon Sep 17 00:00:00 2001
From: Jiacai Liu
Date: Fri, 18 Oct 2024 15:03:26 +0800
Subject: [PATCH 4/4] Apply suggestions from code review

---
 content/cn/docs/design/wal_on_disk.md | 2 +-
 content/en/docs/design/wal_on_disk.md | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/content/cn/docs/design/wal_on_disk.md b/content/cn/docs/design/wal_on_disk.md
index 0d15af47..c3aa97ba 100644
--- a/content/cn/docs/design/wal_on_disk.md
+++ b/content/cn/docs/design/wal_on_disk.md
@@ -38,7 +38,7 @@ Delete ─────┼─┼─► ┌────────────
 
 ### 文件路径
 
-每个 region 都拥有一个目录,用于管理该 region 的所有 segment。目录名为 region 的 ID。每个 segment 的命名方式为 `segment_.wal`,ID 从 0 开始递增。
+每个 region 都拥有一个目录,用于管理该 region 的所有 segment。目录名为 region 的 ID。每个 segment 的命名方式为 `seg_`,ID 从 0 开始递增。
 
 ### Segment 的格式
 
diff --git a/content/en/docs/design/wal_on_disk.md b/content/en/docs/design/wal_on_disk.md
index bc3918c6..8f9f3a96 100644
--- a/content/en/docs/design/wal_on_disk.md
+++ b/content/en/docs/design/wal_on_disk.md
@@ -38,7 +38,7 @@ Delete ─────┼─┼─► ┌────────────
 
 ### File Paths
 
-Each region has its own directory to manage all segments for that region. The directory is named after the region's ID. Each segment is named using the format `segment_.wal`, with IDs starting from 0 and incrementing.
+Each region has its own directory to manage all segments for that region. The directory is named after the region's ID. Each segment is named using the format `seg_`, with IDs starting from 0 and incrementing.
 
 ### Segment Format
 