diff --git a/content/cn/docs/design/wal_on_disk.md b/content/cn/docs/design/wal_on_disk.md new file mode 100644 index 00000000..c3aa97ba --- /dev/null +++ b/content/cn/docs/design/wal_on_disk.md @@ -0,0 +1,140 @@ +--- +title: "基于本地磁盘的 WAL" +--- + +## 架构 + +本节将介绍基于本地磁盘的单机版 WAL(Write-Ahead Log,以下简称日志)的实现。在此实现中,日志按 region 级别进行管理。 + +``` + ┌────────────────────────────┐ + │ HoraeDB │ + │ │ + │ ┌────────────────────────┐ │ + │ │ WAL │ │ ┌────────────────────────┐ + │ │ │ │ │ │ + │ │ ...... │ │ │ File System │ + │ │ │ │ │ │ + │ │ ┌────────────────────┐ │ │ manage │ ┌────────────────────┐ │ + Write ─────┼─┼─► Region ├─┼─┼─────────┼─► Region Dir │ │ + │ │ │ │ │ │ │ │ │ │ + Read ─────┼─┼─► ┌────────────┐ │ │ │ mmap │ │ ┌────────────────┐ │ │ + │ │ │ │ Segment 0 ├───┼─┼─┼─────────┼─┼─► Segment File 0 │ │ │ + │ │ │ └────────────┘ │ │ │ │ │ └────────────────┘ │ │ +Delete ─────┼─┼─► ┌────────────┐ │ │ │ mmap │ │ ┌────────────────┐ │ │ + │ │ │ │ Segment 1 ├───┼─┼─┼─────────┼─┼─► SegmenteFile 1 │ │ │ + │ │ │ └────────────┘ │ │ │ │ │ └────────────────┘ │ │ + │ │ │ ┌────────────┐ │ │ │ mmap │ │ ┌────────────────┐ │ │ + │ │ │ │ Segment 2 ├───┼─┼─┼─────────┼─┼─► SegmenteFile 2 │ │ │ + │ │ │ └────────────┘ │ │ │ │ │ └────────────────┘ │ │ + │ │ │ ...... │ │ │ │ │ ...... │ │ + │ │ └────────────────────┘ │ │ │ └────────────────────┘ │ + │ │ ...... │ │ │ ...... │ + │ └────────────────────────┘ │ └────────────────────────┘ + └────────────────────────────┘ +``` + +## 数据模型 + +### 文件路径 + +每个 region 都拥有一个目录,用于管理该 region 的所有 segment。目录名为 region 的 ID。每个 segment 的命名方式为 `seg_`,ID 从 0 开始递增。 + +### Segment 的格式 + +一个 region 中所有表的日志都存储在 segments 中,并按照 sequence number 从小到大排列。segment 文件的结构如下: + +``` + Segment0 Segment1 +┌────────────┐ ┌────────────┐ +│ Magic Num │ │ Magic Num │ +├────────────┤ ├────────────┤ +│ Record │ │ Record │ +├────────────┤ ├────────────┤ +│ Record │ │ Record │ +├────────────┤ ├────────────┤ .... +│ Record │ │ Record │ +├────────────┤ ├────────────┤ +│ ... │ │ ... │ +│ │ │ │ +└────────────┘ └────────────┘ + seg_0 seg_1 +``` + +在内存中,每个 segment 还会存储一些额外的信息以供读写和删除操作使用: + +``` +pub struct Segment { + /// A hashmap storing both min and max sequence numbers of records within + /// this segment for each `TableId`. + table_ranges: HashMap, + + /// An optional vector of positions within the segment. + record_position: Vec, + + ... +} +``` + +### 日志格式 + +segment 中的日志格式如下: + +``` ++---------+--------+------------+--------------+--------------+-------+ +| version | crc | table id | sequence num | value length | value | +| (u8) | (u32) | (u64) | (u64) | (u32) |(bytes)| ++---------+--------+------------+--------------+--------------+-------+ +``` + +字段说明: + +1. `version`:日志版本号。 + +2. `crc`:用于确保数据一致性。计算从 table id 到该记录结束的 CRC 校验值。 + +3. `table id`:表的唯一标识符。 + +4. `sequence num`:记录的序列号。 + +5. `value length`:value 的字节长度。 + +6. `value`:通用日志格式中的值。 + +日志中不存储 region ID,因为可以通过文件路径获取该信息。 + +## 主要流程 + +### 打开 Wal + +1. 识别 Wal 目录下的所有 region 目录。 + +2. 在每个 region 目录下,识别所有 segment 文件。 + +3. 打开每个 segment 文件,遍历其中的所有日志,记录其中每个日志开始和结束的偏移量和每个 `TableId` 在该 segment 中的最小和最大序列号,然后关闭文件。 + +4. 如果不存在 region 目录或目录下没有任何 segment 文件,则自动创建相应的目录和文件。 + +### 读日志 + +1. 根据 segment 的元数据,确定本次读取操作涉及的所有 segment。 +2. 按照 id 从小到大的顺序,依次打开这些 segment,将原始字节解码为日志。 + +### 写日志 + +1. 将待写入的日志序列化为字节数据,追加到 id 最大的 segment 文件中。 +2. 每个 segment 创建时预分配固定大小的 64MB,不会动态改变。当预分配的空间用完后,创建一个新的 segment,并切换到新的 segment 继续追加。 + +3. 每次追加后不会立即调用 flush;默认情况下,每写入十次或在 segment 文件关闭时才执行 flush。 + +4. 在内存中更新 segment 的元数据 `table_ranges`。 + +### 删除日志 + +假设需要将 id 为 `table_id` 的表中,序列号小于 seq_num 的日志标记为删除: + +1. 在内存中更新相关 segment 的 `table_ranges` 字段,将该表的最小序列号更新为 seq_num + 1。 + +2. 如果修改后,该表在此 segment 中的最小序列号大于最大序列号,则从 `table_ranges` 中删除该表。 + +3. 如果一个 segment 的 `table_ranges` 为空,且不是 id 最大的 segment,则删除该 segment 文件。 diff --git a/content/en/docs/design/wal_on_disk.md b/content/en/docs/design/wal_on_disk.md new file mode 100644 index 00000000..8f9f3a96 --- /dev/null +++ b/content/en/docs/design/wal_on_disk.md @@ -0,0 +1,142 @@ +--- +title: "WAL on Disk" +--- + +## Architecture + +This section introduces the implementation of a standalone Write-Ahead Log (WAL, hereinafter referred to as "the log") based on a local disk. In this implementation, the log is managed at the region level. + +``` + ┌────────────────────────────┐ + │ HoraeDB │ + │ │ + │ ┌────────────────────────┐ │ + │ │ WAL │ │ ┌────────────────────────┐ + │ │ │ │ │ │ + │ │ ...... │ │ │ File System │ + │ │ │ │ │ │ + │ │ ┌────────────────────┐ │ │ manage │ ┌────────────────────┐ │ + Write ─────┼─┼─► Region ├─┼─┼─────────┼─► Region Dir │ │ + │ │ │ │ │ │ │ │ │ │ + Read ─────┼─┼─► ┌────────────┐ │ │ │ mmap │ │ ┌────────────────┐ │ │ + │ │ │ │ Segment 0 ├───┼─┼─┼─────────┼─┼─► Segment File 0 │ │ │ + │ │ │ └────────────┘ │ │ │ │ │ └────────────────┘ │ │ +Delete ─────┼─┼─► ┌────────────┐ │ │ │ mmap │ │ ┌────────────────┐ │ │ + │ │ │ │ Segment 1 ├───┼─┼─┼─────────┼─┼─► Segment File 1 │ │ │ + │ │ │ └────────────┘ │ │ │ │ │ └────────────────┘ │ │ + │ │ │ ┌────────────┐ │ │ │ mmap │ │ ┌────────────────┐ │ │ + │ │ │ │ Segment 2 ├───┼─┼─┼─────────┼─┼─► Segment File 2 │ │ │ + │ │ │ └────────────┘ │ │ │ │ │ └────────────────┘ │ │ + │ │ │ ...... │ │ │ │ │ ...... │ │ + │ │ └────────────────────┘ │ │ │ └────────────────────┘ │ + │ │ ...... │ │ │ ...... │ + │ └────────────────────────┘ │ └────────────────────────┘ + └────────────────────────────┘ +``` + +## Data Model + +### File Paths + +Each region has its own directory to manage all segments for that region. The directory is named after the region's ID. Each segment is named using the format `seg_`, with IDs starting from 0 and incrementing. + +### Segment Format + +Logs for all tables within a region are stored in segments, arranged in ascending order of sequence numbers. The structure of the segment files is as follows: + +``` + Segment0 Segment1 +┌────────────┐ ┌────────────┐ +│ Magic Num │ │ Magic Num │ +├────────────┤ ├────────────┤ +│ Record │ │ Record │ +├────────────┤ ├────────────┤ +│ Record │ │ Record │ +├────────────┤ ├────────────┤ .... +│ Record │ │ Record │ +├────────────┤ ├────────────┤ +│ ... │ │ ... │ +│ │ │ │ +└────────────┘ └────────────┘ + seg_0 seg_1 +``` + +In memory, each segment stores additional information used for read, write, and delete operations: + +```rust +pub struct Segment { + /// A hashmap storing both min and max sequence numbers of records within + /// this segment for each `TableId`. + table_ranges: HashMap, + + /// An optional vector of positions within the segment. + record_position: Vec, + + ... +} +``` + +### Log Format + +The log format within a segment is as follows: + +``` ++---------+--------+------------+--------------+--------------+-------+ +| version | crc | table id | sequence num | value length | value | +| (u8) | (u32) | (u64) | (u64) | (u32) |(bytes)| ++---------+--------+------------+--------------+--------------+-------+ +``` + +Field Descriptions: + +1. `version`: Log version number. + +2. `crc`: Used to ensure data consistency. Computes the CRC checksum from the table id to the end of the record. + +3. `table id`: The unique identifier of the table. + +4. `sequence num`: The sequence number of the record. + +5. `value length`: The byte length of the value. + +6. `value`: The value in the general log format. + +The region ID is not stored in the log because it can be obtained from the file path. + +## Main Processes + +### Opening the WAL + +1. Identify all region directories under the WAL directory. + +2. In each region directory, identify all segment files. + +3. Open each segment file, traverse all logs within it, record the start and end offsets of each log, and record the minimum and maximum sequence numbers of each `TableId` in the segment, then close the file. + +4. If there is no region directory or there are no segment files under the directory, automatically create the corresponding directory and files. + +### Reading Logs + +1. Based on the metadata of the segments, determine all segments involved in the current read operation. + +2. Open these segments in order of their IDs from smallest to largest, and decode the raw bytes into logs. + +### Writing Logs + +1. Serialize the logs to be written into byte data and append them to the segment file with the largest ID. + +2. When a segment is created, it pre-allocates a fixed size of 64MB and will not change dynamically. When the pre-allocated space is used up, a new segment is created, and appending continues in the new segment. + +3. After each append, `flush` is not called immediately; by default, `flush` is performed every ten writes or when the segment file is closed. + +4. Update the segment's metadata `table_ranges` in memory. + +### Deleting Logs + +Suppose logs in the table with ID `table_id` and sequence numbers less than `seq_num` need to be marked as deleted: + +1. Update the `table_ranges` field of the relevant segments in memory, updating the minimum sequence number of the table to `seq_num + 1`. + +2. If after modification, the minimum sequence number of the table in this segment is greater than the maximum sequence number, remove the table from `table_ranges`. + +3. If a segment's `table_ranges` is empty and it is not the segment with the largest ID, delete the segment file.