-
Notifications
You must be signed in to change notification settings - Fork 12
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add Iceberg Extension Documentation (#314)
* add ice_berg docu * Update src/content/docs/extensions/iceberg.mdx Co-authored-by: Guodong Jin <[email protected]> * Update src/content/docs/extensions/iceberg.mdx Co-authored-by: Guodong Jin <[email protected]> * restructure * restructure * restructure * update table * update table * Apply suggestions from code review * update table * Fixes --------- Co-authored-by: Guodong Jin <[email protected]> Co-authored-by: Prashanth Rao <[email protected]> Co-authored-by: prrao87 <[email protected]>
- Loading branch information
1 parent
890397d
commit 6bf35c9
Showing
1 changed file
with
245 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,245 @@ | ||
--- | ||
title: "Iceberg" | ||
--- | ||
|
||
The `iceberg` extension adds support for scanning and copying from the [Apache Iceberg format](https://iceberg.apache.org/). | ||
Iceberg is an open-source table format originally developed at Netflix for large-scale analytical datasets. | ||
|
||
The Iceberg functionality is not available by default, so you would first need to install the `iceberg` | ||
extension by running the following commands: | ||
|
||
```sql | ||
INSTALL iceberg; | ||
LOAD EXTENSION iceberg; | ||
``` | ||
|
||
At a high level, the `iceberg` extension provides the following functionality: | ||
|
||
- Scanning an Iceberg table | ||
- Copying an Iceberg table into a node table | ||
- Accessing the Iceberg metadata | ||
- Listing the Iceberg snapshots | ||
|
||
## Usage | ||
|
||
To run the examples below, download the [iceberg_tables.zip](https://kuzudb.com/data/iceberg-extension/iceberg_tables.zip) file, unzip it | ||
and place the contents in the `/tmp` directory. | ||
|
||
### Scan the Iceberg table | ||
|
||
`LOAD FROM` is a Cypher query that scans a file or object element by element, but doesn’t actually | ||
move the data into a Kùzu table. | ||
|
||
Here's how you would scan an Iceberg table: | ||
|
||
```cypher | ||
LOAD FROM '/tmp/iceberg_tables/university' (file_format='iceberg', allow_moved_paths=true) RETURN *; | ||
``` | ||
``` | ||
┌────────────┬──────┬──────────┐ | ||
| University | Rank | Funding | | ||
├────────────┼──────┼──────────┤ | ||
| Stanford | 2 | 250.300 | | ||
| Yale | 6 | 190.700 | | ||
| Harvard | 1 | 210.500 | | ||
| Cambridge | 5 | 280.200 | | ||
| MIT | 3 | 170.000 | | ||
| Oxford | 4 | 300.000 | | ||
└────────────┴──────┴──────────┘ | ||
``` | ||
|
||
:::note[Notes] | ||
- The `file_format` parameter is used to explicitly specify the file format of the given file instead of | ||
letting Kùzu autodetect the file format at runtime. When scanning from the Iceberg table, | ||
the `file_format` option must be provided since Kùzu is not capable of autodetecting Iceberg tables. | ||
- The `allow_moved_paths` option ensures that proper path resolution is performed, which allows scanning | ||
Iceberg tables that are moved from their original location. | ||
::: | ||
|
||
### Copy the Iceberg table into a node table | ||
You can then use a `COPY FROM` statement to directly copy the contents of the Iceberg table into a node table. | ||
|
||
```cypher | ||
CREATE NODE TABLE university (name STRING, age INT64, PRIMARY KEY(age)); | ||
COPY student FROM '/tmp/iceberg_tables/person_table' (file_format='iceberg', allow_moved_paths=true) | ||
``` | ||
|
||
Just like above in `LOAD FROM`, the `file_format` parameter is mandatory when specifying the `COPY FROM` clause as well. | ||
|
||
Result: | ||
```cypher | ||
// Create the node table | ||
CREATE NODE TABLE university (name STRING, rank INT64, fund double, PRIMARY KEY(name)); | ||
``` | ||
``` | ||
┌─────────────────────────────────────┐ | ||
│ result │ | ||
│ STRING │ | ||
├─────────────────────────────────────┤ | ||
│ Table university has been created. │ | ||
└─────────────────────────────────────┘ | ||
``` | ||
```cypher | ||
COPY university FROM '/tmp/iceberg_tables/university' (file_format='iceberg', allow_moved_paths=true); | ||
``` | ||
``` | ||
┌─────────────────────────────────────────────────────┐ | ||
│ result │ | ||
│ STRING │ | ||
├─────────────────────────────────────────────────────┤ | ||
│ 6 tuples have been copied to the university table. │ | ||
└─────────────────────────────────────────────────────┘ | ||
``` | ||
|
||
### Access Iceberg metadata | ||
At the heart of Iceberg’s table structure is the metadata, which tracks everything from the schema, to partition information | ||
and snapshots of the table's state. This is particularly useful for understanding the underlying structure, tracking data | ||
changes, and debugging issues in Iceberg datasets. | ||
|
||
The `ICEBERG_METADATA` function lists the metadata files for an Iceberg table via the `CALL` function in Kùzu. | ||
|
||
:::caution[Note] | ||
Ensure you use `:=` operator to set the variable in `CALL` function, not the `=` operator. | ||
The `:=` operator is required within `CALL` functions in Kùzu. | ||
::: | ||
|
||
```cypher | ||
CALL ICEBERG_METADATA( | ||
'/tmp/iceberg_tables/lineitem_iceberg', | ||
allow_moved_paths := true | ||
) | ||
RETURN *; | ||
``` | ||
|
||
``` | ||
┌──────────────────────────┬──────────────────────────┬──────────────────┬─────────┬──────────┬──────────────────────────┬─────────────┬──────────────┐ | ||
│ manifest_path │ manifest_sequence_number │ manifest_content │ status │ content │ file_path │ file_format │ record_count │ | ||
│ STRING │ INT64 │ STRING │ STRING │ STRING │ STRING │ STRING │ INT64 │ | ||
├──────────────────────────┼──────────────────────────┼──────────────────┼─────────┼──────────┼──────────────────────────┼─────────────┼──────────────┤ | ||
│ lineitem_iceberg/meta... │ 2 │ DATA │ ADDED │ EXISTING │ lineitem_iceberg/data... │ PARQUET │ 51793 │ | ||
│ lineitem_iceberg/meta... │ 2 │ DATA │ DELETED │ EXISTING │ lineitem_iceberg/data... │ PARQUET │ 60175 │ | ||
└──────────────────────────┴──────────────────────────┴──────────────────┴─────────┴──────────┴──────────────────────────┴─────────────┴──────────────┘ | ||
``` | ||
|
||
### List Iceberg snapshots | ||
Iceberg tables maintain a series of snapshots, which are consistent views of the table at a specific point in time. | ||
Snapshots are the core of Iceberg’s versioning system, allowing you to track, query, and manage changes to your table over time. | ||
|
||
The `ICEBERG_SNAPSHOTS` function lists the snapshots for an Iceberg table via the `CALL` function. | ||
Note that for snapshots, you do not need to specify the `allow_moved_paths` option. | ||
|
||
```cypher | ||
CALL ICEBERG_SNAPSHOTS('/tmp/iceberg_tables/lineitem_iceberg') RETURN *; | ||
``` | ||
|
||
``` | ||
┌─────────────────┬─────────────────────┬─────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────┐ | ||
│ sequence_number │ snapshot_id │ timestamp_ms │ manifest_list │ | ||
│ UINT64 │ UINT64 │ TIMESTAMP │ STRING │ | ||
├─────────────────┼─────────────────────┼─────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────┤ | ||
│ 1 │ 3776207205136740581 │ 2023-02-15 15:07:54.504 │ lineitem_iceberg/metadata/snap-3776207205136740581-1-cf3d0be5-cf70-453d-ad8f-48fdc412e608.avro │ | ||
│ 2 │ 7635660646343998149 │ 2023-02-15 15:08:14.73 │ lineitem_iceberg/metadata/snap-7635660646343998149-1-10eaca8a-1e1c-421e-ad6d-b232e5ee23d3.avro │ | ||
└─────────────────┴─────────────────────┴─────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────┘ | ||
``` | ||
|
||
### Optional parameters | ||
|
||
The following optional parameters are supported in the Iceberg extension: | ||
|
||
<div class="scroll-table"> | ||
|
||
| Parameter | Type | Default | Description | | ||
|-------------------------------|-----------|--------------|---------------------------------------------------------------------------------| | ||
| allow_moved_paths | BOOLEAN | `false` | Allows scanning Iceberg tables that are not located in their original directory | | ||
| metadata_compression_codec | STRING | `''` | Specifies the compression code used for the metadata files (currenly only supports `gzip`) | | ||
| version | STRING | `'?'` | Provides an explicit Iceberg version number, if not provided, the Iceberg version number would be determined from `version-hint.txt`| | ||
| version_name_format | STRING | `'v%s%s.metadata.json,%s%s.metadata.json'` | Provides the regular expression to find the correct metadata data file | | ||
|
||
</div> | ||
|
||
More details on usage are provided below. | ||
|
||
#### Select metadata version | ||
By default, the `iceberg` extension will look for a `version-hint.text` file to identify the proper metadata version to use. | ||
This can be overridden by explicitly supplying a version number via the `version` parameter to Iceberg table functions. | ||
|
||
Example: | ||
```cypher | ||
LOAD FROM '/tmp/iceberg_tables/lineitem_iceberg' ( | ||
file_format='iceberg', | ||
allow_moved_paths=true, | ||
version='2' | ||
) | ||
RETURN *; | ||
``` | ||
|
||
#### Change metadata compression codec | ||
By default, this extension will look for both `v{version}.metadata.json` and `{version}.metadata.json` files for metadata, or `v{version}.gz.metadata.json` and `{version}.gz.metadata.json` when `metadata_compression_codec = 'gzip'` is specified. | ||
Other compression codecs are NOT supported. | ||
|
||
```cypher | ||
LOAD FROM '/tmp/iceberg_tables/lineitem_iceberg_gz' ( | ||
file_format='iceberg', | ||
allow_moved_paths=true, | ||
metadata_compression_codec = 'gzip' | ||
) | ||
RETURN *; | ||
``` | ||
|
||
#### Change metadata name format | ||
To change the metadata naming format, use the `version_name_format` option, for example, if your metadata is named as `rev-2.metadata.json`, set this option as `version_name_format = 'rev-%s.metadata.json` to make sure the metadata file can be found successfully. | ||
|
||
```cypher | ||
LOAD FROM '/tmp/iceberg_tables/lineitem_iceberg_alter_name' ( | ||
file_format='iceberg', | ||
allow_moved_paths=true, | ||
version_name_format = 'rev-%s.metadata.json' | ||
) | ||
RETURN *; | ||
``` | ||
|
||
### Access an Iceberg table hosted on S3 | ||
Kùzu also supports scanning/copying a Iceberg table hosted on S3 in the same way as from a local file system. | ||
Before reading and writing from S3, you have to configure the connection using the [CALL](https://kuzudb.com/docusaurus/cypher/configuration) statement. | ||
|
||
#### Supported options | ||
|
||
| Option name | Description | | ||
|----------|----------| | ||
| `s3_access_key_id` | S3 access key id | | ||
| `s3_secret_access_key` | S3 secret access key | | ||
| `s3_endpoint` | S3 endpoint | | ||
| `s3_url_style` | Uses [S3 url style](https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html) (should either be vhost or path) | | ||
| `s3_region` | S3 region | | ||
|
||
|
||
#### Requirements on the S3 server API | ||
|
||
| Feature | Required S3 API features | | ||
|----------|----------| | ||
| Public file reads | HTTP Range request | | ||
| Private file reads | Secret key authentication| | ||
|
||
#### Scan Iceberg table from S3 | ||
Reading or scanning a Iceberg table that's on S3 is as simple as reading from regular files: | ||
|
||
```cypher | ||
LOAD FROM 's3://path/to/iceberg_table' (file_format='iceberg', allow_moved_paths=true) | ||
RETURN * | ||
``` | ||
|
||
#### Copy Iceberg table hosted on S3 into a local node table | ||
|
||
Copying from Iceberg tables on S3 is also as simple as copying from regular files: | ||
|
||
```cypher | ||
CREATE NODE TABLE student (name STRING, ID INT64, PRIMARY KEY(ID)); | ||
COPY student FROM 's3://path/to/iceberg_table' (file_format='iceberg', allow_moved_paths=true) | ||
``` | ||
|
||
## Limitations | ||
|
||
When using the Iceberg extension in Kùzu, keep the following limitations in mind. | ||
|
||
- Writing (i.e., exporting to) Iceberg tables is currently not supported. | ||
- We currently do not support scanning/copying nested data (i.e., of type `STRUCT`) in the Iceberg table columns. |