Add Iceberg Extension Documentation (#314)
* add ice_berg docu

* Update src/content/docs/extensions/iceberg.mdx

Co-authored-by: Guodong Jin <[email protected]>

* Update src/content/docs/extensions/iceberg.mdx

Co-authored-by: Guodong Jin <[email protected]>

* restructure

* restructure

* restructure

* update table

* update table

* Apply suggestions from code review

* update table

* Fixes

---------

Co-authored-by: Guodong Jin <[email protected]>
Co-authored-by: Prashanth Rao <[email protected]>
Co-authored-by: prrao87 <[email protected]>
4 people authored Dec 19, 2024
1 parent 890397d commit 6bf35c9
Showing 1 changed file with 245 additions and 0 deletions.
245 changes: 245 additions & 0 deletions src/content/docs/extensions/iceberg.md
@@ -0,0 +1,245 @@
---
title: "Iceberg"
---

The `iceberg` extension adds support for scanning and copying from the [Apache Iceberg format](https://iceberg.apache.org/).
Iceberg is an open-source table format originally developed at Netflix for large-scale analytical datasets.

The Iceberg functionality is not available by default, so you first need to install and load the `iceberg`
extension by running the following commands:

```sql
INSTALL iceberg;
LOAD EXTENSION iceberg;
```

At a high level, the `iceberg` extension provides the following functionality:

- Scanning an Iceberg table
- Copying an Iceberg table into a node table
- Accessing the Iceberg metadata
- Listing the Iceberg snapshots

## Usage

To run the examples below, download the [iceberg_tables.zip](https://kuzudb.com/data/iceberg-extension/iceberg_tables.zip) file, unzip it
and place the contents in the `/tmp` directory.

### Scan the Iceberg table

`LOAD FROM` is a Cypher clause that scans a file or object element by element without actually
moving the data into a Kùzu table.

Here's how you would scan an Iceberg table:

```cypher
LOAD FROM '/tmp/iceberg_tables/university' (file_format='iceberg', allow_moved_paths=true) RETURN *;
```
```
┌────────────┬──────┬──────────┐
│ University │ Rank │ Funding  │
├────────────┼──────┼──────────┤
│ Stanford   │ 2    │ 250.300  │
│ Yale       │ 6    │ 190.700  │
│ Harvard    │ 1    │ 210.500  │
│ Cambridge  │ 5    │ 280.200  │
│ MIT        │ 3    │ 170.000  │
│ Oxford     │ 4    │ 300.000  │
└────────────┴──────┴──────────┘
```

:::note[Notes]
- The `file_format` parameter is used to explicitly specify the file format of the given file instead of
letting Kùzu autodetect the file format at runtime. When scanning from the Iceberg table,
the `file_format` option must be provided since Kùzu is not capable of autodetecting Iceberg tables.
- The `allow_moved_paths` option ensures that proper path resolution is performed, which allows scanning
Iceberg tables that are moved from their original location.
:::
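
Because `LOAD FROM` is a reading clause, the scan can be combined with filtering and projection before anything is returned. The following is a minimal sketch, assuming the scanned columns are exposed under the names shown in the output above (`University`, `Rank`, `Funding`):

```cypher
// Scan the Iceberg table, keep only the three top-ranked universities,
// and return a subset of the columns sorted by funding.
LOAD FROM '/tmp/iceberg_tables/university' (file_format='iceberg', allow_moved_paths=true)
WHERE Rank <= 3
RETURN University, Funding
ORDER BY Funding DESC;
```

As with the plain scan, nothing is written to a Kùzu table here; the filter only narrows what the query returns.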

### Copy the Iceberg table into a node table
You can then use a `COPY FROM` statement to directly copy the contents of the Iceberg table into a node table.

```cypher
CREATE NODE TABLE university (name STRING, rank INT64, fund DOUBLE, PRIMARY KEY(name));
COPY university FROM '/tmp/iceberg_tables/university' (file_format='iceberg', allow_moved_paths=true);
```

As with `LOAD FROM`, the `file_format` parameter is mandatory in a `COPY FROM` statement.

Result:
```cypher
// Create the node table
CREATE NODE TABLE university (name STRING, rank INT64, fund DOUBLE, PRIMARY KEY(name));
```
```
┌─────────────────────────────────────┐
│ result │
│ STRING │
├─────────────────────────────────────┤
│ Table university has been created. │
└─────────────────────────────────────┘
```
```cypher
COPY university FROM '/tmp/iceberg_tables/university' (file_format='iceberg', allow_moved_paths=true);
```
```
┌─────────────────────────────────────────────────────┐
│ result │
│ STRING │
├─────────────────────────────────────────────────────┤
│ 6 tuples have been copied to the university table. │
└─────────────────────────────────────────────────────┘
```
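
Once the copy has finished, the data lives in the `university` node table and can be queried like any other node table. A short sketch, using the property names from the `CREATE NODE TABLE` statement above:

```cypher
// Read back the rows that were copied from the Iceberg table.
MATCH (u:university)
RETURN u.name, u.rank, u.fund
ORDER BY u.rank;
```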

### Access Iceberg metadata
At the heart of Iceberg’s table structure is the metadata, which tracks everything from the schema to partition information
and snapshots of the table's state. This metadata is particularly useful for understanding the underlying structure, tracking data
changes, and debugging issues in Iceberg datasets.

The `ICEBERG_METADATA` function lists the metadata files for an Iceberg table and is invoked with the `CALL` clause in Kùzu.

:::caution[Note]
When passing options to a function invoked with `CALL`, use the `:=` operator, not the `=` operator.
Kùzu requires `:=` in this position.
:::

```cypher
CALL ICEBERG_METADATA(
'/tmp/iceberg_tables/lineitem_iceberg',
allow_moved_paths := true
)
RETURN *;
```

```
┌──────────────────────────┬──────────────────────────┬──────────────────┬─────────┬──────────┬──────────────────────────┬─────────────┬──────────────┐
│ manifest_path │ manifest_sequence_number │ manifest_content │ status │ content │ file_path │ file_format │ record_count │
│ STRING │ INT64 │ STRING │ STRING │ STRING │ STRING │ STRING │ INT64 │
├──────────────────────────┼──────────────────────────┼──────────────────┼─────────┼──────────┼──────────────────────────┼─────────────┼──────────────┤
│ lineitem_iceberg/meta... │ 2 │ DATA │ ADDED │ EXISTING │ lineitem_iceberg/data... │ PARQUET │ 51793 │
│ lineitem_iceberg/meta... │ 2 │ DATA │ DELETED │ EXISTING │ lineitem_iceberg/data... │ PARQUET │ 60175 │
└──────────────────────────┴──────────────────────────┴──────────────────┴─────────┴──────────┴──────────────────────────┴─────────────┴──────────────┘
```
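
The metadata output can also be narrowed to the columns of interest. The sketch below uses the column names from the output above and assumes that the result of `CALL` can be filtered with a `WHERE` clause like other reading clauses in Kùzu:

```cypher
// List only the data files that were added, with their record counts.
CALL ICEBERG_METADATA(
    '/tmp/iceberg_tables/lineitem_iceberg',
    allow_moved_paths := true
)
WHERE status = 'ADDED'
RETURN file_path, record_count;
```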

### List Iceberg snapshots
Iceberg tables maintain a series of snapshots, which are consistent views of the table at a specific point in time.
Snapshots are the core of Iceberg’s versioning system, allowing you to track, query, and manage changes to your table over time.

The `ICEBERG_SNAPSHOTS` function lists the snapshots of an Iceberg table and is also invoked with the `CALL` clause.
Note that for snapshots, you do not need to specify the `allow_moved_paths` option.

```cypher
CALL ICEBERG_SNAPSHOTS('/tmp/iceberg_tables/lineitem_iceberg') RETURN *;
```

```
┌─────────────────┬─────────────────────┬─────────────────────────┬────────────────────────────────────────────────────────────────────────────────────────────────┐
│ sequence_number │ snapshot_id │ timestamp_ms │ manifest_list │
│ UINT64 │ UINT64 │ TIMESTAMP │ STRING │
├─────────────────┼─────────────────────┼─────────────────────────┼────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 1 │ 3776207205136740581 │ 2023-02-15 15:07:54.504 │ lineitem_iceberg/metadata/snap-3776207205136740581-1-cf3d0be5-cf70-453d-ad8f-48fdc412e608.avro │
│ 2 │ 7635660646343998149 │ 2023-02-15 15:08:14.73 │ lineitem_iceberg/metadata/snap-7635660646343998149-1-10eaca8a-1e1c-421e-ad6d-b232e5ee23d3.avro │
└─────────────────┴─────────────────────┴─────────────────────────┴────────────────────────────────────────────────────────────────────────────────────────────────┘
```
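
To look at only the most recent snapshot, the output can be sorted and limited. This is a sketch that relies on the column names shown above and on `sequence_number` increasing with each new snapshot:

```cypher
// Return the snapshot with the highest sequence number.
CALL ICEBERG_SNAPSHOTS('/tmp/iceberg_tables/lineitem_iceberg')
RETURN sequence_number, snapshot_id, timestamp_ms
ORDER BY sequence_number DESC
LIMIT 1;
```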

### Optional parameters

The following optional parameters are supported in the Iceberg extension:

<div class="scroll-table">

| Parameter | Type | Default | Description |
|-------------------------------|-----------|--------------|---------------------------------------------------------------------------------|
| allow_moved_paths | BOOLEAN | `false` | Allows scanning Iceberg tables that are not located in their original directory |
| metadata_compression_codec    | STRING    | `''`         | Specifies the compression codec used for the metadata files (currently only `gzip` is supported) |
| version                       | STRING    | `'?'`        | Provides an explicit Iceberg version number; if not provided, the version number is determined from `version-hint.text` |
| version_name_format           | STRING    | `'v%s%s.metadata.json,%s%s.metadata.json'` | Provides the comma-separated name format pattern(s) used to find the correct metadata file |

</div>

More details on usage are provided below.

#### Select metadata version
By default, the `iceberg` extension will look for a `version-hint.text` file to identify the proper metadata version to use.
This can be overridden by explicitly supplying a version number via the `version` parameter to Iceberg table functions.

Example:
```cypher
LOAD FROM '/tmp/iceberg_tables/lineitem_iceberg' (
file_format='iceberg',
allow_moved_paths=true,
version='2'
)
RETURN *;
```
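
Since the `version` parameter is described as applying to the Iceberg table functions in general, it should also be usable with the `CALL`-based functions. The following sketch assumes that `ICEBERG_METADATA` accepts `version` as a named option in the same way it accepts `allow_moved_paths`; inside `CALL`, options are set with `:=`:

```cypher
// Assumption: `version` is accepted here as a named option, like allow_moved_paths.
CALL ICEBERG_METADATA(
    '/tmp/iceberg_tables/lineitem_iceberg',
    allow_moved_paths := true,
    version := '2'
)
RETURN *;
```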

#### Change metadata compression codec
By default, this extension will look for both `v{version}.metadata.json` and `{version}.metadata.json` files for metadata, or `v{version}.gz.metadata.json` and `{version}.gz.metadata.json` when `metadata_compression_codec = 'gzip'` is specified.
Other compression codecs are NOT supported.

```cypher
LOAD FROM '/tmp/iceberg_tables/lineitem_iceberg_gz' (
file_format='iceberg',
allow_moved_paths=true,
metadata_compression_codec = 'gzip'
)
RETURN *;
```

#### Change metadata name format
To change the metadata naming format, use the `version_name_format` option. For example, if your metadata file is named `rev-2.metadata.json`, set `version_name_format = 'rev-%s.metadata.json'` so that the metadata file can be found.

```cypher
LOAD FROM '/tmp/iceberg_tables/lineitem_iceberg_alter_name' (
file_format='iceberg',
allow_moved_paths=true,
version_name_format = 'rev-%s.metadata.json'
)
RETURN *;
```

### Access an Iceberg table hosted on S3
Kùzu also supports scanning and copying an Iceberg table hosted on S3 in the same way as from a local file system.
Before reading from S3, you have to configure the connection using the [CALL](https://kuzudb.com/docusaurus/cypher/configuration) statement.

#### Supported options

| Option name | Description |
|----------|----------|
| `s3_access_key_id` | S3 access key id |
| `s3_secret_access_key` | S3 secret access key |
| `s3_endpoint` | S3 endpoint |
| `s3_url_style` | Uses [S3 URL style](https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html) (must be either `vhost` or `path`) |
| `s3_region` | S3 region |
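
The options above are set one at a time with `CALL` statements before the first S3 query. The sketch below follows Kùzu's `CALL option=value` configuration syntax; every value shown is a placeholder or an illustrative choice, not a default:

```cypher
// Placeholder credentials: replace with your own values.
CALL s3_access_key_id='<your-access-key-id>';
CALL s3_secret_access_key='<your-secret-access-key>';
// Illustrative endpoint, URL style, and region.
CALL s3_endpoint='s3.amazonaws.com';
CALL s3_url_style='vhost';
CALL s3_region='us-east-1';
```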


#### Requirements on the S3 server API

| Feature | Required S3 API features |
|----------|----------|
| Public file reads | HTTP Range request |
| Private file reads | Secret key authentication|

#### Scan Iceberg table from S3
Reading or scanning an Iceberg table on S3 is as simple as reading from regular files:

```cypher
LOAD FROM 's3://path/to/iceberg_table' (file_format='iceberg', allow_moved_paths=true)
RETURN *;
```

#### Copy Iceberg table hosted on S3 into a local node table

Copying from Iceberg tables on S3 is also as simple as copying from regular files:

```cypher
CREATE NODE TABLE student (name STRING, ID INT64, PRIMARY KEY(ID));
COPY student FROM 's3://path/to/iceberg_table' (file_format='iceberg', allow_moved_paths=true);
```

## Limitations

When using the Iceberg extension in Kùzu, keep the following limitations in mind.

- Writing (i.e., exporting to) Iceberg tables is currently not supported.
- We currently do not support scanning/copying nested data (i.e., of type `STRUCT`) in the Iceberg table columns.
