Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#1134] docs: Add user doc of expression #1991

Merged
merged 2 commits into from
Feb 3, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
135 changes: 135 additions & 0 deletions docs/expression.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,135 @@
---
title: "Expression system of Gravitino"
slug: /expression
date: 2024-02-02
keyword: expression function field literal reference
license: Copyright 2024 Datastrato Pvt Ltd. This software is licensed under the Apache License version 2.
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

This page introduces the expression system of Gravitino. Expressions are vital component of metadata definition, through expressions, you can define default values for columns and function arguments for [function partitioning](./table-partitioning-bucketing-sort-order-indexes.md#table-partitioning), [bucketing](./table-partitioning-bucketing-sort-order-indexes.md#table-bucketing), and sort term of [sort ordering](./table-partitioning-bucketing-sort-order-indexes.md#sort-ordering) in tables.
Gravitino expression system divides expressions into three basic parts: field reference, literal, and function. Function expressions can contain field references, literals, and other function expressions.

## Field reference

Field reference is a reference to a field in a table.
The following is an example of creating a field reference expression, demonstrating how to create a reference for the `student` field.

<Tabs>
<TabItem value="Json" label="Json">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it Json or json for value? I see that below in Java it is value="java".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After local testing, both can be rendered correctly.

This writing appears in other documents as well, should I change it to keep the styling consistent?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK if it can be rendered correctly. If you want to unify them you can do it in another PR.


```json
[
{
"type": "field",
"fieldName": [
"student"
]
}
]
```

</TabItem>
<TabItem value="java" label="Java">

```java
NamedReference field = NamedReference.field("student");
```

</TabItem>
</Tabs>

## Literal

Literal is a constant value.
The following is an example of creating a literal expression, demonstrating how to create three different data types of literal expressions for the value `1024`.

<Tabs>
<TabItem value="Json" label="Json">

```json
[
{
"type": "literal",
"dataType": "integer",
"value": "1024"
},
{
"type": "literal",
"dataType": "string",
"value": "1024"
},
{
"type": "literal",
"dataType": "decimal(10,2)",
"value": "1024"
}
]
```

</TabItem>
<TabItem value="java" label="Java">

```java
Literal<?>[] literals =
new Literal[] {
Literals.integerLiteral(1024),
Literals.stringLiteral("1024"),
Literals.decimalLiteral(Decimal.of("1024", 10, 2))
};
```

</TabItem>
</Tabs>

## Function expression

Function expression represents a function call with/without arguments. The arguments can be field references, literals, or other function expressions.
The following is an example of creating a function expression, demonstrating how to create function expressions for `rand()` and `date_trunc('year', birthday)`.

<Tabs>
<TabItem value="Json" label="Json">

```json
[
{
"type": "function",
"funcName": "rand",
"funcArgs": []
},
{
"type": "function",
"funcName": "date_trunc",
"funcArgs": [
{
"type": "literal",
"dataType": "string",
"value": "year"
},
{
"type": "field",
"fieldName": [
"birthday"
]
}
]
}
]
```

</TabItem>
<TabItem value="java" label="Java">

```java
FunctionExpression[] functionExpressions =
new FunctionExpression[] {
FunctionExpression.of("rand"),
FunctionExpression.of("date_trunc", Literals.stringLiteral("year"), NamedReference.field("birthday"))
};
```

</TabItem>
</Tabs>

89 changes: 27 additions & 62 deletions docs/table-partitioning-bucketing-sort-order-indexes.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,76 +22,41 @@ To create a partitioned table, you should provide the following two components t
The `score`, `createTime`, and `city` appearing in the table below refer to the field names in a table.
:::

| Partitioning strategy | Description | JSON example | Java example | Equivalent SQL semantics |
|-----------------------|----------------------------------------------------------------|------------------------------------------------------------------|--------------------------------------------------------|---------------------------------------|
| `identity` | Source value, unmodified. | `{"strategy":"identity","fieldName":["score"]}` | `Transforms.identity("score")` | `PARTITION BY score` |
| `hour` | Extract a timestamp hour, as hours from '1970-01-01 00:00:00'. | `{"strategy":"hour","fieldName":["createTime"]}` | `Transforms.hour("createTime")` | `PARTITION BY hour(createTime)` |
| `day` | Extract a date or timestamp day, as days from '1970-01-01'. | `{"strategy":"day","fieldName":["createTime"]}` | `Transforms.day("createTime")` | `PARTITION BY day(createTime)` |
| `month` | Extract a date or timestamp month, as months from '1970-01-01' | `{"strategy":"month","fieldName":["createTime"]}` | `Transforms.month("createTime")` | `PARTITION BY month(createTime)` |
| `year` | Extract a date or timestamp year, as years from 1970. | `{"strategy":"year","fieldName":["createTime"]}` | `Transforms.year("createTime")` | `PARTITION BY year(createTime)` |
| `bucket[N]` | Hash of value, mod N. | `{"strategy":"bucket","numBuckets":10,"fieldNames":[["score"]]}` | `Transforms.bucket(10, "score")` | `PARTITION BY bucket(10, score)` |
| `truncate[W]` | Value truncated to width W. | `{"strategy":"truncate","width":20,"fieldName":["score"]}` | `Transforms.truncate(20, "score")` | `PARTITION BY truncate(20, score)` |
| `list` | Partition the table by a list value. | `{"strategy":"list","fieldNames":[["createTime"],["city"]]}` | `Transforms.list(new String[] {"createTime", "city"})` | `PARTITION BY list(createTime, city)` |
| `range` | Partition the table by a range value. | `{"strategy":"range","fieldName":["createTime"]}` | `Transforms.range("createTime")` | `PARTITION BY range(createTime)` |

As well as the strategies mentioned before, you can use other functions strategies to partition the table, for example, the strategy can be `{"strategy":"functionName","fieldName":["score"]}`. The `functionName` can be any function name that you can use in SQL, for example, `{"strategy":"toDate","fieldName":["createTime"]}` is equivalent to `PARTITION BY toDate(createTime)` in SQL.
For complex functions, please refer to [FunctionPartitioningDTO](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/partitions/FunctionPartitioningDTO.java).
| Partitioning strategy | Description | JSON example | Java example | Equivalent SQL semantics |
|-----------------------|----------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------|---------------------------------------|
| `identity` | Source value, unmodified. | `{"strategy":"identity","fieldName":["score"]}` | `Transforms.identity("score")` | `PARTITION BY score` |
| `hour` | Extract a timestamp hour, as hours from '1970-01-01 00:00:00'. | `{"strategy":"hour","fieldName":["createTime"]}` | `Transforms.hour("createTime")` | `PARTITION BY hour(createTime)` |
| `day` | Extract a date or timestamp day, as days from '1970-01-01'. | `{"strategy":"day","fieldName":["createTime"]}` | `Transforms.day("createTime")` | `PARTITION BY day(createTime)` |
| `month` | Extract a date or timestamp month, as months from '1970-01-01' | `{"strategy":"month","fieldName":["createTime"]}` | `Transforms.month("createTime")` | `PARTITION BY month(createTime)` |
| `year` | Extract a date or timestamp year, as years from 1970. | `{"strategy":"year","fieldName":["createTime"]}` | `Transforms.year("createTime")` | `PARTITION BY year(createTime)` |
| `bucket[N]` | Hash of value, mod N. | `{"strategy":"bucket","numBuckets":10,"fieldNames":[["score"]]}` | `Transforms.bucket(10, "score")` | `PARTITION BY bucket(10, score)` |
| `truncate[W]` | Value truncated to width W. | `{"strategy":"truncate","width":20,"fieldName":["score"]}` | `Transforms.truncate(20, "score")` | `PARTITION BY truncate(20, score)` |
| `list` | Partition the table by a list value. | `{"strategy":"list","fieldNames":[["createTime"],["city"]]}` | `Transforms.list(new String[] {"createTime", "city"})` | `PARTITION BY list(createTime, city)` |
| `range` | Partition the table by a range value. | `{"strategy":"range","fieldName":["createTime"]}` | `Transforms.range("createTime")` | `PARTITION BY range(createTime)` |
| `function` | Partition the table by function expression. | `{"strategy":"function","funcName":"toYYYYMM","funcArgs":[{"type":"field","fieldName":["VisitDate"]}]}` | `Transforms.apply("toYYYYMM", new Expression[]{NamedReference.field("VisitDate")})` | `PARTITION BY toYYYYMM(VisitDate)` |

:::note
For function partitioning, you should provide the function name and the function arguments. The function arguments must be an [expression](./expression.md).
:::

- Field names: It defines which fields Gravitino uses to partition the table.

- In some cases, you require other information. For example, if the partitioning strategy is `bucket`, you should provide the number of buckets; if the partitioning strategy is `truncate`, you should provide the width of the truncate.

The following is an example of creating a partitioned table:

<Tabs>
<TabItem value="Json" label="Json">

```json
[
{
"strategy": "identity",
"fieldName": [
"score"
]
}
]
```

</TabItem>
<TabItem value="java" label="Java">

```java
new Transform[] {
// Partition by score
Transforms.identity("score")
}
```

</TabItem>
</Tabs>


## Table bucketing

To create a bucketed table, you should use the following three components to construct a valid bucketed table.

- Strategy. It defines how Gravitino distributes table data across partitions.

| Bucket strategy | Description | JSON | Java |
|-----------------|-------------------------------------------------------------------------------------------------------------------------------|----------|------------------|
| hash | Bucket table using hash. Gravitino distributes table data into buckets based on the hash value of the key. | `hash` | `Strategy.HASH` |
| range | Bucket table using range. Gravitino distributes table data into buckets based on a specified range or interval of values. | `range` | `Strategy.RANGE` |
| even | Bucket table using even. Gravitino distributes table data, ensuring an equal distribution of data. | `even` | `Strategy.EVEN` |

- Number. It defines how many buckets you use to bucket the table.
- Function arguments. It defines the arguments of the strategy, Gravitino supports the following three kinds of arguments, for more, you can refer to Java class [FunctionArg](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/expressions/FunctionArg.java) and [DistributionDTO](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/DistributionDTO.java) to use more complex function arguments.

| Expression type | JSON example | Java example | Equivalent SQL semantics | Description |
|-----------------|----------------------------------------------------------------|-------------------------------------------------------------------------------------------|--------------------------|-----------------------------------|
| field | `{"type":"field","fieldName":["score"]}` | `FieldReferenceDTO.of("score")` | `score` | The field reference value `score` |
| function | `{"type":"function","functionName":"hour","fieldName":["dt"]}` | `new FuncExpressionDTO.Builder().withFunctionName("hour").withFunctionArgs("dt").build()` | `hour(dt)` | The function value `hour(dt)` |
| constant | `{"type":"literal","value":10, "dataType": "integer"}` | `new LiteralDTO.Builder().withValue("10").withDataType(Types.IntegerType.get()).build()` | `10` | The integer literal `10` |
| Bucket strategy | Description | JSON | Java |
|-----------------|---------------------------------------------------------------------------------------------------------------------------|---------|------------------|
| hash | Bucket table using hash. Gravitino distributes table data into buckets based on the hash value of the key. | `hash` | `Strategy.HASH` |
| range | Bucket table using range. Gravitino distributes table data into buckets based on a specified range or interval of values. | `range` | `Strategy.RANGE` |
| even | Bucket table using even. Gravitino distributes table data, ensuring an equal distribution of data. | `even` | `Strategy.EVEN` |

- number. It defines how many buckets you use to bucket the table.
- funcArgs. It defines the arguments of the strategy, the argument must be an [expression](./expression.md).

<Tabs>
<TabItem value="Json" label="Json">
Expand Down Expand Up @@ -127,20 +92,20 @@ To define a sorted order table, you should use the following three components to
- Direction. It defines in which direction Gravitino sorts the table. The default value is `ascending`.

| Direction | Description | JSON | Java |
|------------|---------------------------------------------| ------ | -------------------------- |
|------------|---------------------------------------------|--------|----------------------------|
| ascending | Sorted by a field or a function ascending. | `asc` | `SortDirection.ASCENDING` |
| descending | Sorted by a field or a function descending. | `desc` | `SortDirection.DESCENDING` |

- Null ordering. It describes how to handle null values when ordering

| Null ordering Type | Description | JSON | Java |
|--------------------|-----------------------------------------| ------------- | -------------------------- |
|--------------------|-----------------------------------------|---------------|----------------------------|
| null_first | Puts the null value in the first place. | `nulls_first` | `NullOrdering.NULLS_FIRST` |
| null_last | Puts the null value in the last place. | `nulls_last` | `NullOrdering.NULLS_LAST` |

Note: If the direction value is `ascending`, the default ordering value is `nulls_first` and if the direction value is `descending`, the default ordering value is `nulls_last`.

- Sort term. It shows which field or function Gravitino uses to sort the table, please refer to the `Function arguments` in the table bucketing section.
- sortTerm. It shows which field or function Gravitino uses to sort the table, must be an [expression](./expression.md).

<Tabs>
<TabItem value="Json" label="Json">
Expand Down Expand Up @@ -168,7 +133,7 @@ SortOrders.of(FieldReferenceDTO.of("score"), SortDirection.ASCENDING, NullOrderi


:::tip
**Not all catalogs may support those features.**. Please refer to the related document for more details.
**Not all catalogs may support those features**. Please refer to the related document for more details.
:::

The following is an example of creating a partitioned, bucketed table, and sorted order table:
Expand Down
Loading