diff --git a/docs/expression.md b/docs/expression.md new file mode 100644 index 00000000000..6f2379616de --- /dev/null +++ b/docs/expression.md @@ -0,0 +1,135 @@ +--- +title: "Expression system of Gravitino" +slug: /expression +date: 2024-02-02 +keyword: expression function field literal reference +license: Copyright 2024 Datastrato Pvt Ltd. This software is licensed under the Apache License version 2. +--- + +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; + +This page introduces the expression system of Gravitino. Expressions are a vital component of metadata definition. Through expressions, you can define default values for columns, function arguments for [function partitioning](./table-partitioning-bucketing-sort-order-indexes.md#table-partitioning) and [bucketing](./table-partitioning-bucketing-sort-order-indexes.md#table-bucketing), and sort terms for [sort ordering](./table-partitioning-bucketing-sort-order-indexes.md#sort-ordering) in tables. +The Gravitino expression system divides expressions into three basic types: field reference, literal, and function. Function expressions can contain field references, literals, and other function expressions. + +## Field reference + +A field reference is a reference to a field in a table. +The following example creates a field reference expression for the `student` field. + + + + +```json +[ + { + "type": "field", + "fieldName": [ + "student" + ] + } +] +``` + + + + +```java +NamedReference field = NamedReference.field("student"); +``` + + + + +## Literal + +A literal is a constant value. +The following example creates literal expressions of three different data types for the value `1024`. 
+ + + + +```json +[ + { + "type": "literal", + "dataType": "integer", + "value": "1024" + }, + { + "type": "literal", + "dataType": "string", + "value": "1024" + }, + { + "type": "literal", + "dataType": "decimal(10,2)", + "value": "1024" + } +] +``` + + + + +```java +Literal[] literals = + new Literal[] { + Literals.integerLiteral(1024), + Literals.stringLiteral("1024"), + Literals.decimalLiteral(Decimal.of("1024", 10, 2)) + }; +``` + + + + +## Function expression + +A function expression represents a function call with or without arguments. The arguments can be field references, literals, or other function expressions. +The following example creates function expressions for `rand()` and `date_trunc('year', birthday)`. + + + + +```json +[ + { + "type": "function", + "funcName": "rand", + "funcArgs": [] + }, + { + "type": "function", + "funcName": "date_trunc", + "funcArgs": [ + { + "type": "literal", + "dataType": "string", + "value": "year" + }, + { + "type": "field", + "fieldName": [ + "birthday" + ] + } + ] + } +] +``` + + + + +```java +FunctionExpression[] functionExpressions = + new FunctionExpression[] { + FunctionExpression.of("rand"), + FunctionExpression.of("date_trunc", Literals.stringLiteral("year"), NamedReference.field("birthday")) + }; +``` + + + + diff --git a/docs/table-partitioning-bucketing-sort-order-indexes.md b/docs/table-partitioning-bucketing-sort-order-indexes.md index 2d76543e6a6..d7895ae9505 100644 --- a/docs/table-partitioning-bucketing-sort-order-indexes.md +++ b/docs/table-partitioning-bucketing-sort-order-indexes.md @@ -22,76 +22,41 @@ To create a partitioned table, you should provide the following two components t The `score`, `createTime`, and `city` appearing in the table below refer to the field names in a table. 
::: -| Partitioning strategy | Description | JSON example | Java example | Equivalent SQL semantics | -|-----------------------|----------------------------------------------------------------|------------------------------------------------------------------|--------------------------------------------------------|---------------------------------------| -| `identity` | Source value, unmodified. | `{"strategy":"identity","fieldName":["score"]}` | `Transforms.identity("score")` | `PARTITION BY score` | -| `hour` | Extract a timestamp hour, as hours from '1970-01-01 00:00:00'. | `{"strategy":"hour","fieldName":["createTime"]}` | `Transforms.hour("createTime")` | `PARTITION BY hour(createTime)` | -| `day` | Extract a date or timestamp day, as days from '1970-01-01'. | `{"strategy":"day","fieldName":["createTime"]}` | `Transforms.day("createTime")` | `PARTITION BY day(createTime)` | -| `month` | Extract a date or timestamp month, as months from '1970-01-01' | `{"strategy":"month","fieldName":["createTime"]}` | `Transforms.month("createTime")` | `PARTITION BY month(createTime)` | -| `year` | Extract a date or timestamp year, as years from 1970. | `{"strategy":"year","fieldName":["createTime"]}` | `Transforms.year("createTime")` | `PARTITION BY year(createTime)` | -| `bucket[N]` | Hash of value, mod N. | `{"strategy":"bucket","numBuckets":10,"fieldNames":[["score"]]}` | `Transforms.bucket(10, "score")` | `PARTITION BY bucket(10, score)` | -| `truncate[W]` | Value truncated to width W. | `{"strategy":"truncate","width":20,"fieldName":["score"]}` | `Transforms.truncate(20, "score")` | `PARTITION BY truncate(20, score)` | -| `list` | Partition the table by a list value. | `{"strategy":"list","fieldNames":[["createTime"],["city"]]}` | `Transforms.list(new String[] {"createTime", "city"})` | `PARTITION BY list(createTime, city)` | -| `range` | Partition the table by a range value. 
| `{"strategy":"range","fieldName":["createTime"]}` | `Transforms.range("createTime")` | `PARTITION BY range(createTime)` | - -As well as the strategies mentioned before, you can use other functions strategies to partition the table, for example, the strategy can be `{"strategy":"functionName","fieldName":["score"]}`. The `functionName` can be any function name that you can use in SQL, for example, `{"strategy":"toDate","fieldName":["createTime"]}` is equivalent to `PARTITION BY toDate(createTime)` in SQL. -For complex functions, please refer to [FunctionPartitioningDTO](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/partitions/FunctionPartitioningDTO.java). +| Partitioning strategy | Description | JSON example | Java example | Equivalent SQL semantics | +|-----------------------|----------------------------------------------------------------|---------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------------------------|---------------------------------------| +| `identity` | Source value, unmodified. | `{"strategy":"identity","fieldName":["score"]}` | `Transforms.identity("score")` | `PARTITION BY score` | +| `hour` | Extract a timestamp hour, as hours from '1970-01-01 00:00:00'. | `{"strategy":"hour","fieldName":["createTime"]}` | `Transforms.hour("createTime")` | `PARTITION BY hour(createTime)` | +| `day` | Extract a date or timestamp day, as days from '1970-01-01'. | `{"strategy":"day","fieldName":["createTime"]}` | `Transforms.day("createTime")` | `PARTITION BY day(createTime)` | +| `month` | Extract a date or timestamp month, as months from '1970-01-01' | `{"strategy":"month","fieldName":["createTime"]}` | `Transforms.month("createTime")` | `PARTITION BY month(createTime)` | +| `year` | Extract a date or timestamp year, as years from 1970. 
| `{"strategy":"year","fieldName":["createTime"]}` | `Transforms.year("createTime")` | `PARTITION BY year(createTime)` | +| `bucket[N]` | Hash of value, mod N. | `{"strategy":"bucket","numBuckets":10,"fieldNames":[["score"]]}` | `Transforms.bucket(10, "score")` | `PARTITION BY bucket(10, score)` | +| `truncate[W]` | Value truncated to width W. | `{"strategy":"truncate","width":20,"fieldName":["score"]}` | `Transforms.truncate(20, "score")` | `PARTITION BY truncate(20, score)` | +| `list` | Partition the table by a list value. | `{"strategy":"list","fieldNames":[["createTime"],["city"]]}` | `Transforms.list(new String[] {"createTime", "city"})` | `PARTITION BY list(createTime, city)` | +| `range` | Partition the table by a range value. | `{"strategy":"range","fieldName":["createTime"]}` | `Transforms.range("createTime")` | `PARTITION BY range(createTime)` | +| `function` | Partition the table by function expression. | `{"strategy":"function","funcName":"toYYYYMM","funcArgs":[{"type":"field","fieldName":["VisitDate"]}]}` | `Transforms.apply("toYYYYMM", new Expression[]{NamedReference.field("VisitDate")})` | `PARTITION BY toYYYYMM(VisitDate)` | + +:::note +For function partitioning, you should provide the function name and the function arguments. The function arguments must be an [expression](./expression.md). +::: - Field names: It defines which fields Gravitino uses to partition the table. - In some cases, you require other information. For example, if the partitioning strategy is `bucket`, you should provide the number of buckets; if the partitioning strategy is `truncate`, you should provide the width of the truncate. 
-The following is an example of creating a partitioned table: - - - - -```json -[ - { - "strategy": "identity", - "fieldName": [ - "score" - ] - } -] -``` - - - - -```java -new Transform[] { - // Partition by score - Transforms.identity("score") - } -``` - - - - - ## Table bucketing To create a bucketed table, you should use the following three components to construct a valid bucketed table. - Strategy. It defines how Gravitino distributes table data across partitions. -| Bucket strategy | Description | JSON | Java | -|-----------------|-------------------------------------------------------------------------------------------------------------------------------|----------|------------------| -| hash | Bucket table using hash. Gravitino distributes table data into buckets based on the hash value of the key. | `hash` | `Strategy.HASH` | -| range | Bucket table using range. Gravitino distributes table data into buckets based on a specified range or interval of values. | `range` | `Strategy.RANGE` | -| even | Bucket table using even. Gravitino distributes table data, ensuring an equal distribution of data. | `even` | `Strategy.EVEN` | - -- Number. It defines how many buckets you use to bucket the table. -- Function arguments. It defines the arguments of the strategy, Gravitino supports the following three kinds of arguments, for more, you can refer to Java class [FunctionArg](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/expressions/FunctionArg.java) and [DistributionDTO](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/DistributionDTO.java) to use more complex function arguments. 
- | Expression type | JSON example | Java example | Equivalent SQL semantics | Description | -|-----------------|----------------------------------------------------------------|-------------------------------------------------------------------------------------------|--------------------------|-----------------------------------| -| field | `{"type":"field","fieldName":["score"]}` | `FieldReferenceDTO.of("score")` | `score` | The field reference value `score` | -| function | `{"type":"function","functionName":"hour","fieldName":["dt"]}` | `new FuncExpressionDTO.Builder().withFunctionName("hour").withFunctionArgs("dt").build()` | `hour(dt)` | The function value `hour(dt)` | -| constant | `{"type":"literal","value":10, "dataType": "integer"}` | `new LiteralDTO.Builder().withValue("10").withDataType(Types.IntegerType.get()).build()` | `10` | The integer literal `10` | +| Bucket strategy | Description | JSON | Java | +|-----------------|---------------------------------------------------------------------------------------------------------------------------|---------|------------------| +| hash | Bucket table using hash. Gravitino distributes table data into buckets based on the hash value of the key. | `hash` | `Strategy.HASH` | +| range | Bucket table using range. Gravitino distributes table data into buckets based on a specified range or interval of values. | `range` | `Strategy.RANGE` | +| even | Bucket table using even. Gravitino distributes table data, ensuring an equal distribution of data. | `even` | `Strategy.EVEN` | +- number. It defines how many buckets you use to bucket the table. +- funcArgs. It defines the arguments of the strategy. Each argument must be an [expression](./expression.md). @@ -127,20 +92,20 @@ To define a sorted order table, you should use the following three components to - Direction. It defines in which direction Gravitino sorts the table. The default value is `ascending`. 
| Direction | Description | JSON | Java | -|------------|---------------------------------------------| ------ | -------------------------- | +|------------|---------------------------------------------|--------|----------------------------| | ascending | Sorted by a field or a function ascending. | `asc` | `SortDirection.ASCENDING` | | descending | Sorted by a field or a function descending. | `desc` | `SortDirection.DESCENDING` | - Null ordering. It describes how to handle null values when ordering | Null ordering Type | Description | JSON | Java | -|--------------------|-----------------------------------------| ------------- | -------------------------- | +|--------------------|-----------------------------------------|---------------|----------------------------| | null_first | Puts the null value in the first place. | `nulls_first` | `NullOrdering.NULLS_FIRST` | | null_last | Puts the null value in the last place. | `nulls_last` | `NullOrdering.NULLS_LAST` | Note: If the direction value is `ascending`, the default ordering value is `nulls_first` and if the direction value is `descending`, the default ordering value is `nulls_last`. -- Sort term. It shows which field or function Gravitino uses to sort the table, please refer to the `Function arguments` in the table bucketing section. +- sortTerm. It defines which field or function Gravitino uses to sort the table. It must be an [expression](./expression.md). @@ -168,7 +133,7 @@ SortOrders.of(FieldReferenceDTO.of("score"), SortDirection.ASCENDING, NullOrderi :::tip -**Not all catalogs may support those features.**. Please refer to the related document for more details. +**Not all catalogs may support those features**. Please refer to the related document for more details. ::: The following is an example of creating a partitioned, bucketed table, and sorted order table: