Skip to content

Commit

Permalink
feat: clarify behaviour of SetRel operations (#708)
Browse files Browse the repository at this point in the history
Clarified language of existing SetRel operations to distinguish ALL and
DISTINCT behaviour. Added language around ALL versions of operations to better
specify how many versions of a record should appear in the relation output.

feat: added SET_OP_MINUS_PRIMARY_ALL
feat: added SET_OP_INTERSECTION_MULTISET_ALL

These operations corresponds to SQL:
* EXCEPT ALL
* INTERSECT ALL
  • Loading branch information
kadinrabo authored Sep 27, 2024
1 parent 65a7d38 commit f796521
Show file tree
Hide file tree
Showing 2 changed files with 29 additions and 16 deletions.
2 changes: 2 additions & 0 deletions proto/substrait/algebra.proto
Original file line number Diff line number Diff line change
Expand Up @@ -345,9 +345,11 @@ message SetRel {
enum SetOp {
SET_OP_UNSPECIFIED = 0;
SET_OP_MINUS_PRIMARY = 1;
SET_OP_MINUS_PRIMARY_ALL = 7;
SET_OP_MINUS_MULTISET = 2;
SET_OP_INTERSECTION_PRIMARY = 3;
SET_OP_INTERSECTION_MULTISET = 4;
SET_OP_INTERSECTION_MULTISET_ALL = 8;
SET_OP_UNION_DISTINCT = 5;
SET_OP_UNION_ALL = 6;
}
Expand Down
43 changes: 27 additions & 16 deletions site/docs/relations/logical_relations.md
Original file line number Diff line number Diff line change
Expand Up @@ -268,14 +268,23 @@ The set operation encompasses several set-level operations that support combinin

The set operation type determines both the records that are emitted and the type of the output record.

| Property | Description | Output Nullability
| ----------------------- | ------------------------------------------------------------------------------------------------------------- | ----------------------------- |
| Minus (Primary) | Returns all records from the primary input excluding any matching records from secondary inputs. | The same as the primary input.
| Minus (Multiset) | Returns all records from the primary input excluding any records that are included in *all* secondary inputs. | The same as the primary input.
| Intersection (Primary) | Returns all records from the primary input that match at least one record from *any* secondary inputs. | If a field is nullable in the primary input and in any of the secondary inputs, it is nullable in the output.
| Intersection (Multiset) | Returns all records from the primary input that match at least one record from *all* secondary inputs. | If a field is required in any of the inputs, it is required in the output.
| Union Distinct | Returns all the records from each set, removing any rows that are duplicated (within or across sets). | If a field is nullable in any of the inputs, it is nullable in the output.
| Union All | Returns all records from each set, allowing duplicates. | If a field is nullable in any of the inputs, it is nullable in the output. |
For some set operations, whether a specific record is included in the output and if it appears more than once depends on the number of times it occurs across all inputs. In the following table, treat:
* m: the number of time a records occurs in the primary input (p)
* n1: the number of times a record occurs in the 1st secondary input (s1)
* n2: the number of times a record occurs in the 2nd secondary input (s2)
* ...
* n: the number of times a record occurs in the nth secondary input

| Operation | Description | Examples | Output Nullability
|-----------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------| -----------------------------
| Minus (Primary) | Returns all records from the primary input excluding any matching rows from secondary inputs, removing duplicates.<br/>Each value is treated as a unique member of the set, so duplicates in the first set don’t affect the result.<br/>This operation maps to SQL EXCEPT DISTINCT. | MINUS<br/>&nbsp;&nbsp;p: {1, 2, 2, 3, 3, 3, 4}<br/>&nbsp;&nbsp;s1: {1, 2}<br/>&nbsp;&nbsp;s2: {3}<br/>YIELDS<br/>{4} | The same as the primary input.
| Minus (Primary All) | Returns all records from the primary input excluding any matching records from secondary inputs.<br/>For each specific record returned, the output contains max(0, m - sum(n1, n2, …, n)) copies.<br/>This operation maps to SQL EXCEPT ALL. | MINUS ALL<br/>&nbsp;&nbsp;p: {1, 2, 2, 3, 3, 3, 3}<br/>&nbsp;&nbsp;s1: {1, 2, 3, 4}<br/>&nbsp;&nbsp;s2: {3}<br/>YIELDS<br/>{2, 3, 3} | The same as the primary input.
| Minus (Multiset) | Returns all records from the primary input excluding any records that are included in *all* secondary inputs.<br/>This operation does not have a direct SQL mapping. | MINUS MULTISET<br/>&nbsp;&nbsp;p: {1, 2, 3, 4}<br/>&nbsp;&nbsp;s1: {1, 2}<br/>&nbsp;&nbsp;s2: {1, 2, 3}<br/>YIELDS<br/>{3, 4} | The same as the primary input.
| Intersection (Primary) | Returns all records from the primary input that are present in any secondary input, removing duplicates.<br/>This operation does not have a direct SQL mapping. | INTERSECT<br/>&nbsp;&nbsp;p: {1, 2, 2, 3, 3, 3, 4}<br/>&nbsp;&nbsp;s1: {1, 2, 3, 5}<br/>&nbsp;&nbsp;s2: {2, 3, 6}<br/>YIELDS<br/>{1, 2, 3} | If a field is nullable in the primary input and in any of the secondary inputs, it is nullable in the output.
| Intersection (Multiset) | Returns all records from the primary input that match at least one record from *all* secondary inputs.<br/>This operation maps to SQL INTERSECT DISTINCT | INTERSECT MULTISET<br/>&nbsp;&nbsp;p: {1, 2, 3, 4}<br/>&nbsp;&nbsp;s1: {2, 3}<br/>&nbsp;&nbsp;s2: {3, 4}<br/>YIELDS<br/>{3} | If a field is required in any of the inputs, it is required in the output.
| Intersection (Multiset All) | Returns all records from the primary input that are present in every secondary input.<br/>For each specific record returned, the output contains min(m, n1, n2, …, n) copies.<br/>This operation maps to SQL INTERSECT ALL. | INTERSECT ALL<br/>&nbsp;&nbsp;p: {1, 2, 2, 3, 3, 3, 4}<br/>&nbsp;&nbsp;s1: {1, 2, 3, 3, 5}<br/>&nbsp;&nbsp;s2: {2, 3, 3, 6}<br/>YIELDS<br/>{2, 3, 3} | If a field is required in any of the inputs, it is required in the output.
| Union Distinct | Returns all records from each set, removing duplicates.<br/>This operation maps to SQL UNION DISTINCT. | UNION<br/>&nbsp;&nbsp;p: {1, 2, 2, 3, 3, 3, 4}<br/>&nbsp;&nbsp;s1: {2, 3, 5}<br/>&nbsp;&nbsp;s2: {1, 6}<br/>YIELDS<br/>{1, 2, 3, 4, 5, 6} | If a field is nullable in any of the inputs, it is nullable in the output.
| Union All | Returns all records from all inputs.<br/>For each specific record returned, the output contains (m + n1 + n2 + … + n) copies.<br/>This operation maps to SQL UNION ALL. | UNION ALL<br/>&nbsp;&nbsp;p: {1, 2, 2, 3, 3, 3, 4}<br/>&nbsp;&nbsp;s1: {2, 3, 5}<br/>&nbsp;&nbsp;s2: {1, 6}<br/>YIELDS<br/>{1, 2, 2, 3, 3, 3, 4, 2, 3, 5, 1, 6} | If a field is nullable in any of the inputs, it is nullable in the output.

Note that for set operations, NULL matches NULL. That is
```
Expand All @@ -294,14 +303,16 @@ Input 3: (R, N, R, N, R, N, R, N) Secondary Input

The output type is as follows for the various operations

| Property | Output Type
| ----------------------- | -----------------------------------------------------------------------------------------------------
| Minus (Primary) | (R, R, R, R, N, N, N, N)
| Minus (Multiset) | (R, R, R, R, N, N, N, N)
| Intersection (Primary) | (R, R, R, R, R, N, N, N)
| Intersection (Multiset) | (R, R, R, R, R, R, R, N)
| Union Distinct | (R, N, N, N, N, N, N, N)
| Union All | (R, N, N, N, N, N, N, N)
| Property | Output Type
|-----------------------------| -----------------------------------------------------------------------------------------------------
| Minus (Primary) | (R, R, R, R, N, N, N, N)
| Minus (Primary All) | (R, R, R, R, N, N, N, N)
| Minus (Multiset) | (R, R, R, R, N, N, N, N)
| Intersection (Primary) | (R, R, R, R, R, N, N, N)
| Intersection (Multiset) | (R, R, R, R, R, R, R, N)
| Intersection (Multiset All) | (R, R, R, R, R, R, R, N)
| Union Distinct | (R, N, N, N, N, N, N, N)
| Union All | (R, N, N, N, N, N, N, N)


=== "SetRel Message"
Expand Down

0 comments on commit f796521

Please sign in to comment.