Sort-merge join and hash shuffles for MSQ. #13506
@@ -44,9 +44,10 @@
 import org.apache.druid.frame.key.ClusterBy;
 import org.apache.druid.frame.key.ClusterByPartition;
 import org.apache.druid.frame.key.ClusterByPartitions;
+import org.apache.druid.frame.key.KeyColumn;
+import org.apache.druid.frame.key.KeyOrder;
 import org.apache.druid.frame.key.RowKey;
 import org.apache.druid.frame.key.RowKeyReader;
-import org.apache.druid.frame.key.SortColumn;
 import org.apache.druid.frame.processor.FrameProcessorExecutor;
 import org.apache.druid.frame.processor.FrameProcessors;
 import org.apache.druid.indexer.TaskState;
@@ -130,12 +131,12 @@
 import org.apache.druid.msq.input.stage.StageInputSpecSlicer;
 import org.apache.druid.msq.input.table.TableInputSpec;
 import org.apache.druid.msq.input.table.TableInputSpecSlicer;
+import org.apache.druid.msq.kernel.GlobalSortTargetSizeShuffleSpec;
 import org.apache.druid.msq.kernel.QueryDefinition;
 import org.apache.druid.msq.kernel.QueryDefinitionBuilder;
 import org.apache.druid.msq.kernel.StageDefinition;
 import org.apache.druid.msq.kernel.StageId;
 import org.apache.druid.msq.kernel.StagePartition;
-import org.apache.druid.msq.kernel.TargetSizeShuffleSpec;
 import org.apache.druid.msq.kernel.WorkOrder;
 import org.apache.druid.msq.kernel.controller.ControllerQueryKernel;
 import org.apache.druid.msq.kernel.controller.ControllerStagePhase;
@@ -595,8 +596,8 @@ public void updatePartialKeyStatisticsInformation(int stageNumber, int workerNum
     final StageDefinition stageDef = queryKernel.getStageDefinition(stageId);
     final ObjectMapper mapper = MSQTasks.decorateObjectMapperForKeyCollectorSnapshot(
         context.jsonMapper(),
-        stageDef.getShuffleSpec().get().getClusterBy(),
-        stageDef.getShuffleSpec().get().doesAggregateByClusterKey()
+        stageDef.getShuffleSpec().clusterBy(),
+        stageDef.getShuffleSpec().doesAggregate()
     );

     final PartialKeyStatisticsInformation partialKeyStatisticsInformation;

Comment on lines +702 to +703: You hate the word "get"/"is"? "getClusterBy()", "isDoesAggregate()"?

Reply: I don't love them for holder objects that don't "do" anything. Just personal preference, I guess.
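The renamed accessors tie into that naming exchange: the new ShuffleSpec methods drop the JavaBean-style prefixes. As a rough, hypothetical sketch of the style being discussed (only the two method names visible in this hunk; not the full Druid ShuffleSpec interface):

```java
import org.apache.druid.frame.key.ClusterBy;

// Hypothetical sketch of the plain-noun accessor style discussed above; not the
// real org.apache.druid.msq.kernel.ShuffleSpec interface.
interface ShuffleSpecAccessorsSketch
{
  // Plain noun instead of getClusterBy(): the key this stage shuffles by.
  ClusterBy clusterBy();

  // Plain verb phrase instead of doesAggregateByClusterKey(): whether rows
  // sharing a key are aggregated during the shuffle.
  boolean doesAggregate();
}
```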
@@ -1361,7 +1362,7 @@ private static QueryDefinition makeQueryDefinition(

     if (MSQControllerTask.isIngestion(querySpec)) {
       shuffleSpecFactory = (clusterBy, aggregate) ->
-          new TargetSizeShuffleSpec(
+          new GlobalSortTargetSizeShuffleSpec(
               clusterBy,
               tuningConfig.getRowsPerSegment(),
               aggregate
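For context, the lambda above is a two-argument factory: given a cluster-by key and an aggregate flag, it builds a shuffle spec targeting rowsPerSegment rows per partition. A hypothetical sketch of that shape (the actual MSQ factory interface is not shown in this hunk and may differ):

```java
import org.apache.druid.frame.key.ClusterBy;
import org.apache.druid.msq.kernel.GlobalSortTargetSizeShuffleSpec;

// Hypothetical sketch only: mirrors the (clusterBy, aggregate) -> spec shape of
// the lambda in the hunk above; not the actual MSQ factory interface.
@FunctionalInterface
interface ShuffleSpecFactorySketch
{
  Object build(ClusterBy clusterBy, boolean aggregate);
}

class ShuffleSpecFactoryExample
{
  // Illustrative stand-in for tuningConfig.getRowsPerSegment(); value is arbitrary.
  static final int ROWS_PER_SEGMENT = 3_000_000;

  // Mirrors the ingestion branch above: a global-sort shuffle sized by rows per segment.
  static final ShuffleSpecFactorySketch FACTORY = (clusterBy, aggregate) ->
      new GlobalSortTargetSizeShuffleSpec(clusterBy, ROWS_PER_SEGMENT, aggregate);
}
```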
@@ -1583,7 +1584,7 @@ private static List<String> computeShardColumns(
       final ColumnMappings columnMappings
   )
   {
-    final List<SortColumn> clusterByColumns = clusterBy.getColumns();
+    final List<KeyColumn> clusterByColumns = clusterBy.getColumns();
     final List<String> shardColumns = new ArrayList<>();
     final boolean boosted = isClusterByBoosted(clusterBy);
     final int numShardColumns = clusterByColumns.size() - clusterBy.getBucketByCount() - (boosted ? 1 : 0);
@@ -1593,11 +1594,11 @@
     }

     for (int i = clusterBy.getBucketByCount(); i < clusterBy.getBucketByCount() + numShardColumns; i++) {
-      final SortColumn column = clusterByColumns.get(i);
+      final KeyColumn column = clusterByColumns.get(i);
       final List<String> outputColumns = columnMappings.getOutputColumnsForQueryColumn(column.columnName());

       // DimensionRangeShardSpec only handles ascending order.
-      if (column.descending()) {
+      if (column.order() != KeyOrder.ASCENDING) {
         return Collections.emptyList();
       }

@@ -1679,8 +1680,8 @@ private static Pair<List<DimensionSchema>, List<AggregatorFactory>> makeDimensio
     // Note: this doesn't work when CLUSTERED BY specifies an expression that is not being selected.
     // Such fields in CLUSTERED BY still control partitioning as expected, but do not affect sort order of rows
     // within an individual segment.
-    for (final SortColumn clusterByColumn : queryClusterBy.getColumns()) {
-      if (clusterByColumn.descending()) {
+    for (final KeyColumn clusterByColumn : queryClusterBy.getColumns()) {
+      if (clusterByColumn.order() == KeyOrder.DESCENDING) {
        throw new MSQException(new InsertCannotOrderByDescendingFault(clusterByColumn.columnName()));
      }

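The two hunks above swap SortColumn's boolean descending() for checks against the KeyOrder enum. A minimal standalone sketch of that pattern, assuming only the ClusterBy.getColumns() and KeyColumn.order() accessors that appear in this diff:

```java
import java.util.List;

import org.apache.druid.frame.key.ClusterBy;
import org.apache.druid.frame.key.KeyColumn;
import org.apache.druid.frame.key.KeyOrder;

// Illustrative helper, not part of the PR: mirrors the ascending-order validation
// pattern used in computeShardColumns above.
final class KeyOrderChecks
{
  private KeyOrderChecks() {}

  // Returns true only if every key column sorts ascending. With the old SortColumn
  // class this was a boolean descending() check; KeyOrder is an enum, so the
  // comparison is made against KeyOrder.ASCENDING explicitly.
  static boolean allAscending(final ClusterBy clusterBy)
  {
    final List<KeyColumn> columns = clusterBy.getColumns();
    for (final KeyColumn column : columns) {
      if (column.order() != KeyOrder.ASCENDING) {
        return false;
      }
    }
    return true;
  }
}
```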
@@ -2123,7 +2124,7 @@ private void startStages() throws IOException, InterruptedException
       segmentsToGenerate = generateSegmentIdsWithShardSpecs(
           (DataSourceMSQDestination) task.getQuerySpec().getDestination(),
           queryKernel.getStageDefinition(shuffleStageId).getSignature(),
-          queryKernel.getStageDefinition(shuffleStageId).getShuffleSpec().get().getClusterBy(),
+          queryKernel.getStageDefinition(shuffleStageId).getClusterBy(),
           partitionBoundaries,
           mayHaveMultiValuedClusterByFields
       );
This sort of thing would normally be delivered via a hint in the SQL; perhaps it's not too hard to deliver it that way?

Reply: Unfortunately, hints are not available in the version of Calcite we use. Newer versions have this: https://calcite.apache.org/docs/reference.html#sql-hints. @abhishekagarwal87 found some reasons we couldn't do it right now, as referenced by this comment: #13153 (comment). @abhishekagarwal87, would you mind writing up your notes as to what blocks an upgrade, and making an issue about that, titled something like "Upgrade Calcite past <whatever version introduced the blocking problem>"? That way, we have an issue we can refer to and use to discuss possible ways to fix the blockers. Once we sort that out, I'd like to deprecate the context parameter and move things to use hints (and eventually statistics as well) instead.

Reply: @gianm - Here it is: #13532.

Reply: Making some progress here, described on #13532. It will require another Calcite release, so I think it's best if we merge this sort-merge join PR first, then do the Calcite upgrade and add hints.