layout | title | nav_order |
---|---|---|
page | Configuration | 3 |
Many configurations can affect the performance of the Gluten Plugin and can be fine-tuned in Spark. You can add them to spark-defaults.conf to enable or disable each setting.
Parameters | Description | Recommended Setting |
---|---|---|
spark.driver.extraClassPath | To add Gluten Plugin jar file in Spark Driver | /path/to/jar_file |
spark.executor.extraClassPath | To add Gluten Plugin jar file in Spark Executor | /path/to/jar_file |
spark.executor.memory | Sets how much memory the Spark Executor uses. | |
spark.memory.offHeap.size | Sets how much memory is used for Java off-heap. Note that the Gluten Plugin uses this setting to allocate memory for native usage even if off-heap is disabled. The value depends on your system; set it larger if you run into out-of-memory issues in the Gluten Plugin. | 30G |
spark.sql.sources.useV1SourceList | Choose to use V1 source | avro |
spark.sql.join.preferSortMergeJoin | To turn off preferSortMergeJoin in Spark | false |
spark.plugins | To load Gluten's components by Spark's plug-in loader | io.glutenproject.GlutenPlugin |
spark.shuffle.manager | To turn on Gluten Columnar Shuffle Plugin | org.apache.spark.shuffle.sort.ColumnarShuffleManager |
spark.gluten.enabled | Enable Gluten, default is true | true |
spark.gluten.sql.columnar.scanOnly | When enabled, this config overrides the enable settings of all other operators, and only Scan and Filter pushdown are offloaded to native. | false |
spark.gluten.sql.columnar.batchscan | Enable or Disable Columnar BatchScan, default is true | true |
spark.gluten.sql.columnar.hashagg | Enable or Disable Columnar Hash Aggregate, default is true | true |
spark.gluten.sql.columnar.project | Enable or Disable Columnar Project, default is true | true |
spark.gluten.sql.columnar.filter | Enable or Disable Columnar Filter, default is true | true |
spark.gluten.sql.columnar.codegen.sort | Enable or Disable Columnar Sort, default is true | true |
spark.gluten.sql.columnar.window | Enable or Disable Columnar Window, default is true | true |
spark.gluten.sql.columnar.shuffledHashJoin | Enable or Disable ShuffledHashJoin, default is true | true |
spark.gluten.sql.columnar.forceShuffledHashJoin | Force to use ShuffledHashJoin over SortMergeJoin, default is true | true |
spark.gluten.sql.columnar.sort | Enable or Disable Columnar Sort, default is true | true |
spark.gluten.sql.columnar.sortMergeJoin | Enable or Disable Columnar Sort Merge Join, default is true | true |
spark.gluten.sql.columnar.union | Enable or Disable Columnar Union, default is true | true |
spark.gluten.sql.columnar.expand | Enable or Disable Columnar Expand, default is true | true |
spark.gluten.sql.columnar.broadcastExchange | Enable or Disable Columnar Broadcast Exchange, default is true | true |
spark.gluten.sql.columnar.broadcastJoin | Enable or Disable Columnar BroadcastHashJoin, default is true | true |
spark.gluten.sql.columnar.shuffle.codec | Sets the codec used for Columnar Shuffle. If this configuration is not set, the value of spark.io.compression.codec is used. By default, Gluten uses software compression. Valid options for software compression are lz4 and zstd. The valid option for QAT and IAA is gzip. | lz4 |
spark.gluten.sql.columnar.shuffle.codecBackend | Enable using hardware accelerators for shuffle de/compression. Valid options are QAT and IAA. | |
spark.gluten.sql.columnar.shuffle.compressionMode | Sets the compression mode used in shuffle. Valid options are buffer and rowvector. The buffer option compresses each buffer of a RowVector individually into one pre-allocated large buffer; the rowvector option first copies each buffer of the RowVector into a large buffer and then compresses the entire buffer in one go. | buffer |
spark.gluten.sql.columnar.numaBinding | Set up NUMABinding, default is false | true |
spark.gluten.sql.columnar.coreRange | Sets the core range for NUMABinding; only takes effect when numaBinding is set to true. The setting depends on the number of cores in your system. The recommended setting uses a 72-core system as an example. | 0-17,36-53 |18-35,54-71 |
spark.gluten.sql.native.bloomFilter | Enable or Disable native runtime bloom filter. | true |
spark.gluten.sql.columnar.wholeStage.fallback.threshold | Threshold that decides whether a whole stage falls back when AQE is supported, based on the number of ColumnarToRow and vanilla leaf nodes | >= 3 |
spark.gluten.sql.columnar.query.fallback.threshold | Threshold that decides whether a query falls back, based on the number of ColumnarToRow and vanilla leaf nodes | >= 1 |
spark.gluten.sql.columnar.maxBatchSize | Set the number of rows for the output batch | 4096 |
spark.gluten.shuffleWriter.bufferSize | Set the number of buffer rows for the shuffle writer | value of spark.gluten.sql.columnar.maxBatchSize |
spark.gluten.loadLibFromJar | Controls whether to load dynamic link library from a packed jar for gluten/cpp. Not applicable to static build and clickhouse backend. | false |
spark.gluten.sql.columnar.force.hashagg | Force to use hash agg to replace sort agg. | true |
spark.gluten.sql.columnar.vanillaReaders | Enable vanilla Spark's vectorized reader. Please note that it may add performance overhead due to extra data transitions. We recommend disabling it if most queries can be fully offloaded to Gluten. | false |
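As a rough illustration of where the example value for `spark.gluten.sql.columnar.coreRange` could come from, the sketch below builds one core-range group per NUMA node. It rests on several assumptions that are not stated in the table: the 72 cores are 36 physical cores with two hardware threads each, split evenly over two NUMA nodes; sibling threads of core `i` are numbered `i` and `i + 36`; and `|` separates the per-node groups. The helper name `numa_core_ranges` is hypothetical and not part of Gluten. Check the real topology with a tool such as `lscpu` before using the result.

```python
def numa_core_ranges(physical_cores: int, numa_nodes: int, threads_per_core: int = 2) -> str:
    """Build a coreRange-style string with one group of core ranges per NUMA node.

    Assumption: hardware thread t of physical core i has the logical CPU id
    i + t * physical_cores, and cores are split evenly across NUMA nodes.
    """
    if physical_cores % numa_nodes != 0:
        raise ValueError("physical_cores must be divisible by numa_nodes")

    per_node = physical_cores // numa_nodes
    groups = []
    for node in range(numa_nodes):
        spans = []
        for thread in range(threads_per_core):
            # First logical CPU id of this node for the given hardware thread.
            start = thread * physical_cores + node * per_node
            spans.append(f"{start}-{start + per_node - 1}")
        groups.append(",".join(spans))
    # Per-node groups are joined with " |", matching the layout of the example value.
    return " |".join(groups)


print(numa_core_ranges(36, 2))  # 0-17,36-53 |18-35,54-71
```

Under these assumptions the output matches the recommended setting in the table. On a machine with a different CPU numbering scheme the ranges would need to be written by hand.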
Below is an example spark-defaults.conf, assuming you installed the OAP project with conda.
##### Columnar Process Configuration
spark.sql.sources.useV1SourceList avro
spark.plugins io.glutenproject.GlutenPlugin
spark.shuffle.manager org.apache.spark.shuffle.sort.ColumnarShuffleManager
spark.driver.extraClassPath ${GLUTEN_HOME}/package/target/gluten-<>-jar-with-dependencies.jar
spark.executor.extraClassPath ${GLUTEN_HOME}/package/target/gluten-<>-jar-with-dependencies.jar
######
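Before starting Spark, it can help to check that a configuration like the one above contains the entries Gluten relies on. The following is a minimal sketch, not an official tool. It assumes whitespace-separated key/value lines with `#` comments, and it treats the four keys from the example as required. The names `parse_spark_defaults` and `missing_gluten_keys` are hypothetical helpers.

```python
# Keys taken from the example configuration; treating them as "required"
# is an assumption made for this sketch.
REQUIRED_KEYS = (
    "spark.plugins",
    "spark.shuffle.manager",
    "spark.driver.extraClassPath",
    "spark.executor.extraClassPath",
)

EXAMPLE = """\
##### Columnar Process Configuration
spark.sql.sources.useV1SourceList avro
spark.plugins io.glutenproject.GlutenPlugin
spark.shuffle.manager org.apache.spark.shuffle.sort.ColumnarShuffleManager
spark.driver.extraClassPath ${GLUTEN_HOME}/package/target/gluten-<>-jar-with-dependencies.jar
spark.executor.extraClassPath ${GLUTEN_HOME}/package/target/gluten-<>-jar-with-dependencies.jar
######
"""


def parse_spark_defaults(text: str) -> dict[str, str]:
    """Parse whitespace-separated key/value lines, skipping blanks and # comments."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace only
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf


def missing_gluten_keys(text: str) -> list[str]:
    """Return the required keys that are absent or have an empty value."""
    conf = parse_spark_defaults(text)
    return [key for key in REQUIRED_KEYS if not conf.get(key)]


print(missing_gluten_keys(EXAMPLE))  # []
```

The parser only handles the whitespace-separated form used in the example; lines that use `=` as a separator would need extra handling.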