layout	title	nav_order
page	Configuration	3

Spark Configurations for Gluten Plugin

Many configurations can affect the performance of the Gluten Plugin and can be fine-tuned in Spark. You can add them to spark-defaults.conf to enable or disable each setting.

Parameters	Description	Recommended Setting
spark.driver.extraClassPath	Adds the Gluten Plugin jar file to the Spark driver classpath.	/path/to/jar_file
spark.executor.extraClassPath	Adds the Gluten Plugin jar file to the Spark executor classpath.	/path/to/jar_file
spark.executor.memory	Sets how much memory the Spark executor uses.
spark.memory.offHeap.size	Sets how much memory is used for Java off-heap. Note that the Gluten Plugin uses this setting to allocate memory for native usage even if off-heap is disabled. The right value depends on your system; a larger value is recommended if you hit out-of-memory issues in the Gluten Plugin.	30G
spark.sql.sources.useV1SourceList	Choose to use V1 source	avro
spark.sql.join.preferSortMergeJoin	To turn off preferSortMergeJoin in Spark	false
spark.plugins	Loads Gluten's components through Spark's plug-in loader.	io.glutenproject.GlutenPlugin
spark.shuffle.manager	To turn on Gluten Columnar Shuffle Plugin	org.apache.spark.shuffle.sort.ColumnarShuffleManager
spark.gluten.enabled	Enables Gluten; the default is true. This is an experimental property only. Enabling or disabling Gluten through `spark.plugins` is recommended instead.	true
spark.gluten.memory.isolation	(Experimental) Enables isolated memory mode. If true, Gluten limits the maximum off-heap memory each task can use to X, where X = executor memory / max task slots. Setting it to true is recommended if Gluten serves concurrent queries within a single session, since not all memory Gluten allocates is guaranteed to be spillable; in that case, the feature should be enabled to avoid OOM. Note that when it is true, setting spark.memory.storageFraction to a lower value is suggested, since storage memory is considered non-usable by Gluten.	false
spark.gluten.sql.columnar.scanOnly	When enabled, this config overrides the enable settings of all other operators, and only Scan and Filter pushdown are offloaded to native.	false
spark.gluten.sql.columnar.batchscan	Enable or Disable Columnar BatchScan, default is true	true
spark.gluten.sql.columnar.hashagg	Enable or Disable Columnar Hash Aggregate, default is true	true
spark.gluten.sql.columnar.project	Enable or Disable Columnar Project, default is true	true
spark.gluten.sql.columnar.filter	Enable or Disable Columnar Filter, default is true	true
spark.gluten.sql.columnar.codegen.sort	Enable or Disable Columnar Sort, default is true	true
spark.gluten.sql.columnar.window	Enable or Disable Columnar Window, default is true	true
spark.gluten.sql.columnar.shuffledHashJoin	Enable or Disable ShuffledHashJoin, default is true	true
spark.gluten.sql.columnar.forceShuffledHashJoin	Force to use ShuffledHashJoin over SortMergeJoin, default is true	true
spark.gluten.sql.columnar.sort	Enable or Disable Columnar Sort, default is true	true
spark.gluten.sql.columnar.sortMergeJoin	Enable or Disable Columnar Sort Merge Join, default is true	true
spark.gluten.sql.columnar.union	Enable or Disable Columnar Union, default is true	true
spark.gluten.sql.columnar.expand	Enable or Disable Columnar Expand, default is true	true
spark.gluten.sql.columnar.broadcastExchange	Enable or Disable Columnar Broadcast Exchange, default is true	true
spark.gluten.sql.columnar.broadcastJoin	Enable or Disable Columnar BroadcastHashJoin, default is true	true
spark.gluten.sql.columnar.shuffle.codec	Sets the codec used for Columnar Shuffle. If this configuration is not set, the value of spark.io.compression.codec is used. By default, Gluten uses software compression. Valid options for software compression are lz4 and zstd. The valid option for QAT and IAA is gzip.	lz4
spark.gluten.sql.columnar.shuffle.codecBackend	Enables hardware accelerators for shuffle compression and decompression. Valid options are QAT and IAA.
spark.gluten.sql.columnar.shuffle.compressionMode	Sets the compression mode used in shuffle. Valid options are buffer and rowvector. The buffer option compresses each buffer of a RowVector individually into one pre-allocated large buffer. The rowvector option first copies each buffer of a RowVector to a large buffer and then compresses the entire buffer in one go.	buffer
spark.gluten.sql.columnar.shuffle.realloc.threshold	Set the threshold to dynamically adjust the size of shuffle split buffers. The size of each split buffer is recalculated for each incoming batch of data. If the new size deviates from the current partition buffer size by a factor outside the range of [1 - threshold, 1 + threshold], the split buffer will be re-allocated using the newly calculated size	0.25
spark.gluten.sql.columnar.numaBinding	Set up NUMABinding, default is false	true
spark.gluten.sql.columnar.coreRange	Sets the core range for NUMABinding; it only takes effect when numaBinding is set to true. The setting depends on the number of cores in your system. The recommended value shown here uses 72 cores as an example.	0-17,36-53 |18-35,54-71
spark.gluten.sql.native.bloomFilter	Enable or Disable native runtime bloom filter.	true
spark.gluten.sql.columnar.wholeStage.fallback.threshold	Sets the threshold that decides whether a whole stage falls back, in cases where AQE is supported, by counting the number of ColumnarToRow and vanilla leaf nodes.	>= 3
spark.gluten.sql.columnar.query.fallback.threshold	Sets the threshold that decides whether a query falls back, by counting the number of ColumnarToRow and vanilla leaf nodes.	>= 1
spark.gluten.sql.columnar.maxBatchSize	Set the number of rows for the output batch	4096
spark.gluten.shuffleWriter.bufferSize	Set the number of buffer rows for the shuffle writer	value of spark.gluten.sql.columnar.maxBatchSize
spark.gluten.loadLibFromJar	Controls whether to load dynamic link library from a packed jar for gluten/cpp. Not applicable to static build and clickhouse backend.	false
spark.gluten.sql.columnar.force.hashagg	Force to use hash agg to replace sort agg.	true
spark.gluten.sql.columnar.vanillaReaders	Enables vanilla Spark's vectorized reader. Note that it may add performance overhead due to extra data transitions. Disabling it is recommended if most queries can be fully offloaded to Gluten.	false
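
Two descriptions in the table involve simple arithmetic: the per-task cap X used by spark.gluten.memory.isolation and the re-allocation rule of spark.gluten.sql.columnar.shuffle.realloc.threshold. The following is a minimal sketch of both, not Gluten's actual implementation. It assumes that "executor memory" in the formula refers to the configured off-heap size and that "max task slots" equals executor cores divided by CPUs per task. The object and function names are hypothetical.

```scala
object GlutenTuningMath {

  /** Per-task cap under isolated memory mode: X = memory / max task slots.
    * Assumption: task slots = executor cores / cpus per task. */
  def perTaskCapBytes(memoryBytes: Long, executorCores: Int, taskCpus: Int): Long = {
    require(executorCores > 0 && taskCpus > 0, "cores and task cpus must be positive")
    val taskSlots = math.max(1, executorCores / taskCpus)
    memoryBytes / taskSlots
  }

  /** True when the factor proposedSize / currentSize falls outside
    * the range [1 - threshold, 1 + threshold]. */
  def needsRealloc(currentSize: Long, proposedSize: Long, threshold: Double): Boolean = {
    require(currentSize > 0, "current size must be positive")
    val factor = proposedSize.toDouble / currentSize
    factor < 1.0 - threshold || factor > 1.0 + threshold
  }

  def main(args: Array[String]): Unit = {
    val mib = 1024L * 1024L
    val gib = 1024L * mib

    // 30G shared by 8 task slots (8 cores, 1 CPU per task), printed in MiB.
    println(perTaskCapBytes(30L * gib, executorCores = 8, taskCpus = 1) / mib)

    // With threshold 0.25 the accepted factor range is [0.75, 1.25].
    println(needsRealloc(currentSize = 4096, proposedSize = 5000, threshold = 0.25)) // factor is about 1.22
    println(needsRealloc(currentSize = 4096, proposedSize = 6000, threshold = 0.25)) // factor is about 1.46
  }
}
```

Under these assumptions, raising the number of task slots lowers the per-task cap, which is one reason the table suggests a larger spark.memory.offHeap.size when out-of-memory issues appear.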

Below is an example spark-defaults.conf for the case where you install the OAP project with conda.

##### Columnar Process Configuration

spark.sql.sources.useV1SourceList    avro
spark.plugins    io.glutenproject.GlutenPlugin
spark.shuffle.manager    org.apache.spark.shuffle.sort.ColumnarShuffleManager
spark.driver.extraClassPath    ${GLUTEN_HOME}/package/target/gluten-<>-jar-with-dependencies.jar
spark.executor.extraClassPath    ${GLUTEN_HOME}/package/target/gluten-<>-jar-with-dependencies.jar
######
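
The same keys can also be passed when an application builds its own session. The sketch below assumes that Spark is on the application classpath and that the Gluten jar has already been added to the driver and executor classpaths at launch time, for example through the extraClassPath entries above. The application name and the choice of which keys to set are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

object GlutenSessionExample {
  def main(args: Array[String]): Unit = {
    // These keys must be set before the session is created;
    // they are read when the SparkContext starts.
    val spark = SparkSession.builder()
      .appName("gluten-config-example") // hypothetical application name
      .config("spark.plugins", "io.glutenproject.GlutenPlugin")
      .config("spark.shuffle.manager", "org.apache.spark.shuffle.sort.ColumnarShuffleManager")
      .config("spark.sql.sources.useV1SourceList", "avro")
      .config("spark.memory.offHeap.size", "30G")
      .config("spark.gluten.sql.columnar.shuffle.codec", "lz4")
      .getOrCreate()

    // Run a trivial query to confirm that the session starts with these settings.
    spark.sql("select 1").show()
    spark.stop()
  }
}
```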

Additionally, you can control Gluten at the thread level by setting a local property.

Parameters	Description	Recommended Setting
gluten.enabledForCurrentThread	Control the usage of gluten at thread level.	true

Below is an example of a Scala application that sets local properties.

##### Before executing the query, set local properties

sparkContext.setLocalProperty(key, value)
spark.sql("select * from demo_tables").show()
######
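
For a concrete version of the snippet above, the following sketch fills in the key from the table. It assumes that a session with the Gluten plugin loaded can be obtained, that a table named demo_tables exists, and that the string "false" turns Gluten off for the current thread (the table only lists true as the recommended value). The object name is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object ThreadLevelGlutenExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("thread-level-gluten") // hypothetical application name
      .getOrCreate()
    val sc = spark.sparkContext

    // Local properties apply to jobs submitted from the current thread.
    // Assumption: "false" disables Gluten for this thread's queries.
    sc.setLocalProperty("gluten.enabledForCurrentThread", "false")
    spark.sql("select * from demo_tables").show()

    // Switch Gluten back on for later queries from this thread.
    sc.setLocalProperty("gluten.enabledForCurrentThread", "true")
    spark.sql("select * from demo_tables").show()

    spark.stop()
  }
}
```

Because the property is set per thread, other threads sharing the same session keep their own setting.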
