Broadcast Manager is a Spark service to manage broadcast values in Spark jobs. It is created for a Spark application as part of SparkContext’s initialization and is a simple wrapper around BroadcastFactory.
Broadcast Manager tracks the number of broadcast values (using the internal field nextBroadcastId
).
The idea is to transfer values used in transformations from a driver to executors in a most effective way so they are copied once and used many times by tasks (rather than being copied every time a task is launched).
When BroadcastManager
is initialized an instance of BroadcastFactory
is created based on spark.broadcast.factory setting.
BroadcastFactory
is a pluggable interface for broadcast implementations in Spark. It is exclusively used and instantiated inside of BroadcastManager
to manage broadcast variables.
It comes with 4 methods:
-
def initialize(isDriver: Boolean, conf: SparkConf, securityMgr: SecurityManager): Unit
-
def newBroadcast[T: ClassTag](value: T, isLocal: Boolean, id: Long): Broadcast[T]
- called afterSparkContext.broadcast()
has been called. -
def unbroadcast(id: Long, removeFromDriver: Boolean, blocking: Boolean): Unit
-
def stop(): Unit
The BroadcastManager
implementation used in Spark by default is org.apache.spark.broadcast.TorrentBroadcast
(see spark.broadcast.factory). It uses a BitTorrent-like protocol to do the distribution.
TorrentBroadcastFactory
is the factory of TorrentBroadcast
-based broadcast values.
When a new broadcast value is created using SparkContext.broadcast()
method, a new instance of TorrentBroadcast
is created. It is divided into blocks that are put in Block Manager.
When spark.broadcast.compress is true
(default), compression is used.
There are the following compression codec implementations available:
-
lz4
ororg.apache.spark.io.LZ4CompressionCodec
-
lzf
ororg.apache.spark.io.LZFCompressionCodec
- a fallback whensnappy
is not available. -
snappy
ororg.apache.spark.io.SnappyCompressionCodec
- the default implementation
An implementation of CompressionCodec
trait has to offer a constructor that accepts SparkConf
.
-
spark.broadcast.factory
(default:org.apache.spark.broadcast.TorrentBroadcastFactory
) - the fully-qualified class name for the implementation ofBroadcastFactory
interface. -
spark.broadcast.compress
(default:true
) - a boolean value whether to use compression or not. See Compression. -
spark.io.compression.codec
(default:snappy
) - compression codec to use. See Compression. -
spark.broadcast.blockSize
(default:4m
) - the size of a block